monotonicGenes

GSE152418, 34 samples, 18 females, 16 males, aged 23-91. One convalescent sample, and a mixed number of samples in moderate, severe, and ICU grades of COVID-19 infection symptoms. The previous files for how this data was obtained are ‘covid19_GSE152418_76patients_1.Rmd’, and ‘2’ and ‘3’ for updated versions. It says 76 patients, because I recopied some code from a prior study using GSE151161 on COVID-19 symptoms that are similar to Rheumatoid Arthritis patients that had 76 samples but actually only had 38 patients.

(8/22/2020) Note, most of this document assumed the convalescent sample was a convalescent misclassified patient and that all samples are convalescent patients, but this is completely wrong. Convalescent is a medical terminology, to mean those healing or someone who has headed or survived a disease and thus generated antibodies. This sample is classified as a healthy sample in the ML done mid document, because it technically is and the range of values in gene expression are in the range of those healthy. This adds a new angle to look at the data, especially the charts of scatter with the ages, and know that the vitamin and hormone genes looked up are not influenced by convalescent homes giving the patients medicine. They are regular (possible citizens and not incarcerated) patients agreeing to have their blood taken and experimented on. I would never have caught this unless I watched a very annoying video and had to listen to it but had to do some exploring of other diseases to get my mind going on other projects (lime disease and ‘ticks’) where the summary used the ‘convalescing blood’ or similar which made me look it up. Its like a pun that retards the meaning completely.

This script is to test the topic modeling algorithm,linear discriminant analysis (lda), on these wide matrices of genes once we convert them to such. Caret will be used for the lda model.

We can use our functions find25genes and the other geneCards2.R script has.

library(caret)
library(dplyr)
library(tidyr)

monotonicIncrease <- read.csv('monotonicIncrease.csv',sep=',',
                      na.strings=c('',' ','NA'),header=T, stringsAsFactors = F)
monotonicDecrease <- read.csv('monotonicDecrease.csv',sep=',',
                      na.strings=c('',' ','NA'),header=T, stringsAsFactors = F)

HeaderInformation <- read.csv('HeaderInformation.csv',sep=',',
                              header=T, na.strings=c('',' ','NA'),
                              stringsAsFactors = F)

head(HeaderInformation)

##       GSM_ID   CN_old   CN_new age gender diseaseState severity
## 1 GSM4615003 S061_257 healthy1  25      M      Healthy  Healthy
## 2 GSM4615006 S062_258 healthy2  70      F      Healthy  Healthy
## 3 GSM4615008 S063_259 healthy3  68      F      Healthy  Healthy
## 4 GSM4615011 S064_260 healthy4  69      M      Healthy  Healthy
## 5 GSM4615014 S065_261 healthy5  29      F      Healthy  Healthy
## 6 GSM4615016 S066_265 healthy6  90      M      Healthy  Healthy
##   geographicLocation cellType
## 1   Atlanta, GA, USA     PBMC
## 2   Atlanta, GA, USA     PBMC
## 3   Atlanta, GA, USA     PBMC
## 4   Atlanta, GA, USA     PBMC
## 5   Atlanta, GA, USA     PBMC
## 6   Atlanta, GA, USA     PBMC

monotonicIncreaseOrder10 <- monotonicIncrease[order(monotonicIncrease$ICU_health_foldChange,decreasing=T)[1:10],]

monotonicDecreaseOrder10 <- monotonicDecrease[order(monotonicDecrease$ICU_health_foldChange, decreasing=T)[1:10],]


colnames(monotonicIncreaseOrder10)

##  [1] "ENSEMBLID"              "convalescent"           "healthy1"              
##  [4] "healthy2"               "healthy3"               "healthy4"              
##  [7] "healthy5"               "healthy6"               "healthy7"              
## [10] "healthy8"               "healthy9"               "healthy10"             
## [13] "healthy11"              "healthy12"              "healthy13"             
## [16] "healthy14"              "healthy15"              "healthy16"             
## [19] "healthy17"              "moderate1"              "moderate2"             
## [22] "moderate3"              "moderate4"              "severe1"               
## [25] "severe2"                "severe3"                "severe4"               
## [28] "severe5"                "severe6"                "severe7"               
## [31] "severe8"                "ICU1"                   "ICU2"                  
## [34] "ICU3"                   "ICU4"                   "healthy_mean"          
## [37] "moderate_mean"          "severe_mean"            "ICU_mean"              
## [40] "mod_health_foldChange"  "sevr_health_foldChange" "ICU_health_foldChange" 
## [43] "monotonicIncrease"      "monotonicDecrease"

Lets only keep the actual samples for each class and set aside the convalescent sole sample.

convalescentDecr <- monotonicDecreaseOrder10$convalescent
convalescentIncr <- monotonicIncreaseOrder10$convalescent

ensemblID <- monotonicIncreaseOrder10$ENSEMBLID

monotonicIncrease2 <- monotonicIncreaseOrder10[,c(3:35)]
monotonicDecrease2 <- monotonicDecreaseOrder10[,c(3:35)]

sampleType <- c(rep('healthy',17),rep('moderate',4),
                rep('severe',8),rep('ICU',4))

Make the two data frames matrices that flip the samples in the header for the rows and genes as the header from the EnsemblID feature.

tIncr <- as.data.frame(t(monotonicIncrease2))
colnames(tIncr) <- ensemblID
tIncr$sample <- sampleType
IncreaseMx <- tIncr[,c(11,1:10)]

set.seed(8989)

inTrain <- createDataPartition(y=IncreaseMx$sample, p=0.7, list=FALSE)

trainingSet <- IncreaseMx[inTrain,]
testingSet <- IncreaseMx[-inTrain,]
dim(trainingSet)

## [1] 24 11

dim(testingSet)

## [1]  9 11

When using all 8088 genes as tokens for lda, it fails with accuracy and kappa values NA. So it was reduced to the top 10 highest fold change values in the ICU/healthy means. I origninally thought we were using the latent dirichlet allocation model of caret, but it is linear discriminant analysis of caret. This is why it fails. I do want to use the latent dirichlet allocation model, but it won’t be in caret. It is another R package.

IncreaseMx$sample <- as.factor(paste(IncreaseMx$sample))
set.seed(505005)

ldaMod <- train(sample~., method='lda', data=trainingSet,
                preProc = c("center", "scale"),
                trControl=trainControl(method='boot'))
predlda <- predict(ldaMod, testingSet)

DF_ldaI <- data.frame(predlda, type=testingSet$sample)
DF_ldaI

##   predlda     type
## 1 healthy  healthy
## 2 healthy  healthy
## 3 healthy  healthy
## 4 healthy  healthy
## 5 healthy  healthy
## 6  severe moderate
## 7 healthy   severe
## 8     ICU   severe
## 9     ICU      ICU

errored1 <- DF_ldaI[(DF_ldaI$predlda != DF_ldaI$type),]
errored1

##   predlda     type
## 6  severe moderate
## 7 healthy   severe
## 8     ICU   severe

rn <- as.numeric(row.names(errored1))
rn

## [1] 6 7 8

ldaError <- testingSet[rn,]
ldaError

##             sample ENSG00000133742 ENSG00000206177 ENSG00000158578
## moderate1 moderate               3               0               2
## severe1     severe               3               1              10
## severe7     severe             152             136            2513
##           ENSG00000223609 ENSG00000204010 ENSG00000169877 ENSG00000170180
## moderate1               1               3               0               4
## severe1                11               0               4               1
## severe7               330              12             100               3
##           ENSG00000143416 ENSG00000112077 ENSG00000076864
## moderate1               2               2               2
## severe1                 4               0               5
## severe7               270              14              17

Inc_lda_wrong <- row.names(ldaError)

lda_demg <- HeaderInformation[HeaderInformation$CN_new %in% row.names(ldaError),]
lda_demg

##        GSM_ID                  CN_old    CN_new age gender diseaseState
## 13 GSM4614986 S147_nCoV001EUHM-Draw-1 moderate1  75      F     COVID-19
## 14 GSM4614987 S149_nCoV002EUHM-Draw-2   severe1  54      M     COVID-19
## 26 GSM4614999 S178_nCoV028EUHM-Draw-1   severe7  56      M     COVID-19
##    severity geographicLocation cellType
## 13 Moderate   Atlanta, GA, USA     PBMC
## 14   Severe   Atlanta, GA, USA     PBMC
## 26   Severe   Atlanta, GA, USA     PBMC

The three samples misclassified by lda in our 10 most expressed genes are two males aged 54 and 56 and one female aged 75.

When using the lda on only 10 of the most expressed genes in ICU/healthy mean values and genes that increased in expression with each more severe grade of COVID-19, there was an improvement from 5/9 correct to 6/9 for 67% accuracy.

set.seed(605040)
rf_boot <- train(sample~., method='rf', 
               na.action=na.pass,
               data=(trainingSet),  preProc = c("center", "scale"),
               trControl=trainControl(method='boot'), number=5)

predRF_boot <- predict(rf_boot, testingSet)

DF_bootI <- data.frame(predRF_boot, type=testingSet$sample)


head(DF_bootI)

##   predRF_boot     type
## 1     healthy  healthy
## 2     healthy  healthy
## 3     healthy  healthy
## 4     healthy  healthy
## 5     healthy  healthy
## 6     healthy moderate

errored2 <- DF_bootI[(DF_bootI$predRF_boot != DF_bootI$type),]
errored2

##   predRF_boot     type
## 6     healthy moderate
## 7     healthy   severe
## 8         ICU   severe
## 9      severe      ICU

Lets see which samples created an error in our random forest model.

rn <- as.numeric(row.names(errored2))
rn

## [1] 6 7 8 9

rfError <- testingSet[rn,]
rfError

##             sample ENSG00000133742 ENSG00000206177 ENSG00000158578
## moderate1 moderate               3               0               2
## severe1     severe               3               1              10
## severe7     severe             152             136            2513
## ICU2           ICU            4672            2432           27287
##           ENSG00000223609 ENSG00000204010 ENSG00000169877 ENSG00000170180
## moderate1               1               3               0               4
## severe1                11               0               4               1
## severe7               330              12             100               3
## ICU2                 7961            1468            1924              43
##           ENSG00000143416 ENSG00000112077 ENSG00000076864
## moderate1               2               2               2
## severe1                 4               0               5
## severe7               270              14              17
## ICU2                 2790             105             496

RF_demg <- HeaderInformation[HeaderInformation$CN_new %in% row.names(rfError),]
RF_demg

##        GSM_ID                  CN_old    CN_new age gender diseaseState
## 13 GSM4614986 S147_nCoV001EUHM-Draw-1 moderate1  75      F     COVID-19
## 14 GSM4614987 S149_nCoV002EUHM-Draw-2   severe1  54      M     COVID-19
## 20 GSM4614993        S155_nCOV021EUHM      ICU2  60      F     COVID-19
## 26 GSM4614999 S178_nCoV028EUHM-Draw-1   severe7  56      M     COVID-19
##    severity geographicLocation cellType
## 13 Moderate   Atlanta, GA, USA     PBMC
## 14   Severe   Atlanta, GA, USA     PBMC
## 20      ICU   Atlanta, GA, USA     PBMC
## 26   Severe   Atlanta, GA, USA     PBMC

The random forest model on the 10 most expressed genes got 4/9 wrong in the classification with two females aged 60 and 75, and two males aged 54 and 56.

Inc_rf_wrong <- row.names(rfError)

Lets try lda with the 10 least expressed genes with monotonic increases across grades of COVID-19.

tDecr <- as.data.frame(t(monotonicDecrease2))
colnames(tDecr) <- ensemblID
tDecr$sample <- sampleType
DecreaseMx <- tDecr[,c(11,1:10)]

set.seed(8989)

inTrain <- createDataPartition(y=DecreaseMx$sample, p=0.7, list=FALSE)

trainingSet <- DecreaseMx[inTrain,]
testingSet <- DecreaseMx[-inTrain,]
dim(trainingSet)

## [1] 24 11

dim(testingSet)

## [1]  9 11

When using all 8088 genes as tokens for lda, it fails with accuracy and kappa values NA. So it was reduced to the top 10 highest fold change values in the ICU/healthy means.

DecreaseMx$sample <- as.factor(paste(DecreaseMx$sample))
set.seed(505005)

ldaMod <- train(sample~., method='lda', data=trainingSet,
                preProc = c("center", "scale"),
                trControl=trainControl(method='boot'))
predlda <- predict(ldaMod, testingSet)

DF_ldaD <- data.frame(predlda, type=testingSet$sample)
DF_ldaD

##   predlda     type
## 1 healthy  healthy
## 2 healthy  healthy
## 3 healthy  healthy
## 4 healthy  healthy
## 5     ICU  healthy
## 6  severe moderate
## 7     ICU   severe
## 8     ICU   severe
## 9  severe      ICU

errored3 <- DF_ldaD[(DF_ldaD$predlda != DF_ldaD$type),]
errored3

##   predlda     type
## 5     ICU  healthy
## 6  severe moderate
## 7     ICU   severe
## 8     ICU   severe
## 9  severe      ICU

rn <- as.numeric(row.names(errored3))
rn

## [1] 5 6 7 8 9

ldaError <- testingSet[rn,]
ldaError

##             sample ENSG00000133742 ENSG00000206177 ENSG00000158578
## healthy14  healthy              20              68               0
## moderate1 moderate            6807            1052               3
## severe1     severe             104             655               0
## severe7     severe            1131            1817               3
## ICU2           ICU             381             644               3
##           ENSG00000223609 ENSG00000204010 ENSG00000169877 ENSG00000170180
## healthy14               0               0               0             168
## moderate1              20               3               4           13112
## severe1                17               1               1            1573
## severe7               116               1               8            5163
## ICU2                   35               1               0            2891
##           ENSG00000143416 ENSG00000112077 ENSG00000076864
## healthy14               0               1               0
## moderate1              11               0               0
## severe1                 4               1               1
## severe7                 8               0               0
## ICU2                   12               1               2

decr_lda_wrong <- row.names(ldaError)

Lets use random forest on the decreasing genes data to see if it does any better to predict the ratings.

set.seed(605040)
rf_boot <- train(sample~., method='rf', 
               na.action=na.pass,
               data=(trainingSet),  preProc = c("center", "scale"),
               trControl=trainControl(method='boot'), number=5)

predRF_boot <- predict(rf_boot, testingSet)

DF_bootD <- data.frame(predRF_boot, type=testingSet$sample)

head(DF_bootD)

##   predRF_boot     type
## 1     healthy  healthy
## 2     healthy  healthy
## 3     healthy  healthy
## 4     healthy  healthy
## 5     healthy  healthy
## 6      severe moderate

errored4 <- DF_bootD[(DF_bootD$predRF_boot != DF_bootD$type),]
errored4

##   predRF_boot     type
## 6      severe moderate
## 8         ICU   severe
## 9      severe      ICU

Lets see which samples created an error in our random forest model.

rn <- as.numeric(row.names(errored4))
rn

## [1] 6 8 9

rfError <- testingSet[rn,]
rfError

##             sample ENSG00000133742 ENSG00000206177 ENSG00000158578
## moderate1 moderate            6807            1052               3
## severe7     severe            1131            1817               3
## ICU2           ICU             381             644               3
##           ENSG00000223609 ENSG00000204010 ENSG00000169877 ENSG00000170180
## moderate1              20               3               4           13112
## severe7               116               1               8            5163
## ICU2                   35               1               0            2891
##           ENSG00000143416 ENSG00000112077 ENSG00000076864
## moderate1              11               0               0
## severe7                 8               0               0
## ICU2                   12               1               2

decr_rf_wrong <- row.names(rfError)

There was an improvement with the random forest on the decreasing data set’s 10 genes with 6/9 correct. one sample each of moderate, severe and ICU were incorrect.

Lets see the age of those samples and gender with our header information.

RF_demg <- HeaderInformation[HeaderInformation$CN_new %in% row.names(rfError),]
RF_demg

##        GSM_ID                  CN_old    CN_new age gender diseaseState
## 13 GSM4614986 S147_nCoV001EUHM-Draw-1 moderate1  75      F     COVID-19
## 20 GSM4614993        S155_nCOV021EUHM      ICU2  60      F     COVID-19
## 26 GSM4614999 S178_nCoV028EUHM-Draw-1   severe7  56      M     COVID-19
##    severity geographicLocation cellType
## 13 Moderate   Atlanta, GA, USA     PBMC
## 20      ICU   Atlanta, GA, USA     PBMC
## 26   Severe   Atlanta, GA, USA     PBMC

Two females aged 60 and 75 and one male aged 56 were the misclassified samples in the random forest model using the 10 least expressed genes by fold change of ICU/healthy means.

Lets compare which samples were misclassified in our testing set for the 10 least and most expressed genes in rf and lda.

decr_lda_wrong;decr_rf_wrong; Inc_lda_wrong;Inc_rf_wrong

## [1] "healthy14" "moderate1" "severe1"   "severe7"   "ICU2"

## [1] "moderate1" "severe7"   "ICU2"

## [1] "moderate1" "severe1"   "severe7"

## [1] "moderate1" "severe1"   "severe7"   "ICU2"

Ever model misclassified the 1st and 7th severe cases, the 1st moderate case and the 2nd ICU2 case was misclassified in the 10 least expressed genes using lda and the 10 most expressed genes using rf. The 14th healthy case was misclassified in the 10 least expressed genes using the lda model. Lets see the demographics of those five samples.

HeaderInformation[HeaderInformation$CN_new %in% decr_lda_wrong,]

##        GSM_ID                  CN_old    CN_new age gender diseaseState
## 13 GSM4614986 S147_nCoV001EUHM-Draw-1 moderate1  75      F     COVID-19
## 14 GSM4614987 S149_nCoV002EUHM-Draw-2   severe1  54      M     COVID-19
## 20 GSM4614993        S155_nCOV021EUHM      ICU2  60      F     COVID-19
## 26 GSM4614999 S178_nCoV028EUHM-Draw-1   severe7  56      M     COVID-19
## 31 GSM4615034                S183_263 healthy14  23      F      Healthy
##    severity geographicLocation cellType
## 13 Moderate   Atlanta, GA, USA     PBMC
## 14   Severe   Atlanta, GA, USA     PBMC
## 20      ICU   Atlanta, GA, USA     PBMC
## 26   Severe   Atlanta, GA, USA     PBMC
## 31  Healthy   Atlanta, GA, USA     PBMC

This is where recall and precision would come in to play with precision based on number predicted disease over actual disease, and recall is the number of healthy misclassified as disease. We have three classes to choose from.

precision: correctly predicted true positive/(true positive + false positive) recall: correctly predicted true positive/(true positive + false negative)

The rf on 10 most:

DF_bootI

##   predRF_boot     type
## 1     healthy  healthy
## 2     healthy  healthy
## 3     healthy  healthy
## 4     healthy  healthy
## 5     healthy  healthy
## 6     healthy moderate
## 7     healthy   severe
## 8         ICU   severe
## 9      severe      ICU

#healthy precision and recall and accuracy:
5/(5+2)

## [1] 0.7142857

5/(5+0)

## [1] 1

(5+2)/(5+2+2+0)

## [1] 0.7777778

#moderate precision and recall and accuracy:
0/(1+0)

## [1] 0

0/(1)

## [1] 0

(0+8)/(0+8+1+0)

## [1] 0.8888889

#severe pecision and recall and accuracy
0/(2+1)

## [1] 0

0/(0+2)

## [1] 0

(0+7)/(0+6+2+1)

## [1] 0.7777778

#ICU precision and recall and accuracy
0/(0+1)

## [1] 0

0/(0+1)

## [1] 0

(0+7)/(0+7+1+1)

## [1] 0.7777778

Manually the precision, recall, and accuracy was calculated for the four classes in the random forest model on the top 10 genes with a testing subset of 9 samples. The recall on the healthy class was 100% because all healthy classes were correctly identified as true positives, and there were no false negatives or healthy classes misclassified as another class. The precision on the healthy class had 2 false positives, that classified 2 classes as healthy when they weren’t. The accuracy on the moderate class was 88%, but there was only 1 class for the modified and none of the predicted classes was a moderate sample So its true negative rate was high for not classifying any classes as moderate.

Lets make a function specific to our data frames to return the precision, recall, and accuracy of these four classes.

precisionRecallAccuracy <- function(df){
  
 colnames(df) <- c('pred','type')
  df$pred <- as.character(paste(df$pred))
  df$type <- as.character(paste(df$type))
  
 classes <- unique(df$type)
 
 class1a <- as.character(paste(classes[1]))
 class2a <- as.character(paste(classes[2]))
 class3a <- as.character(paste(classes[3]))
 class4a <- as.character(paste(classes[4]))
 
  #correct classes
  class1 <- subset(df, df$type==class1a)
  class2 <- subset(df, df$type==class2a)
  class3 <- subset(df, df$type==class3a)
  class4 <- subset(df, df$type==class4a)
  
  #incorrect classes
  notClass1 <- subset(df,df$type != class1a)
  notClass2 <- subset(df,df$type != class2a)
  notClass3 <- subset(df,df$type != class3a)
  notClass4 <- subset(df, df$type != class4a)
  
  #true positives (real positives predicted positive)
  tp_1 <- sum(class1$pred==class1$type)
  tp_2 <- sum(class2$pred==class2$type)
  tp_3 <- sum(class3$pred==class3$type)
  tp_4 <- sum(class4$pred==class4$type)
  
  #false positives (real negatives predicted positive)
  fp_1 <- sum(notClass1$pred==class1a)
  fp_2 <- sum(notClass2$pred==class2a)
  fp_3 <- sum(notClass3$pred==class3a)
  fp_4 <- sum(notClass4$pred==class4a)
  
  #false negatives (real positive predicted negative)
  fn_1 <- sum(class1$pred!=class1$type)
  fn_2 <- sum(class2$pred!=class2$type)
  fn_3 <- sum(class3$pred!=class3$type)
  fn_4 <- sum(class4$pred!=class4$type)
  
  #true negatives (real negatives predicted negative)
  tn_1 <- sum(notClass1$pred!=class1a)
  tn_2 <- sum(notClass2$pred!=class2a)
  tn_3 <- sum(notClass3$pred!=class3a)
  tn_4 <- sum(notClass4$pred!=class4a)
  
  
  #precision
  p1 <- tp_1/(tp_1+fp_1)
  p2 <- tp_2/(tp_2+fp_2)
  p3 <- tp_3/(tp_3+fp_3)
  p4 <- tp_4/(tp_4+fp_4)
  
  p1 <- ifelse(p1=='NaN',0,p1)
  p2 <- ifelse(p2=='NaN',0,p2)
  p3 <- ifelse(p3=='NaN',0,p3)
  p4 <- ifelse(p4=='NaN',0,p4)
  
  #recall
  r1 <- tp_1/(tp_1+fn_1)
  r2 <- tp_2/(tp_2+fn_2)
  r3 <- tp_3/(tp_3+fn_3)
  r4 <- tp_4/(tp_4+fn_4)
  
  r1 <- ifelse(r1=='NaN',0,r1)
  r2 <- ifelse(r2=='NaN',0,r2)
  r3 <- ifelse(r3=='NaN',0,r3)
  r4 <- ifelse(r4=='NaN',0,r4)
  
  #accuracy
  ac1 <- (tp_1+tn_1)/(tp_1+tn_1+fp_1+fn_1)
  ac2 <- (tp_2+tn_2)/(tp_2+tn_2+fp_2+fn_2)
  ac3 <- (tp_3+tn_3)/(tp_3+tn_3+fp_3+fn_3)
  ac4 <- (tp_4+tn_4)/(tp_4+tn_4+fp_4+fn_4)
  
  table <- as.data.frame(rbind(c(class1a,p1,r1,ac1),
                         c(class2a,p2,r2,ac2),
                         c(class3a,p3,r3,ac3),
                         c(class4a,p4,r4,ac4)))
  
  colnames(table) <- c('class','precision','recall','accuracy')
  
  return(table)
  
  
}

precisionRecallAccuracy(DF_bootI)

##      class         precision recall          accuracy
## 1  healthy 0.714285714285714      1 0.777777777777778
## 2 moderate                 0      0 0.888888888888889
## 3   severe                 0      0 0.666666666666667
## 4      ICU                 0      0 0.777777777777778

Note in the above results how accuracy can be much higher than precision and recall, because if there are 9 samples, and only one class of a sample, than not selecting that class only produces an error in accuracy of 1/9 for that class which is 11% error or 89% accuracy in prediction for that class. As we saw for the ‘moderate’ class when it wasn’t predicted at all but was only in one test sample to predict.

The rf on 10 least:

DF_bootD

##   predRF_boot     type
## 1     healthy  healthy
## 2     healthy  healthy
## 3     healthy  healthy
## 4     healthy  healthy
## 5     healthy  healthy
## 6      severe moderate
## 7      severe   severe
## 8         ICU   severe
## 9      severe      ICU

precisionRecallAccuracy(DF_bootD)

##      class         precision recall          accuracy
## 1  healthy                 1      1                 1
## 2 moderate                 0      0 0.888888888888889
## 3   severe 0.333333333333333    0.5 0.666666666666667
## 4      ICU                 0      0 0.777777777777778

In the above results, the ‘severe’ class only got 1 out of 3 predicted samples as ‘severe’ right so it received a 33% in precision, and out of all classes that were ‘severe’, it only got 1 out of 2 of them correct and missed one class so its recall was 50%.

The lda on 10 most:

DF_ldaI

##   predlda     type
## 1 healthy  healthy
## 2 healthy  healthy
## 3 healthy  healthy
## 4 healthy  healthy
## 5 healthy  healthy
## 6  severe moderate
## 7 healthy   severe
## 8     ICU   severe
## 9     ICU      ICU

precisionRecallAccuracy(DF_ldaI)

##      class         precision recall          accuracy
## 1  healthy 0.833333333333333      1 0.888888888888889
## 2 moderate                 0      0 0.888888888888889
## 3   severe                 0      0 0.666666666666667
## 4      ICU               0.5      1 0.888888888888889

The lda on 10 least:

DF_ldaD

##   predlda     type
## 1 healthy  healthy
## 2 healthy  healthy
## 3 healthy  healthy
## 4 healthy  healthy
## 5     ICU  healthy
## 6  severe moderate
## 7     ICU   severe
## 8     ICU   severe
## 9  severe      ICU

precisionRecallAccuracy(DF_ldaD)

##      class precision recall          accuracy
## 1  healthy         1    0.8 0.888888888888889
## 2 moderate         0      0 0.888888888888889
## 3   severe         0      0 0.555555555555556
## 4      ICU         0      0 0.555555555555556

We know that we need to eliminate some of these genes from our most and least expressed gene set of 10 genes. We will do that later and test our prediction accuracy.

This script was designed to see if the lda model would allow a very wide and short data frame or matrix to be used as a tokenized natural language processing data frame but for genes. Unfortunately, I used the linear discriminant analysis (lda) model of caret and so it would fail because of the curse of dimensionality and having more features than samples. Right from the start our 8089X34 dimension matrix failed the lda modeling for missing accuracy and kappa values. I didn’t tune the model at all or use any hyperparameters other than the center and scaling features of the preprocessing attribute in caret. It could possibly work if those hyperparameters were tuned. But the huge curse of dimensionality type matrix will never work in linear discriminant analysis. It would divide 8800+ features from one data frame into 4 groups with linear boundaries. Imagine how it fails. Latent dirichlet allocation is an NLP algorithm, thats for natural language processing, and this lda kind of model could use in simulating our gene numeric data as integers for counts of ‘words’ known as tokens instead of weighted counts like the Term Frequency-Inverse Document Frequency use of words. https://www.tidytextmining.com/topicmodeling.html is a good source to test out the LDA. I have used it before in other scripts, but it has been a while and thought caret had the package from previous work and information on caret. When I wanted to use linear discriminant analysis I couldn’t find it and used Latent Dirichlet Allocation. Now when I want to use latent dirichlet allocation it’s not the package I thought it was.

Instead, we explored the 10 most and least expressed genes that are also increasing or decreasing monotonically from lowest severe case of COVID-19 to most severe case of COVID-19. The machine learning classification was about the same as our other work immediately preceding this script that used the 20 most and least expressed genes by fold change only. Also, this script didn’t use the functions in our geneFunctions2.R script to merge these genes with the Human Genome Nomenclature (HGNC) gene symbol we are more familiar with instead it kept the Ensembl gene symbol.

We should look at this data in Tableau and see if some genes pop out as useful or not, and also consider those samples that threw off the classifications, and possibly tuning or testing out on different hyperparameters of our ML models to see if we can improve. But doing the resampling could lead to overfitting if we generalize too much it won’t predict the classes well.

Accuracy of the model overall could be improved, the healthy samples naturally had a better precision and recall as they had more samples out of all the classes. A few samples in particular were of two females and a male that were 56 or older, we could try leaving those samples out to see if they are outliers or data that skews the mean values and ultimately the fold change values.

The topics are defined from a library to to use Latent Dirichlet Allocation, the same wide matrices can be used with a grid search and tokenized words used in recommender systems and sentiment review but with random forest, neural nets, and gradient boosted machines. These caret algorithms were very fast compared to how they have behaved with my work in the past. I recommend trying out Latent Dirichlet to model the classes as topics or using the tokenized formats of sentiment analysis and recommender systems’ machine learning algorithms. My next extension of this study will explore those areas. Possibly with Python using reticulate to access Python in R.

I made a couple of Tableau charts and uploaded them to the Tableau Public Server to share.

The first chart is a comparison of the mean values across all three classes of COVID-19 with a highlighted fold change value of the ratio of disease to healthy means and are monotonically decreasing with fold change values at least 30.

Tableau External Window

monotonically Decreasing Genes

The next chart is a chart of the most monotonically increasing fold change expression genes with at least 30 in fold change.

Tableau External Window

monotonically increasing genes

Each link will take you to the image’s chart to hover the details. What is missing is the details of the Ensembl ID’s HGNC gene symbol with annotations for what those genes do. Lets import those charts data that was done earlier and saved to this document’s folder contents.

mDecr <- read.csv('monotonicDecrease_Full_Data_data.csv',sep=',',
                  header=T, na.strings=c('',' ','NA'),
                  stringsAsFactors = F)
head(mDecr)

##   ï..Moderate.Mean Severe.Mean ICU.mean       Ensemblid Healthy.Mean
## 1          2929.50     877.625   443.00 ENSG00000282122   45.3529400
## 2            11.75       7.625     3.50 ENSG00000188403    0.2352941
## 3          1644.00      68.000    41.25 ENSG00000253818   13.0588200
## 4           143.50     143.000    27.25 ENSG00000253709    1.4705880
## 5          1490.00     238.000   209.00 ENSG00000211644   24.2352900
## 6          4823.00    1991.750  1405.75 ENSG00000211950   52.6470600
##   ICU.health.foldChange mod.health.foldChange sevr.health.foldChange
## 1              9.767834              64.59339              19.351010
## 2             14.875000              49.93750              32.406250
## 3              3.158784             125.89190               5.207207
## 4             18.530000              97.58000              97.240000
## 5              8.623786              61.48058               9.820388
## 6             26.701400              91.61006              37.832120

mIncr <- read.csv('monotonicIncrease_Full_Data_data.csv',sep=',',
                  header=T, na.strings=c('',' ','NA'),
                  stringsAsFactors = F)
head(mIncr)

##   ï..Moderate.Mean Severe.Mean ICU.mean       Ensemblid Healthy.Mean
## 1            32.00      51.000   157.00 ENSG00000076864     1.647059
## 2           261.25     338.750  1049.75 ENSG00000143416     8.411765
## 3           300.00     572.125  1989.25 ENSG00000211625    37.294118
## 4             5.00     597.625   600.50 ENSG00000087116     7.352941
## 5             8.25      39.250   157.25 ENSG00000138772     4.941176
## 6            99.25     539.125  1003.00 ENSG00000282651    28.176471
##   ICU.health.foldChange mod.health.foldChange sevr.health.foldChange
## 1              95.32143             19.428571              30.964286
## 2             124.79545             31.057692              40.270979
## 3              53.33951              8.044164              15.340891
## 4              81.66800              0.680000              81.277000
## 5              31.82440              1.669643               7.943452
## 6              35.59708              3.522443              19.133873

Lets change the 1st column name to ‘Moderate.Mean’ in both data frames to remove that symbol changed in downloading the Tableau chart data.

colnames(mDecr)[1] <- 'Moderate.Mean'
colnames(mIncr)[1] <- 'Moderate.Mean'

Now lets source our functions to grab the gene symbols from genecards.org and the annotations.

source('geneCards2.R')

## Warning: package 'rvest' was built under R version 3.6.3

## Loading required package: xml2

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

Lets make a list of the Ensembl genes from both tables. They are not the same genes because if you are monotonically increasing as a gene, you cannot also monotonically decrease from least to most severe cases of COVID-19 sampled fold change values.

decrList <- as.character(mDecr$Ensemblid)
incrList <- as.character(mIncr$Ensemblid)

geneList <- as.character(c(decrList,incrList))
geneList

##  [1] "ENSG00000282122" "ENSG00000188403" "ENSG00000253818" "ENSG00000253709"
##  [5] "ENSG00000211644" "ENSG00000211950" "ENSG00000165949" "ENSG00000076864"
##  [9] "ENSG00000143416" "ENSG00000211625" "ENSG00000087116" "ENSG00000138772"
## [13] "ENSG00000282651" "ENSG00000069535" "ENSG00000170439" "ENSG00000004939"

for (i in geneList){
  find25genes(i)
}

files <- list.files('./gene scrapes')[1:16]
files

for (i in files){
 file <- paste('./gene scrapes',i, sep='/')
 t <- read.csv(file,sep=',',header=F)
 write.table(t,file='monotonic1genes.csv',sep=',',append=T,col.names=F,row.names=F)
}

monotonic1genes <- read.delim('monotonic1genes.csv',header=F,sep=',',na.strings=c('',' ','NA'))
monotonic1genes$V2 <- toupper(monotonic1genes$V2)
colnames(monotonic1genes) <- c('gene','EnsemblGene','DateSourced')
monotonic1genes <- monotonic1genes[,-3]

head(monotonic1genes)

##       gene     EnsemblGene
## 1   SLC4A1 ENSG00000004939
## 2     MAOB ENSG00000069535
## 3  RAP1GAP ENSG00000076864
## 4  ADAMTS2 ENSG00000087116
## 5    ANXA3 ENSG00000138772
## 6 SELENBP1 ENSG00000143416

We will extract the gene summaries now.

for (i in monotonic1genes$gene){
  getSummaries(i,'PBMC')
}

getGeneSummaries('PBMC')

monotonicSumms <- read.csv("proteinGeneSummaries_pbmc.csv",sep=',',
                           header=T, na.strings=c('',' ','NA'),
                           stringsAsFactors = F)
monotonicSumms <- monotonicSumms[,2:5]

head(monotonicSumms)

##       gene
## 1   SLC4A1
## 2     MAOB
## 3  RAP1GAP
## 4  ADAMTS2
## 5    ANXA3
## 6 SELENBP1
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            EntrezSummary
## 1 The protein encoded by this gene is part of the anion exchanger (AE) family and is expressed in the erythrocyte plasma membrane, where it functions as a chloride/bicarbonate exchanger involved in carbon dioxide transport from tissues to lungs. The protein comprises two domains that are structurally and functionally distinct. The N-terminal 40kDa domain is located in the cytoplasm and acts as an attachment site for the red cell skeleton by binding ankyrin. The glycosylated C-terminal membrane-associated domain contains 12-14 membrane spanning segments and carries out the stilbene disulphonate-sensitive exchange transport of anions. The cytoplasmic tail at the extreme C-terminus of the membrane domain binds carbonic anhydrase II. The encoded protein associates with the red cell membrane protein glycophorin A and this association promotes the correct folding and translocation of the exchanger. This protein is predominantly dimeric but forms tetramers in the presence of ankyrin. Many mutations in this gene are known in man, and these mutations can lead to two types of disease: destabilization of red cell membrane leading to hereditary spherocytosis, and defective kidney acid secretion leading to distal renal tubular acidosis. Other mutations that do not give rise to disease result in novel blood group antigens, which form the Diego blood group system. Southeast Asian ovalocytosis (SAO, Melanesian ovalocytosis) results from the heterozygous presence of a deletion in the encoded protein and is common in areas where Plasmodium falciparum malaria is endemic. One null mutation in this gene is known, resulting in very severe anemia and nephrocalcinosis. [provided by RefSeq, Jul 2008]
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The protein encoded by this gene belongs to the flavin monoamine oxidase family. It is a enzyme located in the mitochondrial outer membrane. It catalyzes the oxidative deamination of biogenic and xenobiotic amines and plays an important role in the metabolism of neuroactive and vasoactive amines in the central nervous sysytem and peripheral tissues. This protein preferentially degrades benzylamine and phenylethylamine. [provided by RefSeq, Jul 2008]
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    This gene encodes a type of GTPase-activating-protein (GAP) that down-regulates the activity of the ras-related RAP1 protein. RAP1 acts as a molecular switch by cycling between an inactive GDP-bound form and an active GTP-bound form. The product of this gene, RAP1GAP, promotes the hydrolysis of bound GTP and hence returns RAP1 to the inactive state whereas other proteins, guanine nucleotide exchange factors (GEFs), act as RAP1 activators by facilitating the conversion of RAP1 from the GDP- to the GTP-bound form. In general, ras subfamily proteins, such as RAP1, play key roles in receptor-linked signaling pathways that control cell growth and differentiation. RAP1 plays a role in diverse processes such as cell proliferation, adhesion, differentiation, and embryogenesis. Alternative splicing results in multiple transcript variants encoding distinct proteins. [provided by RefSeq, Aug 2011]
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      This gene encodes a member of the ADAMTS (a disintegrin and metalloproteinase with thrombospondin motifs) protein family. Members of the family share several distinct protein modules, including a propeptide region, a metalloproteinase domain, a disintegrin-like domain, and a thrombospondin type 1 (TS) motif. Individual members of this family differ in the number of C-terminal TS motifs, and some have unique C-terminal domains. The encoded preproprotein is proteolytically processed to generate the mature procollagen N-proteinase. This proteinase excises the N-propeptide of the fibrillar procollagens types I-III and type V. Mutations in this gene cause Ehlers-Danlos syndrome type VIIC, a recessively inherited connective-tissue disorder. Alternative splicing results in multiple transcript variants, at least one of which encodes an isoform that is proteolytically processed. [provided by RefSeq, Feb 2016]
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          This gene encodes a member of the annexin family.  Members of this calcium-dependent phospholipid-binding protein family play a role in the regulation of cellular growth and in signal transduction pathways.  This protein functions in the inhibition of phopholipase A2 and cleavage of inositol 1,2-cyclic phosphate to form inositol 1-phosphate. This protein may also play a role in anti-coagulation. [provided by RefSeq, Jul 2008]
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        This gene encodes a member of the selenium-binding protein family. Selenium is an essential nutrient that exhibits potent anticarcinogenic properties, and deficiency of selenium may cause certain neurologic diseases. The effects of selenium in preventing cancer and neurologic diseases may be mediated by selenium-binding proteins, and decreased expression of this gene may be associated with several types of cancer. The encoded protein may play a selenium-dependent role in ubiquitination/deubiquitination-mediated protein degradation. Alternatively spliced transcript variants encoding multiple isoforms have been observed for this gene. [provided by RefSeq, Apr 2012]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             GeneCardsSummary
## 1                                             SLC4A1 (Solute Carrier Family 4 Member 1 (Diego Blood Group)) is a Protein Coding gene.                                            Diseases associated with SLC4A1 include Renal Tubular Acidosis, Distal, Autosomal Dominant and Cryohydrocytosis.                                            Among its related pathways are Transport of glucose and other sugars, bile salts and organic acids, metal ions and amine compounds and Neuroscience.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and transporter activity.                                            An important paralog of this gene is SLC4A2.
## 2                                                                                                          MAOB (Monoamine Oxidase B) is a Protein Coding gene.                                            Diseases associated with MAOB include Norrie Disease and Pathological Gambling.                                            Among its related pathways are Tyrosine metabolism and Activated PKN1 stimulates transcription of AR (androgen receptor) regulated genes KLK2 and KLK3.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and electron transfer activity.                                            An important paralog of this gene is MAOA.
## 3                                                                                                                       RAP1GAP (RAP1 GTPase Activating Protein) is a Protein Coding gene.                                            Diseases associated with RAP1GAP include Tuberous Sclerosis and Bleeding Disorder, Platelet-Type, 18.                                            Among its related pathways are Ras signaling pathway and Development Angiotensin activation of ERK.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and GTPase activator activity.                                            An important paralog of this gene is RAP1GAP2.
## 4                                                                                                  ADAMTS2 (ADAM Metallopeptidase With Thrombospondin Type 1 Motif 2) is a Protein Coding gene.                                            Diseases associated with ADAMTS2 include Ehlers-Danlos Syndrome, Dermatosparaxis Type and Ehlers-Danlos Syndrome.                                            Among its related pathways are Degradation of the extracellular matrix and Metabolism of proteins.                                            Gene Ontology (GO) annotations related to this gene include peptidase activity and metallopeptidase activity.                                            An important paralog of this gene is ADAMTS3.
## 5                                                                                                                                                                                                         ANXA3 (Annexin A3) is a Protein Coding gene.                                            Diseases associated with ANXA3 include Ovarian Cancer and Prostate Cancer.                                            Among its related pathways are Prostaglandin Synthesis and Regulation.                                            Gene Ontology (GO) annotations related to this gene include calcium ion binding and calcium-dependent phospholipid binding.                                            An important paralog of this gene is ANXA4.
## 6                                                                                                                                                                                                SELENBP1 (Selenium Binding Protein 1) is a Protein Coding gene.                                            Diseases associated with SELENBP1 include Extraoral Halitosis Due To Methanethiol Oxidase Deficiency and Posteroinferior Myocardial Infarction.                                            Among its related pathways are IL1 and megakaryocytes in obesity and Metabolism.                                            Gene Ontology (GO) annotations related to this gene include selenium binding.                                            
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  UniProtKB_Summary
## 1 Functions both as a transporter that mediates electroneutral anion exchange across the cell membrane and as a structural protein. Major integral membrane glycoprotein of the erythrocyte membrane; required for normal flexibility and stability of the erythrocyte membrane and for normal erythrocyte shape via the interactions of its cytoplasmic domain with cytoskeletal proteins, glycolytic enzymes, and hemoglobin. Functions as a transporter that mediates the 1:1 exchange of inorganic anions across the erythrocyte membrane. Mediates chloride-bicarbonate exchange in the kidney, and is required for normal acidification of the urine.\n                         B3AT_HUMAN,P02730\n                         
## 2                                                                                                                                                                                                                                                                                                                                                                             Catalyzes the oxidative deamination of biogenic and xenobiotic amines and has important functions in the metabolism of neuroactive and vasoactive amines in the central nervous system and peripheral tissues. MAOB preferentially degrades benzylamine and phenylethylamine.\n                         AOFB_HUMAN,P27338\n                         
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               GTPase activator for the nuclear Ras-related regulatory protein RAP-1A (KREV-1), converting it to the putatively inactive GDP-bound state.\n                         RPGP1_HUMAN,P47736\n                         
## 4                                                                                                                                                                                                                                                                                                                                       Cleaves the propeptides of type I and II collagen prior to fibril assembly (By similarity). Does not act on type III collagen (By similarity). Cleaves lysyl oxidase LOX at a site downstream of its propeptide cleavage site to produce a short LOX form with reduced collagen-binding activity (PubMed:31152061).\n                         ATS2_HUMAN,O95450\n                         
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Inhibitor of phospholipase A2, also possesses anti-coagulant properties. Also cleaves the cyclic bond of inositol 1,2-cyclic phosphate to form inositol 1-phosphate.\n                         ANXA3_HUMAN,P12429\n                         
## 6                                                                                                                                                                                                                                                                                                                                Catalyzes the oxidation of methanethiol, an organosulfur compound known to be produced in substantial amounts by gut bacteria (PubMed:29255262). Selenium-binding protein which may be involved in the sensing of reactive xenobiotics in the cytoplasm. May be involved in intra-Golgi protein transport (By similarity).\n                         SBP1_HUMAN,Q13228\n

Merge the two retrieved data frames by gene to add to our numeric data of fold change values.

montcData <- merge(monotonic1genes,monotonicSumms,by.x='gene',
                   by.y='gene')

head(montcData)

##          gene     EnsemblGene
## 1     ADAMTS2 ENSG00000087116
## 2       ANXA3 ENSG00000138772
## 3       IFI27 ENSG00000165949
## 4    IGHV1-14 ENSG00000253709
## 5    IGHV1-24 ENSG00000211950
## 6 IGHV1OR15-9 ENSG00000188403
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       EntrezSummary
## 1 This gene encodes a member of the ADAMTS (a disintegrin and metalloproteinase with thrombospondin motifs) protein family. Members of the family share several distinct protein modules, including a propeptide region, a metalloproteinase domain, a disintegrin-like domain, and a thrombospondin type 1 (TS) motif. Individual members of this family differ in the number of C-terminal TS motifs, and some have unique C-terminal domains. The encoded preproprotein is proteolytically processed to generate the mature procollagen N-proteinase. This proteinase excises the N-propeptide of the fibrillar procollagens types I-III and type V. Mutations in this gene cause Ehlers-Danlos syndrome type VIIC, a recessively inherited connective-tissue disorder. Alternative splicing results in multiple transcript variants, at least one of which encodes an isoform that is proteolytically processed. [provided by RefSeq, Feb 2016]
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     This gene encodes a member of the annexin family.  Members of this calcium-dependent phospholipid-binding protein family play a role in the regulation of cellular growth and in signal transduction pathways.  This protein functions in the inhibition of phopholipase A2 and cleavage of inositol 1,2-cyclic phosphate to form inositol 1-phosphate. This protein may also play a role in anti-coagulation. [provided by RefSeq, Jul 2008]
## 3                                                                                                                                                                                                                                                                                                                                     IFI27 (Interferon Alpha Inducible Protein 27) is a Protein Coding gene.                                            Diseases associated with IFI27 include Hepatitis C Virus and Oral Leukoplakia.                                            Among its related pathways are Interferon gamma signaling and Innate Immune System.                                            Gene Ontology (GO) annotations related to this gene include RNA polymerase II activating transcription factor binding and lamin binding.                                            An important paralog of this gene is IFI27L2.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       IGHV1-14 (Immunoglobulin Heavy Variable 1-14 (Pseudogene)) is a Pseudogene.                                                                                                                                                                                
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           IGHV1-24 (Immunoglobulin Heavy Variable 1-24) is a Protein Coding gene.                                                                                                                                                                                An important paralog of this gene is IGHV1-69-2.
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           IGHV1OR15-9 (Immunoglobulin Heavy Variable 1/OR15-9 (Non-Functional)) is a Pseudogene.                                                                                                                                                                                An important paralog of this gene is IGHV1OR21-1.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        GeneCardsSummary
## 1                                             ADAMTS2 (ADAM Metallopeptidase With Thrombospondin Type 1 Motif 2) is a Protein Coding gene.                                            Diseases associated with ADAMTS2 include Ehlers-Danlos Syndrome, Dermatosparaxis Type and Ehlers-Danlos Syndrome.                                            Among its related pathways are Degradation of the extracellular matrix and Metabolism of proteins.                                            Gene Ontology (GO) annotations related to this gene include peptidase activity and metallopeptidase activity.                                            An important paralog of this gene is ADAMTS3.
## 2                                                                                                                                                    ANXA3 (Annexin A3) is a Protein Coding gene.                                            Diseases associated with ANXA3 include Ovarian Cancer and Prostate Cancer.                                            Among its related pathways are Prostaglandin Synthesis and Regulation.                                            Gene Ontology (GO) annotations related to this gene include calcium ion binding and calcium-dependent phospholipid binding.                                            An important paralog of this gene is ANXA4.
## 3                                                                                         IFI27 (Interferon Alpha Inducible Protein 27) is a Protein Coding gene.                                            Diseases associated with IFI27 include Hepatitis C Virus and Oral Leukoplakia.                                            Among its related pathways are Interferon gamma signaling and Innate Immune System.                                            Gene Ontology (GO) annotations related to this gene include RNA polymerase II activating transcription factor binding and lamin binding.                                            An important paralog of this gene is IFI27L2.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                           IGHV1-14 (Immunoglobulin Heavy Variable 1-14 (Pseudogene)) is a Pseudogene.                                                                                                                                                                                
## 5                                                                                                                                                                                                                                                                                                                                                                                               IGHV1-24 (Immunoglobulin Heavy Variable 1-24) is a Protein Coding gene.                                                                                                                                                                                An important paralog of this gene is IGHV1-69-2.
## 6                                                                                                                                                                                                                                                                                                                                                                               IGHV1OR15-9 (Immunoglobulin Heavy Variable 1/OR15-9 (Non-Functional)) is a Pseudogene.                                                                                                                                                                                An important paralog of this gene is IGHV1OR21-1.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Cleaves the propeptides of type I and II collagen prior to fibril assembly (By similarity). Does not act on type III collagen (By similarity). Cleaves lysyl oxidase LOX at a site downstream of its propeptide cleavage site to produce a short LOX form with reduced collagen-binding activity (PubMed:31152061).\n                         ATS2_HUMAN,O95450\n                         
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Inhibitor of phospholipase A2, also possesses anti-coagulant properties. Also cleaves the cyclic bond of inositol 1,2-cyclic phosphate to form inositol 1-phosphate.\n                         ANXA3_HUMAN,P12429\n                         
## 3 Probable adapter protein involved in different biological processes (PubMed:22427340, PubMed:27194766). Part of the signaling pathways that lead to apoptosis (PubMed:18330707, PubMed:27673746, PubMed:24970806). Involved in type-I interferon-induced apoptosis characterized by a rapid and robust release of cytochrome C from the mitochondria and activation of BAX and caspases 2, 3, 6, 8 and 9 (PubMed:18330707, PubMed:27673746). Also functions in TNFSF10-induced apoptosis (PubMed:24970806). May also have a function in the nucleus, where it may be involved in the interferon-induced negative regulation of the transcriptional activity of NR4A1, NR4A2 and NR4A3 through the enhancement of XPO1-mediated nuclear export of these nuclear receptors (PubMed:22427340). May thereby play a role in the vascular response to injury (By similarity). In the innate immune response, has an antiviral activity towards hepatitis C virus/HCV (PubMed:27194766, PubMed:27777077). May prevent the replication of the virus by recruiting both the hepatitis C virus non-structural protein 5A/NS5A and the ubiquitination machinery via SKP2, promoting the ubiquitin-mediated proteasomal degradation of NS5A (PubMed:27194766, PubMed:27777077).\n                         IFI27_HUMAN,P40305\n                         
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  no summary
## 5                                                              V region of the variable domain of immunoglobulin heavy chains that participates in the antigen recognition (PubMed:24600447). Immunoglobulins, also known as antibodies, are membrane-bound or secreted glycoproteins produced by B lymphocytes. In the recognition phase of humoral immunity, the membrane-bound immunoglobulins serve as receptors which, upon binding of a specific antigen, trigger the clonal expansion and differentiation of B lymphocytes into immunoglobulins-secreting plasma cells. Secreted immunoglobulins mediate the effector phase of humoral immunity, which results in the elimination of bound antigens (PubMed:22158414, PubMed:20176268). The antigen binding site is formed by the variable domain of one heavy chain, together with that of its associated light chain. Thus, each immunoglobulin has two antigen binding sites with remarkable affinity for a particular antigen. The variable domains are assembled by a process called V-(D)-J rearrangement and can then be subjected to somatic hypermutations which, after exposure to antigen and selection, allow affinity maturation for a particular antigen (PubMed:20176268, PubMed:17576170).\n                         HV124_HUMAN,A0A0C4DH33\n                         
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  no summary

Now combine to numeric fold change data.

montcData2 <- rbind(mIncr,mDecr)
montcData3 <- merge(montcData,montcData2, by.y='Ensemblid',
                    by.x='EnsemblGene')
head(montcData3)

##       EnsemblGene     gene
## 1 ENSG00000004939   SLC4A1
## 2 ENSG00000069535     MAOB
## 3 ENSG00000076864  RAP1GAP
## 4 ENSG00000087116  ADAMTS2
## 5 ENSG00000138772    ANXA3
## 6 ENSG00000143416 SELENBP1
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            EntrezSummary
## 1 The protein encoded by this gene is part of the anion exchanger (AE) family and is expressed in the erythrocyte plasma membrane, where it functions as a chloride/bicarbonate exchanger involved in carbon dioxide transport from tissues to lungs. The protein comprises two domains that are structurally and functionally distinct. The N-terminal 40kDa domain is located in the cytoplasm and acts as an attachment site for the red cell skeleton by binding ankyrin. The glycosylated C-terminal membrane-associated domain contains 12-14 membrane spanning segments and carries out the stilbene disulphonate-sensitive exchange transport of anions. The cytoplasmic tail at the extreme C-terminus of the membrane domain binds carbonic anhydrase II. The encoded protein associates with the red cell membrane protein glycophorin A and this association promotes the correct folding and translocation of the exchanger. This protein is predominantly dimeric but forms tetramers in the presence of ankyrin. Many mutations in this gene are known in man, and these mutations can lead to two types of disease: destabilization of red cell membrane leading to hereditary spherocytosis, and defective kidney acid secretion leading to distal renal tubular acidosis. Other mutations that do not give rise to disease result in novel blood group antigens, which form the Diego blood group system. Southeast Asian ovalocytosis (SAO, Melanesian ovalocytosis) results from the heterozygous presence of a deletion in the encoded protein and is common in areas where Plasmodium falciparum malaria is endemic. One null mutation in this gene is known, resulting in very severe anemia and nephrocalcinosis. [provided by RefSeq, Jul 2008]
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  The protein encoded by this gene belongs to the flavin monoamine oxidase family. It is a enzyme located in the mitochondrial outer membrane. It catalyzes the oxidative deamination of biogenic and xenobiotic amines and plays an important role in the metabolism of neuroactive and vasoactive amines in the central nervous sysytem and peripheral tissues. This protein preferentially degrades benzylamine and phenylethylamine. [provided by RefSeq, Jul 2008]
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    This gene encodes a type of GTPase-activating-protein (GAP) that down-regulates the activity of the ras-related RAP1 protein. RAP1 acts as a molecular switch by cycling between an inactive GDP-bound form and an active GTP-bound form. The product of this gene, RAP1GAP, promotes the hydrolysis of bound GTP and hence returns RAP1 to the inactive state whereas other proteins, guanine nucleotide exchange factors (GEFs), act as RAP1 activators by facilitating the conversion of RAP1 from the GDP- to the GTP-bound form. In general, ras subfamily proteins, such as RAP1, play key roles in receptor-linked signaling pathways that control cell growth and differentiation. RAP1 plays a role in diverse processes such as cell proliferation, adhesion, differentiation, and embryogenesis. Alternative splicing results in multiple transcript variants encoding distinct proteins. [provided by RefSeq, Aug 2011]
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      This gene encodes a member of the ADAMTS (a disintegrin and metalloproteinase with thrombospondin motifs) protein family. Members of the family share several distinct protein modules, including a propeptide region, a metalloproteinase domain, a disintegrin-like domain, and a thrombospondin type 1 (TS) motif. Individual members of this family differ in the number of C-terminal TS motifs, and some have unique C-terminal domains. The encoded preproprotein is proteolytically processed to generate the mature procollagen N-proteinase. This proteinase excises the N-propeptide of the fibrillar procollagens types I-III and type V. Mutations in this gene cause Ehlers-Danlos syndrome type VIIC, a recessively inherited connective-tissue disorder. Alternative splicing results in multiple transcript variants, at least one of which encodes an isoform that is proteolytically processed. [provided by RefSeq, Feb 2016]
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          This gene encodes a member of the annexin family.  Members of this calcium-dependent phospholipid-binding protein family play a role in the regulation of cellular growth and in signal transduction pathways.  This protein functions in the inhibition of phopholipase A2 and cleavage of inositol 1,2-cyclic phosphate to form inositol 1-phosphate. This protein may also play a role in anti-coagulation. [provided by RefSeq, Jul 2008]
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        This gene encodes a member of the selenium-binding protein family. Selenium is an essential nutrient that exhibits potent anticarcinogenic properties, and deficiency of selenium may cause certain neurologic diseases. The effects of selenium in preventing cancer and neurologic diseases may be mediated by selenium-binding proteins, and decreased expression of this gene may be associated with several types of cancer. The encoded protein may play a selenium-dependent role in ubiquitination/deubiquitination-mediated protein degradation. Alternatively spliced transcript variants encoding multiple isoforms have been observed for this gene. [provided by RefSeq, Apr 2012]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             GeneCardsSummary
## 1                                             SLC4A1 (Solute Carrier Family 4 Member 1 (Diego Blood Group)) is a Protein Coding gene.                                            Diseases associated with SLC4A1 include Renal Tubular Acidosis, Distal, Autosomal Dominant and Cryohydrocytosis.                                            Among its related pathways are Transport of glucose and other sugars, bile salts and organic acids, metal ions and amine compounds and Neuroscience.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and transporter activity.                                            An important paralog of this gene is SLC4A2.
## 2                                                                                                          MAOB (Monoamine Oxidase B) is a Protein Coding gene.                                            Diseases associated with MAOB include Norrie Disease and Pathological Gambling.                                            Among its related pathways are Tyrosine metabolism and Activated PKN1 stimulates transcription of AR (androgen receptor) regulated genes KLK2 and KLK3.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and electron transfer activity.                                            An important paralog of this gene is MAOA.
## 3                                                                                                                       RAP1GAP (RAP1 GTPase Activating Protein) is a Protein Coding gene.                                            Diseases associated with RAP1GAP include Tuberous Sclerosis and Bleeding Disorder, Platelet-Type, 18.                                            Among its related pathways are Ras signaling pathway and Development Angiotensin activation of ERK.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and GTPase activator activity.                                            An important paralog of this gene is RAP1GAP2.
## 4                                                                                                  ADAMTS2 (ADAM Metallopeptidase With Thrombospondin Type 1 Motif 2) is a Protein Coding gene.                                            Diseases associated with ADAMTS2 include Ehlers-Danlos Syndrome, Dermatosparaxis Type and Ehlers-Danlos Syndrome.                                            Among its related pathways are Degradation of the extracellular matrix and Metabolism of proteins.                                            Gene Ontology (GO) annotations related to this gene include peptidase activity and metallopeptidase activity.                                            An important paralog of this gene is ADAMTS3.
## 5                                                                                                                                                                                                         ANXA3 (Annexin A3) is a Protein Coding gene.                                            Diseases associated with ANXA3 include Ovarian Cancer and Prostate Cancer.                                            Among its related pathways are Prostaglandin Synthesis and Regulation.                                            Gene Ontology (GO) annotations related to this gene include calcium ion binding and calcium-dependent phospholipid binding.                                            An important paralog of this gene is ANXA4.
## 6                                                                                                                                                                                                SELENBP1 (Selenium Binding Protein 1) is a Protein Coding gene.                                            Diseases associated with SELENBP1 include Extraoral Halitosis Due To Methanethiol Oxidase Deficiency and Posteroinferior Myocardial Infarction.                                            Among its related pathways are IL1 and megakaryocytes in obesity and Metabolism.                                            Gene Ontology (GO) annotations related to this gene include selenium binding.                                            
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  UniProtKB_Summary
## 1 Functions both as a transporter that mediates electroneutral anion exchange across the cell membrane and as a structural protein. Major integral membrane glycoprotein of the erythrocyte membrane; required for normal flexibility and stability of the erythrocyte membrane and for normal erythrocyte shape via the interactions of its cytoplasmic domain with cytoskeletal proteins, glycolytic enzymes, and hemoglobin. Functions as a transporter that mediates the 1:1 exchange of inorganic anions across the erythrocyte membrane. Mediates chloride-bicarbonate exchange in the kidney, and is required for normal acidification of the urine.\n                         B3AT_HUMAN,P02730\n                         
## 2                                                                                                                                                                                                                                                                                                                                                                             Catalyzes the oxidative deamination of biogenic and xenobiotic amines and has important functions in the metabolism of neuroactive and vasoactive amines in the central nervous system and peripheral tissues. MAOB preferentially degrades benzylamine and phenylethylamine.\n                         AOFB_HUMAN,P27338\n                         
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               GTPase activator for the nuclear Ras-related regulatory protein RAP-1A (KREV-1), converting it to the putatively inactive GDP-bound state.\n                         RPGP1_HUMAN,P47736\n                         
## 4                                                                                                                                                                                                                                                                                                                                       Cleaves the propeptides of type I and II collagen prior to fibril assembly (By similarity). Does not act on type III collagen (By similarity). Cleaves lysyl oxidase LOX at a site downstream of its propeptide cleavage site to produce a short LOX form with reduced collagen-binding activity (PubMed:31152061).\n                         ATS2_HUMAN,O95450\n                         
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Inhibitor of phospholipase A2, also possesses anti-coagulant properties. Also cleaves the cyclic bond of inositol 1,2-cyclic phosphate to form inositol 1-phosphate.\n                         ANXA3_HUMAN,P12429\n                         
## 6                                                                                                                                                                                                                                                                                                                                Catalyzes the oxidation of methanethiol, an organosulfur compound known to be produced in substantial amounts by gut bacteria (PubMed:29255262). Selenium-binding protein which may be involved in the sensing of reactive xenobiotics in the cytoplasm. May be involved in intra-Golgi protein transport (By similarity).\n                         SBP1_HUMAN,Q13228\n                         
##   Moderate.Mean Severe.Mean ICU.mean Healthy.Mean ICU.health.foldChange
## 1        193.75     244.750   491.50     6.529412              75.27477
## 2          5.50      61.375   209.75     4.705882              44.57187
## 3         32.00      51.000   157.00     1.647059              95.32143
## 4          5.00     597.625   600.50     7.352941              81.66800
## 5          8.25      39.250   157.25     4.941176              31.82440
## 6        261.25     338.750  1049.75     8.411765             124.79545
##   mod.health.foldChange sevr.health.foldChange
## 1             29.673423              37.484234
## 2              1.168750              13.042188
## 3             19.428571              30.964286
## 4              0.680000              81.277000
## 5              1.669643               7.943452
## 6             31.057692              40.270979

Lets write this out to csv and create the same charts but add the annotations and gene symbol.

write.csv(montcData3,'monotonicIncDecrSumms.csv',row.names=FALSE)

Now that I have the new data with summary annotations with the gene symbol instead of the Ensembl ID, I have made the Tableau charts and uploaded them to Tableau Public Server.

The first chart is the monotonically decreasing genes selected with moderate fold change values greater than 20. Hovering on each bar and disease state will give the mean value, fold change value also on the labels respective to each disease state, and the Entrez gene summary for the gene. There is an image of the chart without hovering over any bar on the barchart, and also an image of the chart when hovering over any bar on the barchart for that gene and disease state.

Tableau External Window

monotonic decreasing moderate fold change greater than 30

The updated 2nd chart as a new name is the gene symbol and Entrez gene summary annotation for each of these genes shown having monotonically increasing fold change values from least to most severe COVID-19 cases using the ICU fold change greater than 30.

monotonically increasing genes ICU fold change greater than 30

monotonically increasing genes ICU fold change greater than 30 hovered over for details

Next, Lets see if we can use machine learning with just the random forest model and three of the top increasing and decreasing genes to classify the class of COVID-19 severity.

Lets create some conditional features to this data.

montcData3$percentBW_MSI <- ifelse((montcData3$mod.health.foldChange>1.5*
                          montcData3$sevr.health.foldChange & 
                          montcData3$sevr.health.foldChange>1.5*
                          montcData3$ICU.health.foldChange),1,0)
montcData3$percentBW_ISM <- ifelse((montcData3$mod.health.foldChange<.5*
                          montcData3$sevr.health.foldChange & 
                          montcData3$sevr.health.foldChange<.5*
                          montcData3$ICU.health.foldChange),1,0)

We have our three least and most expressed genes, lets now get those genes.

least3 <- montcData3$gene[montcData3$percentBW_MSI==1]
most3 <- montcData3$gene[montcData3$percentBW_ISM==1]

Our ML genes are:

ML_genes <- c(as.character(paste(least3)),as.character(paste(most3)))
ML_genes

## [1] "IGHV1OR15-9" "IGLV1-41"    "IGHV7-4-1"   "MAOB"        "ANXA3"      
## [6] "METTL7B"

ML_DF_leastMost <- subset(montcData3,montcData3$gene %in% ML_genes)
ML_ensmbl_LstMst <- ML_DF_leastMost[,1:2]
ML_ensmbl_LstMst

##        EnsemblGene        gene
## 2  ENSG00000069535        MAOB
## 5  ENSG00000138772       ANXA3
## 8  ENSG00000170439     METTL7B
## 9  ENSG00000188403 IGHV1OR15-9
## 14 ENSG00000253818    IGLV1-41
## 15 ENSG00000282122   IGHV7-4-1

Lets import the original data, when extracting these genes in Tableau using the filters, it wasn’t the same filtering principles used in the creation of our 10 most and least monotonic genes by fold change.

originalDF <- read.delim('GSE152418_p20047_Study1_RawCounts.txt',sep='\t',
                       header=T, na.strings=c('',' ','NA'),
                       stringsAsFactors = F)

colnames(originalDF)

##  [1] "ENSEMBLID"                "S145_nCOV001_C"          
##  [3] "S147_nCoV001EUHM.Draw.1"  "S149_nCoV002EUHM.Draw.2" 
##  [5] "S150_nCoV003EUHM.Draw.1"  "S151_nCoV004EUHM.Draw.1" 
##  [7] "S152_nCoV006EUHM.Draw.1"  "S153_nCoV007EUHM.Draw.1" 
##  [9] "S154_nCoV0010EUHM.Draw.1" "S155_nCOV021EUHM"        
## [11] "S156_nCOV024EUHM.Draw.1"  "S157_nCOV0029EUHM"       
## [13] "S175_nCoV024EUHM.Draw.2"  "S176_nCoV025EUHM.Draw.1" 
## [15] "S177_nCoV025EUHM.Draw.2"  "S178_nCoV028EUHM.Draw.1" 
## [17] "S179_nCoV033EUHM.Draw.1"  "S180_nCoV034EUHM.Draw.1" 
## [19] "S061_257"                 "S062_258"                
## [21] "S063_259"                 "S064_260"                
## [23] "S065_261"                 "S066_265"                
## [25] "S067_270"                 "S068_272"                
## [27] "S069_273"                 "S070_279"                
## [29] "S071_280"                 "S181_255"                
## [31] "S182_SHXA10"              "S183_263"                
## [33] "S184_SHXA18"              "S185_266"                
## [35] "S186_SHXA14"

Lets look at our header information on the demographics of each samples again.

HeaderInformation

##        GSM_ID                   CN_old       CN_new age gender diseaseState
## 1  GSM4615003                 S061_257     healthy1  25      M      Healthy
## 2  GSM4615006                 S062_258     healthy2  70      F      Healthy
## 3  GSM4615008                 S063_259     healthy3  68      F      Healthy
## 4  GSM4615011                 S064_260     healthy4  69      M      Healthy
## 5  GSM4615014                 S065_261     healthy5  29      F      Healthy
## 6  GSM4615016                 S066_265     healthy6  90      M      Healthy
## 7  GSM4615019                 S067_270     healthy7  85      F      Healthy
## 8  GSM4615022                 S068_272     healthy8  28      M      Healthy
## 9  GSM4615025                 S069_273     healthy9  26      F      Healthy
## 10 GSM4615027                 S070_279    healthy10  38      M      Healthy
## 11 GSM4615030                 S071_280    healthy11  84      F      Healthy
## 12 GSM4614985           S145_nCOV001_C convalescent  80      M Convalescent
## 13 GSM4614986  S147_nCoV001EUHM-Draw-1    moderate1  75      F     COVID-19
## 14 GSM4614987  S149_nCoV002EUHM-Draw-2      severe1  54      M     COVID-19
## 15 GSM4614988  S150_nCoV003EUHM-Draw-1      severe2  75      F     COVID-19
## 16 GSM4614989  S151_nCoV004EUHM-Draw-1         ICU1  59      M     COVID-19
## 17 GSM4614990  S152_nCoV006EUHM-Draw-1      severe3  59      M     COVID-19
## 18 GSM4614991  S153_nCoV007EUHM-Draw-1    moderate2  53      F     COVID-19
## 19 GSM4614992 S154_nCoV0010EUHM-Draw-1      severe4  64      F     COVID-19
## 20 GSM4614993         S155_nCOV021EUHM         ICU2  60      F     COVID-19
## 21 GSM4614994  S156_nCOV024EUHM-Draw-1      severe5  48      M     COVID-19
## 22 GSM4614995        S157_nCOV0029EUHM    moderate3  47      F     COVID-19
## 23 GSM4614996  S175_nCoV024EUHM-Draw-2         ICU3  48      M     COVID-19
## 24 GSM4614997  S176_nCoV025EUHM-Draw-1      severe6  56      F     COVID-19
## 25 GSM4614998  S177_nCoV025EUHM-Draw-2         ICU4  56      F     COVID-19
## 26 GSM4614999  S178_nCoV028EUHM-Draw-1      severe7  56      M     COVID-19
## 27 GSM4615000  S179_nCoV033EUHM-Draw-1      severe8  76      M     COVID-19
## 28 GSM4615001  S180_nCoV034EUHM-Draw-1    moderate4  46      F     COVID-19
## 29 GSM4615032                 S181_255    healthy12  27      M      Healthy
## 30 GSM4615033              S182_SHXA10    healthy13  50      F      Healthy
## 31 GSM4615034                 S183_263    healthy14  23      F      Healthy
## 32 GSM4615035              S184_SHXA18    healthy15  55      M      Healthy
## 33 GSM4615036                 S185_266    healthy16  91      F      Healthy
## 34 GSM4615037              S186_SHXA14    healthy17  57      M      Healthy
##        severity geographicLocation cellType
## 1       Healthy   Atlanta, GA, USA     PBMC
## 2       Healthy   Atlanta, GA, USA     PBMC
## 3       Healthy   Atlanta, GA, USA     PBMC
## 4       Healthy   Atlanta, GA, USA     PBMC
## 5       Healthy   Atlanta, GA, USA     PBMC
## 6       Healthy   Atlanta, GA, USA     PBMC
## 7       Healthy   Atlanta, GA, USA     PBMC
## 8       Healthy   Atlanta, GA, USA     PBMC
## 9       Healthy   Atlanta, GA, USA     PBMC
## 10      Healthy   Atlanta, GA, USA     PBMC
## 11      Healthy   Atlanta, GA, USA     PBMC
## 12 Convalescent   Atlanta, GA, USA     PBMC
## 13     Moderate   Atlanta, GA, USA     PBMC
## 14       Severe   Atlanta, GA, USA     PBMC
## 15       Severe   Atlanta, GA, USA     PBMC
## 16          ICU   Atlanta, GA, USA     PBMC
## 17       Severe   Atlanta, GA, USA     PBMC
## 18     Moderate   Atlanta, GA, USA     PBMC
## 19       Severe   Atlanta, GA, USA     PBMC
## 20          ICU   Atlanta, GA, USA     PBMC
## 21       Severe   Atlanta, GA, USA     PBMC
## 22     Moderate   Atlanta, GA, USA     PBMC
## 23          ICU   Atlanta, GA, USA     PBMC
## 24       Severe   Atlanta, GA, USA     PBMC
## 25          ICU   Atlanta, GA, USA     PBMC
## 26       Severe   Atlanta, GA, USA     PBMC
## 27       Severe   Atlanta, GA, USA     PBMC
## 28     Moderate   Atlanta, GA, USA     PBMC
## 29      Healthy   Atlanta, GA, USA     PBMC
## 30      Healthy   Atlanta, GA, USA     PBMC
## 31      Healthy   Atlanta, GA, USA     PBMC
## 32      Healthy   Atlanta, GA, USA     PBMC
## 33      Healthy   Atlanta, GA, USA     PBMC
## 34      Healthy   Atlanta, GA, USA     PBMC

cn <- as.data.frame(colnames(originalDF)[2:35])
colnames(cn) <- 'originalNames'
cn$originalNames <- gsub('[.]','-',cn$originalNames,perl=T)
cn2 <- HeaderInformation[,2:3]
newColumnNames <- merge(cn,cn2,by.x='originalNames',by.y='CN_old')

change the names back once merged to merge back with originalDF and new names.

newColumnNames$originalNames <- gsub('-','.',
                                     newColumnNames$originalNames,
                                     perl=T)

DF1 <- merge(ML_ensmbl_LstMst,originalDF,by.x='EnsemblGene',
             by.y='ENSEMBLID')
DF2 <- DF1[,-1]
row.names(DF2) <- DF2$gene
DF3 <- DF2[,-1]
DF4 <- as.data.frame(t(DF3))
DF4$sample1 <- row.names(DF4)
DF5 <- merge(DF4,newColumnNames, by.y='originalNames',by.x='sample1')
row.names(DF5) <- DF5$sample1
ML_dataframe3s <- DF5[,-1]
head(ML_dataframe3s)

##          MAOB ANXA3 METTL7B IGHV1OR15-9 IGLV1-41 IGHV7-4-1   CN_new
## S061_257    3    14       1           0        8         4 healthy1
## S062_258    8     5       4           0        0         1 healthy2
## S063_259   11     5       8           1       24         0 healthy3
## S064_260    6     2       3           0        2         2 healthy4
## S065_261    8     3      24           0       70       606 healthy5
## S066_265    3     8      31           0        1        29 healthy6

aliases <- ML_dataframe3s$CN_new
ML_dataframe3s$CN_new <- gsub('[0-9]','',ML_dataframe3s$CN_new)

Now lets see how this new set of genes is at predicting our classes of COVID-19. Lets remove the convalescent class first.

ML_dataframe3s$CN_new <- as.factor(paste(ML_dataframe3s$CN_new))
row.names(ML_dataframe3s) <- NULL

set.seed(9875)

inTrain <- createDataPartition(y=ML_dataframe3s$CN_new, p=0.7, list=FALSE)

trainingSet <- ML_dataframe3s[inTrain,]
testingSet <- ML_dataframe3s[-inTrain,]
dim(trainingSet)

## [1] 25  7

dim(testingSet)

## [1] 9 7

Lets first test how well lda predicts the convalescent sample from training all the data samples.

convalescent <- subset(ML_dataframe3s, ML_dataframe3s$CN_new=='convalescent')

set.seed(505005)

ldaMod <- train(CN_new~., method='lda', data=ML_dataframe3s,
                trControl=trainControl(method='boot'))
predlda <- predict(ldaMod, convalescent)

DF_ldaConvalescent <- data.frame(predlda, type=convalescent$CN_new)
DF_ldaConvalescent

##   predlda         type
## 1 healthy convalescent

precisionRecallAccuracy(DF_ldaConvalescent)

##          class precision recall accuracy
## 1 convalescent         0      0        0
## 2           NA         0      0        1
## 3           NA         0      0        1
## 4           NA         0      0        1

Given that the LDA model predicted healthy when it did have data to train on only 1 out of 34 samples having this class, the model predicted the class of the convalescent class to be healthy. Maybe it is a misclassification of a sample. The above shows the accuracy measures. Lets try the random forest model too.

set.seed(505005)

rfMod <- train(CN_new~., method='rf', data=ML_dataframe3s,
                trControl=trainControl(method='boot'))
predrf <- predict(rfMod, convalescent)

DF_rfConvalescent <- data.frame(predrf, type=convalescent$CN_new)
DF_rfConvalescent

##         predrf         type
## 1 convalescent convalescent

Interestingly the random forest model, predicted the convalescent class to be its own class of ‘convalescent’ and it also had one sample of ‘convalescent’ to train on this class. Likely it is because the values in this sample are identical to the values in the testing set because it is the exact sample.

precisionRecallAccuracy(DF_rfConvalescent)

##          class precision recall accuracy
## 1 convalescent         1      1        1
## 2           NA         0      0        1
## 3           NA         0      0        1
## 4           NA         0      0        1

The above measures show that the precision, recall, and accuracy were 100%. What if we made the testing sample a multicollinear version of itself and test the rf model? Lets multiply it by e.

convalescent2 <- convalescent
convalescent2[,1:6] <- exp(convalescent2[,1:6])

set.seed(505005)

rfMod <- train(CN_new~., method='rf', data=ML_dataframe3s,
                trControl=trainControl(method='boot'))
predrf <- predict(rfMod, convalescent2)

DF_rfConvalescent2 <- data.frame(predrf, type=convalescent2$CN_new)
DF_rfConvalescent2

##     predrf         type
## 1 moderate convalescent

When we changed the convalescent sample to the exponential function or natural number, e, of itself, the random forest classified this sample as moderate. We will assume the sample is healthy or moderate and not its own class of convalescent.

If we remove the convalescent sample to train on in the random forest model, will it predict moderate or healthy or a more severe form? Lets also change the class to moderate in the convalescent testing set.

MLDF <- subset(ML_dataframe3s, ML_dataframe3s$CN_new!='convalescent')
MLDF$CN_new <- as.character(paste(MLDF$CN_new))
MLDF$CN_new <- as.factor(MLDF$CN_new)
DFc <- convalescent[,1:6]
DFc$CN_new <- as.factor(paste('moderate'))

set.seed(505005)

rfMod <- train(CN_new~., method='rf', data=MLDF,
                trControl=trainControl(method='boot'))
predrf <- predict(rfMod, DFc)

DF_rfConvalescent3 <- data.frame(predrf, type=DFc$CN_new)
DF_rfConvalescent3

##    predrf     type
## 1 healthy moderate

The random forest model predicted the convalescent class to be healthy. It says moderate, because I had to change the class level or the model threw and exception error and also I had to reclass the 5 factor levels original to the subset of our training data to 4 factors, because it remembers the omitted class of convalescent. So in the previous chunck the outcome feature was reclassed as character then reclassed as factor to omit the remembered class of ‘convalescent.’

We can do the same thing by reclassifying it as healthy for the convalescent class and see if it works in all the data using the random forest model to classify the convalescent class as healthy.

MLDF2 <- ML_dataframe3s
MLDF2[MLDF2$CN_new=='convalescent',7] <- 'healthy'
MLDF2$CN_new <- as.character(paste(MLDF2$CN_new))
MLDF2$CN_new <- as.factor(MLDF2$CN_new)
DFc <- convalescent[,1:6]
DFc$CN_new <- as.factor(paste('healthy'))

set.seed(505005)

rfMod <- train(CN_new~., method='rf', data=MLDF2,
                trControl=trainControl(method='boot'))
predrf <- predict(rfMod, DFc)

DF_rfConvalescent4 <- data.frame(predrf, type=DFc$CN_new)
DF_rfConvalescent4

##    predrf    type
## 1 healthy healthy

So we see that throwing in the convalescent sample as a healthy sample didn’t affect the outcome of the random forest model predicting it as a healthy label. Lets go ahead and keep this dataset, MLDF2, with the convalescent sample as a healthy sample.

grep('convalescent',ML_dataframe3s$CN_new)

## [1] 12

We know that the original sample for convalescent is observation or row 12 in our data. Lets test these new genes on classifying our four classes of data and see if we can see an improvement from our other models.

set.seed(9875)

inTrain <- createDataPartition(y=MLDF2$CN_new, p=0.7, list=FALSE)

trainingSet <- MLDF2[inTrain,]
testingSet <- MLDF2[-inTrain,]
dim(trainingSet)

## [1] 25  7

dim(testingSet)

## [1] 9 7

set.seed(505005)

rfMod <- train(CN_new~., method='rf', data=trainingSet,
                trControl=trainControl(method='boot'))
predrf <- predict(rfMod, testingSet)

DF_RF <- data.frame(predrf, type=testingSet$CN_new)
DF_RF

##    predrf     type
## 1 healthy  healthy
## 2 healthy  healthy
## 3 healthy  healthy
## 4 healthy  healthy
## 5 healthy moderate
## 6     ICU   severe
## 7  severe      ICU
## 8     ICU   severe
## 9 healthy  healthy

precisionRecallAccuracy(DF_RF)

##      class         precision recall          accuracy
## 1  healthy 0.833333333333333      1 0.888888888888889
## 2 moderate                 0      0 0.888888888888889
## 3   severe                 0      0 0.666666666666667
## 4      ICU                 0      0 0.666666666666667

Nope! Not an improvement using these genes. Lets see how lda does and knn.

set.seed(505005)

LDAMod <- train(CN_new~., method='lda', data=trainingSet,
                trControl=trainControl(method='boot'))
predLDA <- predict(LDAMod, testingSet)

DF_LDA <- data.frame(predLDA, type=testingSet$CN_new)
DF_LDA

##   predLDA     type
## 1 healthy  healthy
## 2 healthy  healthy
## 3 healthy  healthy
## 4 healthy  healthy
## 5 healthy moderate
## 6     ICU   severe
## 7  severe      ICU
## 8  severe   severe
## 9 healthy  healthy

precisionRecallAccuracy(DF_LDA)

##      class         precision recall          accuracy
## 1  healthy 0.833333333333333      1 0.888888888888889
## 2 moderate                 0      0 0.888888888888889
## 3   severe               0.5    0.5 0.777777777777778
## 4      ICU                 0      0 0.777777777777778

The LDA model scored slightly better in precision and recall as well as accuracy in the severe class and the same for the healthy class.Now for the knn model.

set.seed(505005)

KNNMod <- train(CN_new~., method='knn', data=trainingSet,
                trControl=trainControl(method='boot'))
predKNN <- predict(KNNMod, testingSet)

DF_KNN <- data.frame(predKNN, type=testingSet$CN_new)
DF_KNN

##    predKNN     type
## 1  healthy  healthy
## 2  healthy  healthy
## 3 moderate  healthy
## 4  healthy  healthy
## 5   severe moderate
## 6      ICU   severe
## 7   severe      ICU
## 8  healthy   severe
## 9  healthy  healthy

precisionRecallAccuracy(DF_KNN)

##      class precision recall          accuracy
## 1  healthy       0.8    0.8 0.777777777777778
## 2 moderate         0      0 0.777777777777778
## 3   severe         0      0 0.555555555555556
## 4      ICU         0      0 0.777777777777778

The KNN model didn’t score any better with these genes for each sample, but slightly worse than the random forest and lda models. Lets see if we can use gbm and rpart for a difference. GBM is what I think is the gradient boosted machines and glm we could also try is the generalized linear models and rpart is a type of trees algorithm for recursive partitioned trees possibly similar to decision trees that don’t build the trees as in depth as the random forest trees.

The rpart model:

set.seed(505005)

RPARTMod <- train(CN_new~., method='rpart', data=trainingSet,
                trControl=trainControl(method='cv'))
predRPART <- predict(RPARTMod, testingSet)

DF_RPART <- data.frame(predRPART, type=testingSet$CN_new)
DF_RPART

The rpart model produced an ‘undefined columns’ error with these settings.

The gbm model:

set.seed(505005)

GBMMod <- train(CN_new~., method='gbm', data=trainingSet,
                trControl=trainControl(method='cv'))
predGBM <- predict(GBMMod, testingSet)

DF_GBM <- data.frame(predGBM, type=testingSet$CN_new)
DF_GBM

The gbm model also produced a ‘Something is wrong; all the Accuracy metric values are missing’ error.

The glm model:

set.seed(505005)

GLMMod <- train(CN_new~., method='glm', data=trainingSet,
                trControl=trainControl(method='boot'))
predGLM <- predict(GLMMod, testingSet)

DF_GLM <- data.frame(predGLM, type=testingSet$CN_new)
DF_GLM

The GLM model also produced the same error as the gbm model with missing accuracy and kappa metric values.

Lets move on to finding out the human body systems like Vitamin D and melanin, and see how they behave in the moderate, severe, and ICU cases compared to the healthy cases by way of our already calculated fold change values. We can use our functions find25genes and the other geneCards2.R script has.

I added the above function to our geneCards2.R sourced script file of functions. This should reset the folder ‘gene scrapes’ that the supplementary files are in too. Resource the script everytime you don’t need the supplementary files from another run, and want fresh files.

source('geneCards2.R')

find25genes('vitamin D')
find25genes('melanin')

getProteinGenes('vitamin D')
getProteinGenes('melanin')

vitD <- read.csv( "Top25vitamin-ds.csv")
melanin <- read.csv("Top25melanins.csv")

vitD3 <- vitD[1:3,1:2]
mel3 <- melanin[1:3,1:2]

integumentary <- rbind(vitD3,mel3)
head(integumentary)

##   proteinType proteinSearched
## 1         VDR       vitamin-d
## 2     CYP27B1       vitamin-d
## 3        PHEX       vitamin-d
## 4         TYR         melanin
## 5       TYRP1         melanin
## 6        OCA2         melanin

for (i in vitD3$proteinType){
  getSummaries2(i,'vitamin D')
}
for (i in mel3$proteinType){
  getSummaries2(i,'melanin')
}

getGeneSummaries('vitamin D')
getGeneSummaries('melanin')

vitD3summs <- read.csv("proteinGeneSummaries_vitamin-d.csv")
mel3summs <- read.csv("proteinGeneSummaries_melanin.csv")

vitDmel_summs <- rbind(vitD3summs,mel3summs)
head(vitDmel_summs)

##   proteinSearched    gene       EnsemblID
## 1       vitamin-d     VDR ENSG00000111424
## 2       vitamin-d CYP27B1 ENSG00000111012
## 3       vitamin-d    PHEX ENSG00000102174
## 4         melanin     TYR ENSG00000077498
## 5         melanin   TYRP1 ENSG00000107165
## 6         melanin    OCA2 ENSG00000104044
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               EntrezSummary
## 1 This gene encodes vitamin D3 receptor, which is a member of the nuclear hormone receptor superfamily of ligand-inducible transcription factors. This receptor also functions as a receptor for the secondary bile acid, lithocholic acid. Downstream targets of vitamin D3 receptor are principally involved in mineral metabolism, though this receptor regulates a variety of other metabolic pathways, such as those involved in immune response and cancer. Mutations in this gene are associated with type II vitamin D-resistant rickets. A single nucleotide polymorphism in the initiation codon results in an alternate translation start site three codons downstream. Alternatively spliced transcript variants encoding different isoforms have been described for this gene. A recent study provided evidence for translational readthrough in this gene, and expression of an additional C-terminally extended isoform via the use of an alternative in-frame translation termination codon. [provided by RefSeq, Jun 2018]
## 2                                                                                                                                                                                                                                          This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins are monooxygenases which catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids. The protein encoded by this gene localizes to the inner mitochondrial membrane where it hydroxylates 25-hydroxyvitamin D3 at the 1alpha position. This reaction synthesizes 1alpha,25-dihydroxyvitamin D3, the active form of vitamin D3, which binds to the vitamin D receptor and regulates calcium metabolism. Thus this enzyme regulates the level of biologically active vitamin D and plays an important role in calcium homeostasis. Mutations in this gene can result in vitamin D-dependent rickets type I. [provided by RefSeq, Jul 2008]
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               The protein encoded by this gene is a transmembrane endopeptidase that belongs to the type II integral membrane zinc-dependent endopeptidase family. The protein is thought to be involved in bone and dentin mineralization and renal phosphate reabsorption. Mutations in this gene cause X-linked hypophosphatemic rickets. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Sep 2013]
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The enzyme encoded by this gene catalyzes the first 2 steps, and at least 1 subsequent step, in the conversion of tyrosine to melanin. The enzyme has both tyrosine hydroxylase and dopa oxidase catalytic activities, and requires copper for function. Mutations in this gene result in oculocutaneous albinism, and nonpathologic polymorphisms result in skin pigmentation variation. The human genome contains a pseudogene similar to the 3' half of this gene. [provided by RefSeq, Oct 2008]
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   This gene encodes a melanosomal enzyme that belongs to the tyrosinase family and plays an important role in the melanin biosynthetic pathway. Defects in this gene are the cause of rufous oculocutaneous albinism and oculocutaneous albinism type III. [provided by RefSeq, Mar 2009]
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                           This gene encodes the human homolog of the mouse p (pink-eyed dilution) gene. The encoded protein is believed to be an integral membrane protein involved in small molecule transport, specifically tyrosine, which is a precursor to melanin synthesis. It is involved in mammalian pigmentation, where it may control skin color variation and act as a determinant of brown or blue eye color. Mutations in this gene result in type 2 oculocutaneous albinism. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jul 2014]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           GeneCardsSummary
## 1                                                                                                                                        VDR (Vitamin D Receptor) is a Protein Coding gene.                                            Diseases associated with VDR include Vitamin D-Dependent Rickets, Type 2A and Rickets.                                            Among its related pathways are Development_Hedgehog and PTH signaling pathways in bone and cartilage development and Tuberculosis.                                            Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and steroid hormone receptor activity.                                            An important paralog of this gene is NR1I2.
## 2                                             CYP27B1 (Cytochrome P450 Family 27 Subfamily B Member 1) is a Protein Coding gene.                                            Diseases associated with CYP27B1 include Vitamin D Hydroxylation-Deficient Rickets, Type 1A and Hypocalcemic Vitamin D-Dependent Rickets.                                            Among its related pathways are Cytochrome P450 - arranged by substrate type and Tuberculosis.                                            Gene Ontology (GO) annotations related to this gene include iron ion binding and oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen.                                            An important paralog of this gene is CYP27A1.
## 3                                                                                                                                                                                                                 PHEX (Phosphate Regulating Endopeptidase Homolog X-Linked) is a Protein Coding gene.                                            Diseases associated with PHEX include Hypophosphatemic Rickets, X-Linked Dominant and Hypophosphatemic Rickets, X-Linked Recessive.                                                                                        Gene Ontology (GO) annotations related to this gene include metalloendopeptidase activity and aminopeptidase activity.                                            An important paralog of this gene is MMEL1.
## 4                                                                                                                                                                                          TYR (Tyrosinase) is a Protein Coding gene.                                            Diseases associated with TYR include Albinism, Oculocutaneous, Type Ia and Albinism, Oculocutaneous, Type Ib.                                            Among its related pathways are (S)-reticuline biosynthesis and Tyrosine metabolism.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity.                                            An important paralog of this gene is TYRP1.
## 5                                                                                                                                               TYRP1 (Tyrosinase Related Protein 1) is a Protein Coding gene.                                            Diseases associated with TYRP1 include Albinism, Oculocutaneous, Type Iii and Skin/Hair/Eye Pigmentation, Variation In, 11.                                            Among its related pathways are Aldosterone synthesis and secretion and Viral mRNA Translation.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity.                                            An important paralog of this gene is DCT.
## 6                                                                                                                                                     OCA2 (OCA2 Melanosomal Transmembrane Protein) is a Protein Coding gene.                                            Diseases associated with OCA2 include Albinism, Oculocutaneous, Type Ii and Skin/Hair/Eye Pigmentation, Variation In, 1.                                            Among its related pathways are Viral mRNA Translation and Metabolism.                                            Gene Ontology (GO) annotations related to this gene include transporter activity and L-tyrosine transmembrane transporter activity.                                            An important paralog of this gene is SLC13A2.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Nuclear receptor for calcitriol, the active form of vitamin D3 which mediates the action of this vitamin on cells (PubMed:28698609, PubMed:16913708, PubMed:15728261, PubMed:10678179). Enters the nucleus upon vitamin D3 binding where it forms heterodimers with the retinoid X receptor/RXR (PubMed:28698609). The VDR-RXR heterodimers bind to specific response elements on DNA and activate the transcription of vitamin D3-responsive target genes (PubMed:28698609). Plays a central role in calcium homeostasis (By similarity).\n                         VDR_HUMAN,P11473\n                         
## 2 A cytochrome P450 monooxygenase involved in vitamin D metabolism and in calcium and phosphorus homeostasis. Catalyzes the rate-limiting step in the activation of vitamin D in the kidney, namely the hydroxylation of 25-hydroxyvitamin D3/calcidiol at the C1alpha-position to form the hormonally active form of vitamin D3, 1alpha,25-dihydroxyvitamin D3/calcitriol that acts via the vitamin D receptor (VDR) (PubMed:10518789, PubMed:9486994, PubMed:22862690, PubMed:10566658, PubMed:12050193). Has 1alpha-hydroxylase activity on vitamin D intermediates of the CYP24A1-mediated inactivation pathway (PubMed:10518789, PubMed:22862690). Converts 24R,25-dihydroxyvitamin D3/secalciferol to 1-alpha,24,25-trihydroxyvitamin D3, an active ligand of VDR. Also active on 25-hydroxyvitamin D2 (PubMed:10518789). Mechanistically, uses molecular oxygen inserting one oxygen atom into a substrate, and reducing the second into a water molecule, with two electrons provided by NADPH via FDXR/adrenodoxin reductase and FDX1/adrenodoxin (PubMed:22862690).\n                         CP27B_HUMAN,O15528\n                         
## 3                                                                                                                                                                                                                                                                                                                                                           Peptidase that cleaves SIBLING (small integrin-binding ligand, N-linked glycoprotein)-derived ASARM peptides, thus regulating their biological activity (PubMed:9593714, PubMed:15664000, PubMed:18162525, PubMed:18597632). Cleaves ASARM peptides between Ser and Glu or Asp residues (PubMed:18597632). Regulates osteogenic cell differentiation and bone mineralization through the cleavage of the MEPE-derived ASARM peptide (PubMed:18597632). Promotes dentin mineralization and renal phosphate reabsorption by cleaving DMP1- and MEPE-derived ASARM peptides (PubMed:18597632, PubMed:18162525). Inhibits the cleavage of MEPE by CTSB/cathepsin B thus preventing MEPE degradation (PubMed:12220505).\n                         PHEX_HUMAN,P78562\n                         
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    This is a copper-containing oxidase that functions in the formation of pigments such as melanins and other polyphenolic compounds. Catalyzes the initial and rate limiting step in the cascade of reactions leading to melanin production from tyrosine. In addition to hydroxylating tyrosine to DOPA (3,4-dihydroxyphenylalanine), also catalyzes the oxidation of DOPA to DOPA-quinone, and possibly the oxidation of DHI (5,6-dihydroxyindole) to indole-5,6 quinone.\n                         TYRO_HUMAN,P14679\n                         
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Plays a role in melanin biosynthesis (PubMed:22556244, PubMed:16704458). Catalyzes the oxidation of 5,6-dihydroxyindole-2-carboxylic acid (DHICA) into indole-5,6-quinone-2-carboxylic acid in the presence of bound Cu(2+) ions, but not in the presence of Zn(2+) (PubMed:28661582). May regulate or influence the type of melanin synthesized (PubMed:22556244, PubMed:16704458). Also to a lower extent, capable of hydroxylating tyrosine and producing melanin (By similarity).\n                         TYRP1_HUMAN,P17643\n                         
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Could be involved in the transport of tyrosine, the precursor to melanin synthesis, within the melanocyte. Regulates the pH of melanosome and the melanosome maturation. One of the components of the mammalian pigmentary system. Seems to regulate the post-translational processing of tyrosinase, which catalyzes the limiting reaction in melanin synthesis. May serve as a key control point at which ethnic skin color variation is determined. Major determinant of brown and/or blue eye color.\n                         P_HUMAN,Q04671\n                         
##                 todaysDate
## 1 Sat Aug 15 18:15:31 2020
## 2 Sat Aug 15 18:15:32 2020
## 3 Sat Aug 15 18:15:33 2020
## 4 Sat Aug 15 18:15:34 2020
## 5 Sat Aug 15 18:15:35 2020
## 6 Sat Aug 15 18:15:36 2020

vitD_melanin <- vitDmel_summs[,c(1:6)]
vitD_melanin

##   proteinSearched    gene       EnsemblID
## 1       vitamin-d     VDR ENSG00000111424
## 2       vitamin-d CYP27B1 ENSG00000111012
## 3       vitamin-d    PHEX ENSG00000102174
## 4         melanin     TYR ENSG00000077498
## 5         melanin   TYRP1 ENSG00000107165
## 6         melanin    OCA2 ENSG00000104044
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               EntrezSummary
## 1 This gene encodes vitamin D3 receptor, which is a member of the nuclear hormone receptor superfamily of ligand-inducible transcription factors. This receptor also functions as a receptor for the secondary bile acid, lithocholic acid. Downstream targets of vitamin D3 receptor are principally involved in mineral metabolism, though this receptor regulates a variety of other metabolic pathways, such as those involved in immune response and cancer. Mutations in this gene are associated with type II vitamin D-resistant rickets. A single nucleotide polymorphism in the initiation codon results in an alternate translation start site three codons downstream. Alternatively spliced transcript variants encoding different isoforms have been described for this gene. A recent study provided evidence for translational readthrough in this gene, and expression of an additional C-terminally extended isoform via the use of an alternative in-frame translation termination codon. [provided by RefSeq, Jun 2018]
## 2                                                                                                                                                                                                                                          This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins are monooxygenases which catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids. The protein encoded by this gene localizes to the inner mitochondrial membrane where it hydroxylates 25-hydroxyvitamin D3 at the 1alpha position. This reaction synthesizes 1alpha,25-dihydroxyvitamin D3, the active form of vitamin D3, which binds to the vitamin D receptor and regulates calcium metabolism. Thus this enzyme regulates the level of biologically active vitamin D and plays an important role in calcium homeostasis. Mutations in this gene can result in vitamin D-dependent rickets type I. [provided by RefSeq, Jul 2008]
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               The protein encoded by this gene is a transmembrane endopeptidase that belongs to the type II integral membrane zinc-dependent endopeptidase family. The protein is thought to be involved in bone and dentin mineralization and renal phosphate reabsorption. Mutations in this gene cause X-linked hypophosphatemic rickets. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Sep 2013]
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The enzyme encoded by this gene catalyzes the first 2 steps, and at least 1 subsequent step, in the conversion of tyrosine to melanin. The enzyme has both tyrosine hydroxylase and dopa oxidase catalytic activities, and requires copper for function. Mutations in this gene result in oculocutaneous albinism, and nonpathologic polymorphisms result in skin pigmentation variation. The human genome contains a pseudogene similar to the 3' half of this gene. [provided by RefSeq, Oct 2008]
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   This gene encodes a melanosomal enzyme that belongs to the tyrosinase family and plays an important role in the melanin biosynthetic pathway. Defects in this gene are the cause of rufous oculocutaneous albinism and oculocutaneous albinism type III. [provided by RefSeq, Mar 2009]
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                           This gene encodes the human homolog of the mouse p (pink-eyed dilution) gene. The encoded protein is believed to be an integral membrane protein involved in small molecule transport, specifically tyrosine, which is a precursor to melanin synthesis. It is involved in mammalian pigmentation, where it may control skin color variation and act as a determinant of brown or blue eye color. Mutations in this gene result in type 2 oculocutaneous albinism. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jul 2014]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           GeneCardsSummary
## 1                                                                                                                                        VDR (Vitamin D Receptor) is a Protein Coding gene.                                            Diseases associated with VDR include Vitamin D-Dependent Rickets, Type 2A and Rickets.                                            Among its related pathways are Development_Hedgehog and PTH signaling pathways in bone and cartilage development and Tuberculosis.                                            Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and steroid hormone receptor activity.                                            An important paralog of this gene is NR1I2.
## 2                                             CYP27B1 (Cytochrome P450 Family 27 Subfamily B Member 1) is a Protein Coding gene.                                            Diseases associated with CYP27B1 include Vitamin D Hydroxylation-Deficient Rickets, Type 1A and Hypocalcemic Vitamin D-Dependent Rickets.                                            Among its related pathways are Cytochrome P450 - arranged by substrate type and Tuberculosis.                                            Gene Ontology (GO) annotations related to this gene include iron ion binding and oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen.                                            An important paralog of this gene is CYP27A1.
## 3                                                                                                                                                                                                                 PHEX (Phosphate Regulating Endopeptidase Homolog X-Linked) is a Protein Coding gene.                                            Diseases associated with PHEX include Hypophosphatemic Rickets, X-Linked Dominant and Hypophosphatemic Rickets, X-Linked Recessive.                                                                                        Gene Ontology (GO) annotations related to this gene include metalloendopeptidase activity and aminopeptidase activity.                                            An important paralog of this gene is MMEL1.
## 4                                                                                                                                                                                          TYR (Tyrosinase) is a Protein Coding gene.                                            Diseases associated with TYR include Albinism, Oculocutaneous, Type Ia and Albinism, Oculocutaneous, Type Ib.                                            Among its related pathways are (S)-reticuline biosynthesis and Tyrosine metabolism.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity.                                            An important paralog of this gene is TYRP1.
## 5                                                                                                                                               TYRP1 (Tyrosinase Related Protein 1) is a Protein Coding gene.                                            Diseases associated with TYRP1 include Albinism, Oculocutaneous, Type Iii and Skin/Hair/Eye Pigmentation, Variation In, 11.                                            Among its related pathways are Aldosterone synthesis and secretion and Viral mRNA Translation.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity.                                            An important paralog of this gene is DCT.
## 6                                                                                                                                                     OCA2 (OCA2 Melanosomal Transmembrane Protein) is a Protein Coding gene.                                            Diseases associated with OCA2 include Albinism, Oculocutaneous, Type Ii and Skin/Hair/Eye Pigmentation, Variation In, 1.                                            Among its related pathways are Viral mRNA Translation and Metabolism.                                            Gene Ontology (GO) annotations related to this gene include transporter activity and L-tyrosine transmembrane transporter activity.                                            An important paralog of this gene is SLC13A2.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Nuclear receptor for calcitriol, the active form of vitamin D3 which mediates the action of this vitamin on cells (PubMed:28698609, PubMed:16913708, PubMed:15728261, PubMed:10678179). Enters the nucleus upon vitamin D3 binding where it forms heterodimers with the retinoid X receptor/RXR (PubMed:28698609). The VDR-RXR heterodimers bind to specific response elements on DNA and activate the transcription of vitamin D3-responsive target genes (PubMed:28698609). Plays a central role in calcium homeostasis (By similarity).\n                         VDR_HUMAN,P11473\n                         
## 2 A cytochrome P450 monooxygenase involved in vitamin D metabolism and in calcium and phosphorus homeostasis. Catalyzes the rate-limiting step in the activation of vitamin D in the kidney, namely the hydroxylation of 25-hydroxyvitamin D3/calcidiol at the C1alpha-position to form the hormonally active form of vitamin D3, 1alpha,25-dihydroxyvitamin D3/calcitriol that acts via the vitamin D receptor (VDR) (PubMed:10518789, PubMed:9486994, PubMed:22862690, PubMed:10566658, PubMed:12050193). Has 1alpha-hydroxylase activity on vitamin D intermediates of the CYP24A1-mediated inactivation pathway (PubMed:10518789, PubMed:22862690). Converts 24R,25-dihydroxyvitamin D3/secalciferol to 1-alpha,24,25-trihydroxyvitamin D3, an active ligand of VDR. Also active on 25-hydroxyvitamin D2 (PubMed:10518789). Mechanistically, uses molecular oxygen inserting one oxygen atom into a substrate, and reducing the second into a water molecule, with two electrons provided by NADPH via FDXR/adrenodoxin reductase and FDX1/adrenodoxin (PubMed:22862690).\n                         CP27B_HUMAN,O15528\n                         
## 3                                                                                                                                                                                                                                                                                                                                                           Peptidase that cleaves SIBLING (small integrin-binding ligand, N-linked glycoprotein)-derived ASARM peptides, thus regulating their biological activity (PubMed:9593714, PubMed:15664000, PubMed:18162525, PubMed:18597632). Cleaves ASARM peptides between Ser and Glu or Asp residues (PubMed:18597632). Regulates osteogenic cell differentiation and bone mineralization through the cleavage of the MEPE-derived ASARM peptide (PubMed:18597632). Promotes dentin mineralization and renal phosphate reabsorption by cleaving DMP1- and MEPE-derived ASARM peptides (PubMed:18597632, PubMed:18162525). Inhibits the cleavage of MEPE by CTSB/cathepsin B thus preventing MEPE degradation (PubMed:12220505).\n                         PHEX_HUMAN,P78562\n                         
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    This is a copper-containing oxidase that functions in the formation of pigments such as melanins and other polyphenolic compounds. Catalyzes the initial and rate limiting step in the cascade of reactions leading to melanin production from tyrosine. In addition to hydroxylating tyrosine to DOPA (3,4-dihydroxyphenylalanine), also catalyzes the oxidation of DOPA to DOPA-quinone, and possibly the oxidation of DHI (5,6-dihydroxyindole) to indole-5,6 quinone.\n                         TYRO_HUMAN,P14679\n                         
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Plays a role in melanin biosynthesis (PubMed:22556244, PubMed:16704458). Catalyzes the oxidation of 5,6-dihydroxyindole-2-carboxylic acid (DHICA) into indole-5,6-quinone-2-carboxylic acid in the presence of bound Cu(2+) ions, but not in the presence of Zn(2+) (PubMed:28661582). May regulate or influence the type of melanin synthesized (PubMed:22556244, PubMed:16704458). Also to a lower extent, capable of hydroxylating tyrosine and producing melanin (By similarity).\n                         TYRP1_HUMAN,P17643\n                         
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Could be involved in the transport of tyrosine, the precursor to melanin synthesis, within the melanocyte. Regulates the pH of melanosome and the melanosome maturation. One of the components of the mammalian pigmentary system. Seems to regulate the post-translational processing of tyrosinase, which catalyzes the limiting reaction in melanin synthesis. May serve as a key control point at which ethnic skin color variation is determined. Major determinant of brown and/or blue eye color.\n                         P_HUMAN,Q04671\n

origFCs <- read.csv('DATA_FCs_GSE152418.csv',sep=',',
                    na.strings=c('',' ','NA'),
                    stringsAsFactors = F)
sun <- merge(vitD_melanin,origFCs, by.x='EnsemblID',
             by.y='ENSEMBLID')
sun

##         EnsemblID proteinSearched    gene
## 1 ENSG00000077498         melanin     TYR
## 2 ENSG00000102174       vitamin-d    PHEX
## 3 ENSG00000104044         melanin    OCA2
## 4 ENSG00000107165         melanin   TYRP1
## 5 ENSG00000111012       vitamin-d CYP27B1
## 6 ENSG00000111424       vitamin-d     VDR
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               EntrezSummary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The enzyme encoded by this gene catalyzes the first 2 steps, and at least 1 subsequent step, in the conversion of tyrosine to melanin. The enzyme has both tyrosine hydroxylase and dopa oxidase catalytic activities, and requires copper for function. Mutations in this gene result in oculocutaneous albinism, and nonpathologic polymorphisms result in skin pigmentation variation. The human genome contains a pseudogene similar to the 3' half of this gene. [provided by RefSeq, Oct 2008]
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               The protein encoded by this gene is a transmembrane endopeptidase that belongs to the type II integral membrane zinc-dependent endopeptidase family. The protein is thought to be involved in bone and dentin mineralization and renal phosphate reabsorption. Mutations in this gene cause X-linked hypophosphatemic rickets. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Sep 2013]
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                           This gene encodes the human homolog of the mouse p (pink-eyed dilution) gene. The encoded protein is believed to be an integral membrane protein involved in small molecule transport, specifically tyrosine, which is a precursor to melanin synthesis. It is involved in mammalian pigmentation, where it may control skin color variation and act as a determinant of brown or blue eye color. Mutations in this gene result in type 2 oculocutaneous albinism. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jul 2014]
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   This gene encodes a melanosomal enzyme that belongs to the tyrosinase family and plays an important role in the melanin biosynthetic pathway. Defects in this gene are the cause of rufous oculocutaneous albinism and oculocutaneous albinism type III. [provided by RefSeq, Mar 2009]
## 5                                                                                                                                                                                                                                          This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins are monooxygenases which catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids. The protein encoded by this gene localizes to the inner mitochondrial membrane where it hydroxylates 25-hydroxyvitamin D3 at the 1alpha position. This reaction synthesizes 1alpha,25-dihydroxyvitamin D3, the active form of vitamin D3, which binds to the vitamin D receptor and regulates calcium metabolism. Thus this enzyme regulates the level of biologically active vitamin D and plays an important role in calcium homeostasis. Mutations in this gene can result in vitamin D-dependent rickets type I. [provided by RefSeq, Jul 2008]
## 6 This gene encodes vitamin D3 receptor, which is a member of the nuclear hormone receptor superfamily of ligand-inducible transcription factors. This receptor also functions as a receptor for the secondary bile acid, lithocholic acid. Downstream targets of vitamin D3 receptor are principally involved in mineral metabolism, though this receptor regulates a variety of other metabolic pathways, such as those involved in immune response and cancer. Mutations in this gene are associated with type II vitamin D-resistant rickets. A single nucleotide polymorphism in the initiation codon results in an alternate translation start site three codons downstream. Alternatively spliced transcript variants encoding different isoforms have been described for this gene. A recent study provided evidence for translational readthrough in this gene, and expression of an additional C-terminally extended isoform via the use of an alternative in-frame translation termination codon. [provided by RefSeq, Jun 2018]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           GeneCardsSummary
## 1                                                                                                                                                                                          TYR (Tyrosinase) is a Protein Coding gene.                                            Diseases associated with TYR include Albinism, Oculocutaneous, Type Ia and Albinism, Oculocutaneous, Type Ib.                                            Among its related pathways are (S)-reticuline biosynthesis and Tyrosine metabolism.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity.                                            An important paralog of this gene is TYRP1.
## 2                                                                                                                                                                                                                 PHEX (Phosphate Regulating Endopeptidase Homolog X-Linked) is a Protein Coding gene.                                            Diseases associated with PHEX include Hypophosphatemic Rickets, X-Linked Dominant and Hypophosphatemic Rickets, X-Linked Recessive.                                                                                        Gene Ontology (GO) annotations related to this gene include metalloendopeptidase activity and aminopeptidase activity.                                            An important paralog of this gene is MMEL1.
## 3                                                                                                                                                     OCA2 (OCA2 Melanosomal Transmembrane Protein) is a Protein Coding gene.                                            Diseases associated with OCA2 include Albinism, Oculocutaneous, Type Ii and Skin/Hair/Eye Pigmentation, Variation In, 1.                                            Among its related pathways are Viral mRNA Translation and Metabolism.                                            Gene Ontology (GO) annotations related to this gene include transporter activity and L-tyrosine transmembrane transporter activity.                                            An important paralog of this gene is SLC13A2.
## 4                                                                                                                                               TYRP1 (Tyrosinase Related Protein 1) is a Protein Coding gene.                                            Diseases associated with TYRP1 include Albinism, Oculocutaneous, Type Iii and Skin/Hair/Eye Pigmentation, Variation In, 11.                                            Among its related pathways are Aldosterone synthesis and secretion and Viral mRNA Translation.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity.                                            An important paralog of this gene is DCT.
## 5                                             CYP27B1 (Cytochrome P450 Family 27 Subfamily B Member 1) is a Protein Coding gene.                                            Diseases associated with CYP27B1 include Vitamin D Hydroxylation-Deficient Rickets, Type 1A and Hypocalcemic Vitamin D-Dependent Rickets.                                            Among its related pathways are Cytochrome P450 - arranged by substrate type and Tuberculosis.                                            Gene Ontology (GO) annotations related to this gene include iron ion binding and oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen.                                            An important paralog of this gene is CYP27A1.
## 6                                                                                                                                        VDR (Vitamin D Receptor) is a Protein Coding gene.                                            Diseases associated with VDR include Vitamin D-Dependent Rickets, Type 2A and Rickets.                                            Among its related pathways are Development_Hedgehog and PTH signaling pathways in bone and cartilage development and Tuberculosis.                                            Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and steroid hormone receptor activity.                                            An important paralog of this gene is NR1I2.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    This is a copper-containing oxidase that functions in the formation of pigments such as melanins and other polyphenolic compounds. Catalyzes the initial and rate limiting step in the cascade of reactions leading to melanin production from tyrosine. In addition to hydroxylating tyrosine to DOPA (3,4-dihydroxyphenylalanine), also catalyzes the oxidation of DOPA to DOPA-quinone, and possibly the oxidation of DHI (5,6-dihydroxyindole) to indole-5,6 quinone.\n                         TYRO_HUMAN,P14679\n                         
## 2                                                                                                                                                                                                                                                                                                                                                           Peptidase that cleaves SIBLING (small integrin-binding ligand, N-linked glycoprotein)-derived ASARM peptides, thus regulating their biological activity (PubMed:9593714, PubMed:15664000, PubMed:18162525, PubMed:18597632). Cleaves ASARM peptides between Ser and Glu or Asp residues (PubMed:18597632). Regulates osteogenic cell differentiation and bone mineralization through the cleavage of the MEPE-derived ASARM peptide (PubMed:18597632). Promotes dentin mineralization and renal phosphate reabsorption by cleaving DMP1- and MEPE-derived ASARM peptides (PubMed:18597632, PubMed:18162525). Inhibits the cleavage of MEPE by CTSB/cathepsin B thus preventing MEPE degradation (PubMed:12220505).\n                         PHEX_HUMAN,P78562\n                         
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Could be involved in the transport of tyrosine, the precursor to melanin synthesis, within the melanocyte. Regulates the pH of melanosome and the melanosome maturation. One of the components of the mammalian pigmentary system. Seems to regulate the post-translational processing of tyrosinase, which catalyzes the limiting reaction in melanin synthesis. May serve as a key control point at which ethnic skin color variation is determined. Major determinant of brown and/or blue eye color.\n                         P_HUMAN,Q04671\n                         
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Plays a role in melanin biosynthesis (PubMed:22556244, PubMed:16704458). Catalyzes the oxidation of 5,6-dihydroxyindole-2-carboxylic acid (DHICA) into indole-5,6-quinone-2-carboxylic acid in the presence of bound Cu(2+) ions, but not in the presence of Zn(2+) (PubMed:28661582). May regulate or influence the type of melanin synthesized (PubMed:22556244, PubMed:16704458). Also to a lower extent, capable of hydroxylating tyrosine and producing melanin (By similarity).\n                         TYRP1_HUMAN,P17643\n                         
## 5 A cytochrome P450 monooxygenase involved in vitamin D metabolism and in calcium and phosphorus homeostasis. Catalyzes the rate-limiting step in the activation of vitamin D in the kidney, namely the hydroxylation of 25-hydroxyvitamin D3/calcidiol at the C1alpha-position to form the hormonally active form of vitamin D3, 1alpha,25-dihydroxyvitamin D3/calcitriol that acts via the vitamin D receptor (VDR) (PubMed:10518789, PubMed:9486994, PubMed:22862690, PubMed:10566658, PubMed:12050193). Has 1alpha-hydroxylase activity on vitamin D intermediates of the CYP24A1-mediated inactivation pathway (PubMed:10518789, PubMed:22862690). Converts 24R,25-dihydroxyvitamin D3/secalciferol to 1-alpha,24,25-trihydroxyvitamin D3, an active ligand of VDR. Also active on 25-hydroxyvitamin D2 (PubMed:10518789). Mechanistically, uses molecular oxygen inserting one oxygen atom into a substrate, and reducing the second into a water molecule, with two electrons provided by NADPH via FDXR/adrenodoxin reductase and FDX1/adrenodoxin (PubMed:22862690).\n                         CP27B_HUMAN,O15528\n                         
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Nuclear receptor for calcitriol, the active form of vitamin D3 which mediates the action of this vitamin on cells (PubMed:28698609, PubMed:16913708, PubMed:15728261, PubMed:10678179). Enters the nucleus upon vitamin D3 binding where it forms heterodimers with the retinoid X receptor/RXR (PubMed:28698609). The VDR-RXR heterodimers bind to specific response elements on DNA and activate the transcription of vitamin D3-responsive target genes (PubMed:28698609). Plays a central role in calcium homeostasis (By similarity).\n                         VDR_HUMAN,P11473\n                         
##   convalescent healthy1 healthy2 healthy3 healthy4 healthy5 healthy6 healthy7
## 1            0        0        0        0        0        0        0        0
## 2           24       17       25       27        9       44       88       15
## 3            2        0        3        1        0        2        0        2
## 4            0        0        0        0        1        0        0        0
## 5           11       13       16        2        9       14       11        4
## 6          539      592      437      303      365      658      846      471
##   healthy8 healthy9 healthy10 healthy11 healthy12 healthy13 healthy14 healthy15
## 1        0        0         0         0         0         0         0         2
## 2       29       38        23       248        10        10        43        40
## 3        4        3         2         0         3         3         2         1
## 4        0        0         0         0         0         0         1         0
## 5        1        8         5         9         4         5        11        12
## 6      502      422       458       488       373       361       334       368
##   healthy16 healthy17 moderate1 moderate2 moderate3 moderate4 severe1 severe2
## 1         0         0         0         0         0         0       1       0
## 2         5        25        14         6         8        10      11       6
## 3         4         4         1         0         4         1       4       2
## 4         0         0         0         1         1         2       1       1
## 5         4         6         5         6        12        15       3      19
## 6       318       330       188       421       469       360     415     400
##   severe3 severe4 severe5 severe6 severe7 severe8 ICU1 ICU2 ICU3 ICU4
## 1       0       0       0       1       0       1    0    2    0    0
## 2       2       3      47      12       7       8   13    9   30    9
## 3       1       1       0       2       2      27    0    5    3    1
## 4       0       0       0       1       1       0    0    1    4    0
## 5      13       0       4       7       9      27   11   15    1    7
## 6     335     623     670     493     398     376  708  639  701  424
##   healthy_mean moderate_mean severe_mean ICU_mean mod_health_foldChange
## 1    0.1176471           0.0       0.375     0.50             0.8823529
## 2   40.9411800           9.5      12.000    15.25             0.2320402
## 3    2.0000000           1.5       4.875     2.25             0.7500000
## 4    0.1176471           1.0       0.500     1.25             8.5000000
## 5    7.8823530           9.5      10.250     8.50             1.2052240
## 6  448.5882000         359.5     463.750   618.00             0.8014031
##   sevr_health_foldChange ICU_health_foldChange
## 1              3.1875000             4.2500000
## 2              0.2931034             0.3724856
## 3              2.4375000             1.1250000
## 4              4.2500000            10.6250000
## 5              1.3003730             1.0783580
## 6              1.0337990             1.3776550

Great lets write this out to csv and use it to put together some quick Tableau charts.

write.csv(sun,'sunGenes.csv',row.names=F)

Tableau External Window

Each of the following are from the same chart comparing sun Gene values in COVID-19 cases and VDR does increase with the severity of the case. The top outlier was subsequently exluded within Tableau to knock down the outlier genes, but hovering gives the names and gene summary as well as lableded fold change values of the disease/healthy mean ratio for each respective disease class.

image 1 sun Genes with VDR at the top

image 2 sun Genes with VDR excluded and PHEX as the highest healthy mean value showing

image 3 sun Genes with PHEX and VDR excluded

image 4 sun Genes with PHEX, VDR, and CYP27B1 excluded

Lets look at the data in another way. I want to gather the columns and combine with the demographic information. But first, there are two genes in the vitamin D genes’ top 3 ranked from genecards.org, PHEX and VDR, that increase in fold change in the cases of moderate, to severe, to ICU for COVID-19, and one melanin gene, TYR.

sun[,c(2:4,45:47)]

##   proteinSearched    gene
## 1         melanin     TYR
## 2       vitamin-d    PHEX
## 3         melanin    OCA2
## 4         melanin   TYRP1
## 5       vitamin-d CYP27B1
## 6       vitamin-d     VDR
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               EntrezSummary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The enzyme encoded by this gene catalyzes the first 2 steps, and at least 1 subsequent step, in the conversion of tyrosine to melanin. The enzyme has both tyrosine hydroxylase and dopa oxidase catalytic activities, and requires copper for function. Mutations in this gene result in oculocutaneous albinism, and nonpathologic polymorphisms result in skin pigmentation variation. The human genome contains a pseudogene similar to the 3' half of this gene. [provided by RefSeq, Oct 2008]
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               The protein encoded by this gene is a transmembrane endopeptidase that belongs to the type II integral membrane zinc-dependent endopeptidase family. The protein is thought to be involved in bone and dentin mineralization and renal phosphate reabsorption. Mutations in this gene cause X-linked hypophosphatemic rickets. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Sep 2013]
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                           This gene encodes the human homolog of the mouse p (pink-eyed dilution) gene. The encoded protein is believed to be an integral membrane protein involved in small molecule transport, specifically tyrosine, which is a precursor to melanin synthesis. It is involved in mammalian pigmentation, where it may control skin color variation and act as a determinant of brown or blue eye color. Mutations in this gene result in type 2 oculocutaneous albinism. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jul 2014]
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   This gene encodes a melanosomal enzyme that belongs to the tyrosinase family and plays an important role in the melanin biosynthetic pathway. Defects in this gene are the cause of rufous oculocutaneous albinism and oculocutaneous albinism type III. [provided by RefSeq, Mar 2009]
## 5                                                                                                                                                                                                                                          This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins are monooxygenases which catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids. The protein encoded by this gene localizes to the inner mitochondrial membrane where it hydroxylates 25-hydroxyvitamin D3 at the 1alpha position. This reaction synthesizes 1alpha,25-dihydroxyvitamin D3, the active form of vitamin D3, which binds to the vitamin D receptor and regulates calcium metabolism. Thus this enzyme regulates the level of biologically active vitamin D and plays an important role in calcium homeostasis. Mutations in this gene can result in vitamin D-dependent rickets type I. [provided by RefSeq, Jul 2008]
## 6 This gene encodes vitamin D3 receptor, which is a member of the nuclear hormone receptor superfamily of ligand-inducible transcription factors. This receptor also functions as a receptor for the secondary bile acid, lithocholic acid. Downstream targets of vitamin D3 receptor are principally involved in mineral metabolism, though this receptor regulates a variety of other metabolic pathways, such as those involved in immune response and cancer. Mutations in this gene are associated with type II vitamin D-resistant rickets. A single nucleotide polymorphism in the initiation codon results in an alternate translation start site three codons downstream. Alternatively spliced transcript variants encoding different isoforms have been described for this gene. A recent study provided evidence for translational readthrough in this gene, and expression of an additional C-terminally extended isoform via the use of an alternative in-frame translation termination codon. [provided by RefSeq, Jun 2018]
##   mod_health_foldChange sevr_health_foldChange ICU_health_foldChange
## 1             0.8823529              3.1875000             4.2500000
## 2             0.2320402              0.2931034             0.3724856
## 3             0.7500000              2.4375000             1.1250000
## 4             8.5000000              4.2500000            10.6250000
## 5             1.2052240              1.3003730             1.0783580
## 6             0.8014031              1.0337990             1.3776550

sunGathered <- gather(sun, key='sampleType',value='geneValue',7:40)
head(sunGathered)

##         EnsemblID proteinSearched    gene
## 1 ENSG00000077498         melanin     TYR
## 2 ENSG00000102174       vitamin-d    PHEX
## 3 ENSG00000104044         melanin    OCA2
## 4 ENSG00000107165         melanin   TYRP1
## 5 ENSG00000111012       vitamin-d CYP27B1
## 6 ENSG00000111424       vitamin-d     VDR
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               EntrezSummary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The enzyme encoded by this gene catalyzes the first 2 steps, and at least 1 subsequent step, in the conversion of tyrosine to melanin. The enzyme has both tyrosine hydroxylase and dopa oxidase catalytic activities, and requires copper for function. Mutations in this gene result in oculocutaneous albinism, and nonpathologic polymorphisms result in skin pigmentation variation. The human genome contains a pseudogene similar to the 3' half of this gene. [provided by RefSeq, Oct 2008]
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               The protein encoded by this gene is a transmembrane endopeptidase that belongs to the type II integral membrane zinc-dependent endopeptidase family. The protein is thought to be involved in bone and dentin mineralization and renal phosphate reabsorption. Mutations in this gene cause X-linked hypophosphatemic rickets. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Sep 2013]
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                           This gene encodes the human homolog of the mouse p (pink-eyed dilution) gene. The encoded protein is believed to be an integral membrane protein involved in small molecule transport, specifically tyrosine, which is a precursor to melanin synthesis. It is involved in mammalian pigmentation, where it may control skin color variation and act as a determinant of brown or blue eye color. Mutations in this gene result in type 2 oculocutaneous albinism. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jul 2014]
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   This gene encodes a melanosomal enzyme that belongs to the tyrosinase family and plays an important role in the melanin biosynthetic pathway. Defects in this gene are the cause of rufous oculocutaneous albinism and oculocutaneous albinism type III. [provided by RefSeq, Mar 2009]
## 5                                                                                                                                                                                                                                          This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins are monooxygenases which catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids. The protein encoded by this gene localizes to the inner mitochondrial membrane where it hydroxylates 25-hydroxyvitamin D3 at the 1alpha position. This reaction synthesizes 1alpha,25-dihydroxyvitamin D3, the active form of vitamin D3, which binds to the vitamin D receptor and regulates calcium metabolism. Thus this enzyme regulates the level of biologically active vitamin D and plays an important role in calcium homeostasis. Mutations in this gene can result in vitamin D-dependent rickets type I. [provided by RefSeq, Jul 2008]
## 6 This gene encodes vitamin D3 receptor, which is a member of the nuclear hormone receptor superfamily of ligand-inducible transcription factors. This receptor also functions as a receptor for the secondary bile acid, lithocholic acid. Downstream targets of vitamin D3 receptor are principally involved in mineral metabolism, though this receptor regulates a variety of other metabolic pathways, such as those involved in immune response and cancer. Mutations in this gene are associated with type II vitamin D-resistant rickets. A single nucleotide polymorphism in the initiation codon results in an alternate translation start site three codons downstream. Alternatively spliced transcript variants encoding different isoforms have been described for this gene. A recent study provided evidence for translational readthrough in this gene, and expression of an additional C-terminally extended isoform via the use of an alternative in-frame translation termination codon. [provided by RefSeq, Jun 2018]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           GeneCardsSummary
## 1                                                                                                                                                                                          TYR (Tyrosinase) is a Protein Coding gene.                                            Diseases associated with TYR include Albinism, Oculocutaneous, Type Ia and Albinism, Oculocutaneous, Type Ib.                                            Among its related pathways are (S)-reticuline biosynthesis and Tyrosine metabolism.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity.                                            An important paralog of this gene is TYRP1.
## 2                                                                                                                                                                                                                 PHEX (Phosphate Regulating Endopeptidase Homolog X-Linked) is a Protein Coding gene.                                            Diseases associated with PHEX include Hypophosphatemic Rickets, X-Linked Dominant and Hypophosphatemic Rickets, X-Linked Recessive.                                                                                        Gene Ontology (GO) annotations related to this gene include metalloendopeptidase activity and aminopeptidase activity.                                            An important paralog of this gene is MMEL1.
## 3                                                                                                                                                     OCA2 (OCA2 Melanosomal Transmembrane Protein) is a Protein Coding gene.                                            Diseases associated with OCA2 include Albinism, Oculocutaneous, Type Ii and Skin/Hair/Eye Pigmentation, Variation In, 1.                                            Among its related pathways are Viral mRNA Translation and Metabolism.                                            Gene Ontology (GO) annotations related to this gene include transporter activity and L-tyrosine transmembrane transporter activity.                                            An important paralog of this gene is SLC13A2.
## 4                                                                                                                                               TYRP1 (Tyrosinase Related Protein 1) is a Protein Coding gene.                                            Diseases associated with TYRP1 include Albinism, Oculocutaneous, Type Iii and Skin/Hair/Eye Pigmentation, Variation In, 11.                                            Among its related pathways are Aldosterone synthesis and secretion and Viral mRNA Translation.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity.                                            An important paralog of this gene is DCT.
## 5                                             CYP27B1 (Cytochrome P450 Family 27 Subfamily B Member 1) is a Protein Coding gene.                                            Diseases associated with CYP27B1 include Vitamin D Hydroxylation-Deficient Rickets, Type 1A and Hypocalcemic Vitamin D-Dependent Rickets.                                            Among its related pathways are Cytochrome P450 - arranged by substrate type and Tuberculosis.                                            Gene Ontology (GO) annotations related to this gene include iron ion binding and oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen.                                            An important paralog of this gene is CYP27A1.
## 6                                                                                                                                        VDR (Vitamin D Receptor) is a Protein Coding gene.                                            Diseases associated with VDR include Vitamin D-Dependent Rickets, Type 2A and Rickets.                                            Among its related pathways are Development_Hedgehog and PTH signaling pathways in bone and cartilage development and Tuberculosis.                                            Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and steroid hormone receptor activity.                                            An important paralog of this gene is NR1I2.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    This is a copper-containing oxidase that functions in the formation of pigments such as melanins and other polyphenolic compounds. Catalyzes the initial and rate limiting step in the cascade of reactions leading to melanin production from tyrosine. In addition to hydroxylating tyrosine to DOPA (3,4-dihydroxyphenylalanine), also catalyzes the oxidation of DOPA to DOPA-quinone, and possibly the oxidation of DHI (5,6-dihydroxyindole) to indole-5,6 quinone.\n                         TYRO_HUMAN,P14679\n                         
## 2                                                                                                                                                                                                                                                                                                                                                           Peptidase that cleaves SIBLING (small integrin-binding ligand, N-linked glycoprotein)-derived ASARM peptides, thus regulating their biological activity (PubMed:9593714, PubMed:15664000, PubMed:18162525, PubMed:18597632). Cleaves ASARM peptides between Ser and Glu or Asp residues (PubMed:18597632). Regulates osteogenic cell differentiation and bone mineralization through the cleavage of the MEPE-derived ASARM peptide (PubMed:18597632). Promotes dentin mineralization and renal phosphate reabsorption by cleaving DMP1- and MEPE-derived ASARM peptides (PubMed:18597632, PubMed:18162525). Inhibits the cleavage of MEPE by CTSB/cathepsin B thus preventing MEPE degradation (PubMed:12220505).\n                         PHEX_HUMAN,P78562\n                         
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Could be involved in the transport of tyrosine, the precursor to melanin synthesis, within the melanocyte. Regulates the pH of melanosome and the melanosome maturation. One of the components of the mammalian pigmentary system. Seems to regulate the post-translational processing of tyrosinase, which catalyzes the limiting reaction in melanin synthesis. May serve as a key control point at which ethnic skin color variation is determined. Major determinant of brown and/or blue eye color.\n                         P_HUMAN,Q04671\n                         
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Plays a role in melanin biosynthesis (PubMed:22556244, PubMed:16704458). Catalyzes the oxidation of 5,6-dihydroxyindole-2-carboxylic acid (DHICA) into indole-5,6-quinone-2-carboxylic acid in the presence of bound Cu(2+) ions, but not in the presence of Zn(2+) (PubMed:28661582). May regulate or influence the type of melanin synthesized (PubMed:22556244, PubMed:16704458). Also to a lower extent, capable of hydroxylating tyrosine and producing melanin (By similarity).\n                         TYRP1_HUMAN,P17643\n                         
## 5 A cytochrome P450 monooxygenase involved in vitamin D metabolism and in calcium and phosphorus homeostasis. Catalyzes the rate-limiting step in the activation of vitamin D in the kidney, namely the hydroxylation of 25-hydroxyvitamin D3/calcidiol at the C1alpha-position to form the hormonally active form of vitamin D3, 1alpha,25-dihydroxyvitamin D3/calcitriol that acts via the vitamin D receptor (VDR) (PubMed:10518789, PubMed:9486994, PubMed:22862690, PubMed:10566658, PubMed:12050193). Has 1alpha-hydroxylase activity on vitamin D intermediates of the CYP24A1-mediated inactivation pathway (PubMed:10518789, PubMed:22862690). Converts 24R,25-dihydroxyvitamin D3/secalciferol to 1-alpha,24,25-trihydroxyvitamin D3, an active ligand of VDR. Also active on 25-hydroxyvitamin D2 (PubMed:10518789). Mechanistically, uses molecular oxygen inserting one oxygen atom into a substrate, and reducing the second into a water molecule, with two electrons provided by NADPH via FDXR/adrenodoxin reductase and FDX1/adrenodoxin (PubMed:22862690).\n                         CP27B_HUMAN,O15528\n                         
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Nuclear receptor for calcitriol, the active form of vitamin D3 which mediates the action of this vitamin on cells (PubMed:28698609, PubMed:16913708, PubMed:15728261, PubMed:10678179). Enters the nucleus upon vitamin D3 binding where it forms heterodimers with the retinoid X receptor/RXR (PubMed:28698609). The VDR-RXR heterodimers bind to specific response elements on DNA and activate the transcription of vitamin D3-responsive target genes (PubMed:28698609). Plays a central role in calcium homeostasis (By similarity).\n                         VDR_HUMAN,P11473\n                         
##   healthy_mean moderate_mean severe_mean ICU_mean mod_health_foldChange
## 1    0.1176471           0.0       0.375     0.50             0.8823529
## 2   40.9411800           9.5      12.000    15.25             0.2320402
## 3    2.0000000           1.5       4.875     2.25             0.7500000
## 4    0.1176471           1.0       0.500     1.25             8.5000000
## 5    7.8823530           9.5      10.250     8.50             1.2052240
## 6  448.5882000         359.5     463.750   618.00             0.8014031
##   sevr_health_foldChange ICU_health_foldChange   sampleType geneValue
## 1              3.1875000             4.2500000 convalescent         0
## 2              0.2931034             0.3724856 convalescent        24
## 3              2.4375000             1.1250000 convalescent         2
## 4              4.2500000            10.6250000 convalescent         0
## 5              1.3003730             1.0783580 convalescent        11
## 6              1.0337990             1.3776550 convalescent       539

sunData <- merge(HeaderInformation,sunGathered,by.x='CN_new',
                 by.y='sampleType')
write.csv(sunData,'sunData.csv',row.names=FALSE)

An interesting chart showing the raw values of gene expression for the three melanin and Vitamin D genes top ranked by genecards.org was created and shared on Tableau Public Server

We can see some genes decrease with age and others increase, also the gender differences as a color dimension shows changes in gene expression by gender. Take a look at the link above in Tableau and the images below to this chart.

Demographics of Sun Genes

In the first image above, there is one outlier for a 76 year old gene when looking at the melanin gene OCA2 labeled severe8 that has much more of this gene expressed than other severe cases. But this gene’s values tend to fluctuate on a cases by case basis for each severe sample.

Demographics of Sun Genes with Entrez gene summary

Demographics of Sun Genes Female outlier Vitamin D

The details when hovering will be displayed if you select a scatter point to give the severity of the case and corresponding sample in that group. The last image shows with age for both females and males the CYP27B1 increases except for a female in her 60s identified as severe4 in the severity raw values of this gene. And a 54 year old male labeled as severe1 is also not following the increase with age.But otherwise this gene CYP27B1 increases with age in both genders within the severe and ICU severe cases, while decreasing with age in the moderate cases. In healthy cases there is not a recognizable pattern as it varies with age and gender.

Lets not stop at the sun genes, as a massage therapist of more than 14 years of experience and having recently studied for and taken and passed my MBLEx or Massage and Bodywork Licensing Examination, I can tell you there are many fascinating items of the body systems and mineral as well as vitamin dependencies that lead to disease in some people. But when relearning the endocrine system and the hormones related to the pineal, hypothalamus, pituitary, adrenals, thyroid, and pancreas many other vitamins, steroids, and hormones should be looked at in studying these different cases of COVID-19.

We will look at the Vitamin C which helps the body absorb Vitamin D and make calcium in the bone blood, the glucagon that turn glucose into sugar and insulin that lowers glucose in the blood having to do with the pancreas hormones, dopamine that relates to parkinsons disease when the hypothalamus doesn’t produce enough, melatonin that regulates sleep and produced by the pineal gland near the pituitary and hypothalamus in the brain that regulates sleep, estrogen, prolactin, and progesterone regulated by the pituitary gland in the brain in females, testosterone regulated by the males in their testes, and corticosteroids and adrenaline regulated by the adrenals when in sympathetic response of danger in the body. Also, the vitamins that people are commonly told to take in addition to Vitamin C and Vitamin D, such as fish oil or omega 3s, vitamin B12 or zinc, and magnesium mineral.Also, calcitonin, a thyroid hormone that breaks down calcium so that the kidneys don’t get kidney stones nor other healthy problems.

Lets use our data we created earlier with our fold change modified values for Inf, NaN, and 0 when z/0, 0/0,0/z where z is any rational number and the alternate value was replaced in respective case with disease mean, 1, or healthy mean.

dataAllFCs <- read.csv('DATA_FCs_GSE152418-monotonicGenes.csv',sep=',',header=T, na.strings=c('',' ','NA'),
                       stringsAsFactors = F)
colnames(dataAllFCs)

##  [1] "ENSEMBLID"              "convalescent"           "healthy1"              
##  [4] "healthy2"               "healthy3"               "healthy4"              
##  [7] "healthy5"               "healthy6"               "healthy7"              
## [10] "healthy8"               "healthy9"               "healthy10"             
## [13] "healthy11"              "healthy12"              "healthy13"             
## [16] "healthy14"              "healthy15"              "healthy16"             
## [19] "healthy17"              "moderate1"              "moderate2"             
## [22] "moderate3"              "moderate4"              "severe1"               
## [25] "severe2"                "severe3"                "severe4"               
## [28] "severe5"                "severe6"                "severe7"               
## [31] "severe8"                "ICU1"                   "ICU2"                  
## [34] "ICU3"                   "ICU4"                   "healthy_mean"          
## [37] "moderate_mean"          "severe_mean"            "ICU_mean"              
## [40] "mod_health_foldChange"  "sevr_health_foldChange" "ICU_health_foldChange" 
## [43] "monotonicIncrease"      "monotonicDecrease"

That is a lot of genes to combine with our find25genes function and summarise, so lets just get the vitamins, minerals, and hormones we are interested in first, combine those into a data set and then add the HGNC gene symbol and the summaries to plot and discuss our findings.

source('geneCards2.R')

find25genes('vitamin D')
find25genes('melanin')
find25genes('vitamin C')
find25genes('glucose')
find25genes('insulin')
find25genes('glucagon')
find25genes('dopamine')
find25genes('estrogen')
find25genes('progesterone')
find25genes('prolactin')
find25genes('testosterone')
find25genes('calcium')
find25genes('melatonin')
find25genes('vitamin B12')
find25genes('zinc')
find25genes('magnesium')
find25genes('fish oil')
find25genes('omega 3s')
find25genes('adrenaline')
find25genes('corticosteroids')
find25genes('calcitonine')
find25genes('iron')

getProteinGenes('vitamin D')
getProteinGenes('melanin')
getProteinGenes('vitamin C')
getProteinGenes('glucose')
getProteinGenes('insulin')
getProteinGenes('glucagon')
getProteinGenes('dopamine')
getProteinGenes('estrogen')
getProteinGenes('progesterone')
getProteinGenes('prolactin')
getProteinGenes('testosterone')
getProteinGenes('calcium')
getProteinGenes('melatonin')
getProteinGenes('vitamin B12')
getProteinGenes('zinc')
getProteinGenes('magnesium')
getProteinGenes('fish oil')
getProteinGenes('omega 3s')
getProteinGenes('adrenaline')
getProteinGenes('corticosteroids')
getProteinGenes('calcitonine')
getProteinGenes('iron')

vitD <- read.csv('Top25vitamin-ds.csv')
melanin <- read.csv('Top25melanins.csv')
vitC <- read.csv('Top25vitamin-cs.csv')
glucose <- read.csv('Top25glucoses.csv')
insulin <- read.csv('Top25insulins.csv')
glucagon <- read.csv('Top25glucagons.csv')
dopamine <- read.csv('Top25dopamines.csv')
estrogen <- read.csv('Top25estrogens.csv')
progesterone <- read.csv('Top25progesterones.csv')
prolactin <- read.csv('Top25prolactins.csv')
testosterone <- read.csv('Top25testosterones.csv')
calcium <- read.csv('Top25calciums.csv')
melatonin <- read.csv('Top25melatonins.csv')
vitB12 <- read.csv('Top25vitamin-b12s.csv')
zinc <- read.csv('Top25zincs.csv')
magnesium <- read.csv('Top25magnesiums.csv')
fishOil <- read.csv('Top25fish-oils.csv')
omega3s <- read.csv('Top25omega-3ss.csv')
adrenaline <- read.csv('Top25adrenalines.csv')
corticosteroid <- read.csv('Top25corticosteroidss.csv')
calcitonine <- read.csv('Top25calcitonines.csv')
iron <- read.csv('Top25irons.csv')

Lets only take the top 3 from each data frame of mineral, vitamin, or steroid.

vitMinSter <- rbind(vitD[1:3,1:2],melanin[1:3,1:2],
                    vitC[1:3,1:2],glucose[1:3,1:2],
                    insulin[1:3,1:2],glucagon[1:3,1:2],
                    dopamine[1:3,1:2],estrogen[1:3,1:2],
                    progesterone[1:3,1:2],prolactin[1:3,1:2],
                    testosterone[1:3,1:2],calcium[1:3,1:2],
                    calcitonine[1:3,1:2],melatonin[1:3,1:2],
                    vitB12[1:3,1:2],zinc[1:3,1:2],magnesium[1:3,1:2],
                    fishOil[1:3,1:2],omega3s[1:3,1:2],
                    adrenaline[1:3,1:2],iron[1:3,1:2],
                    corticosteroid[1:3,1:2])
head(vitMinSter)

##   proteinType proteinSearched
## 1         VDR       vitamin-d
## 2     CYP27B1       vitamin-d
## 3        PHEX       vitamin-d
## 4         TYR         melanin
## 5       TYRP1         melanin
## 6        OCA2         melanin

Some of the genes associated with one vitamin also associate with another. We will keep them this way for the visualizations or charting.We could make a link analysis with these genes that are associated with other vitamins and minerals, but if not then you should.

Lets now get the gene summaries of these genes.

for (i in vitMinSter$proteinType){
  getSummaries2(i,'protein')
}

getGeneSummaries('protein')

I only want the 2nd through 6th features.

vitMinSterSumms <- read.csv("proteinGeneSummaries_protein.csv"
)

vitMinSterSumms2 <- vitMinSterSumms[,2:6]
head(vitMinSterSumms2)

##      gene       EnsemblID
## 1     VDR ENSG00000111424
## 2 CYP27B1 ENSG00000111012
## 3    PHEX ENSG00000102174
## 4     TYR ENSG00000077498
## 5   TYRP1 ENSG00000107165
## 6    OCA2 ENSG00000104044
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               EntrezSummary
## 1 This gene encodes vitamin D3 receptor, which is a member of the nuclear hormone receptor superfamily of ligand-inducible transcription factors. This receptor also functions as a receptor for the secondary bile acid, lithocholic acid. Downstream targets of vitamin D3 receptor are principally involved in mineral metabolism, though this receptor regulates a variety of other metabolic pathways, such as those involved in immune response and cancer. Mutations in this gene are associated with type II vitamin D-resistant rickets. A single nucleotide polymorphism in the initiation codon results in an alternate translation start site three codons downstream. Alternatively spliced transcript variants encoding different isoforms have been described for this gene. A recent study provided evidence for translational readthrough in this gene, and expression of an additional C-terminally extended isoform via the use of an alternative in-frame translation termination codon. [provided by RefSeq, Jun 2018]
## 2                                                                                                                                                                                                                                          This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins are monooxygenases which catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids. The protein encoded by this gene localizes to the inner mitochondrial membrane where it hydroxylates 25-hydroxyvitamin D3 at the 1alpha position. This reaction synthesizes 1alpha,25-dihydroxyvitamin D3, the active form of vitamin D3, which binds to the vitamin D receptor and regulates calcium metabolism. Thus this enzyme regulates the level of biologically active vitamin D and plays an important role in calcium homeostasis. Mutations in this gene can result in vitamin D-dependent rickets type I. [provided by RefSeq, Jul 2008]
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               The protein encoded by this gene is a transmembrane endopeptidase that belongs to the type II integral membrane zinc-dependent endopeptidase family. The protein is thought to be involved in bone and dentin mineralization and renal phosphate reabsorption. Mutations in this gene cause X-linked hypophosphatemic rickets. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Sep 2013]
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      The enzyme encoded by this gene catalyzes the first 2 steps, and at least 1 subsequent step, in the conversion of tyrosine to melanin. The enzyme has both tyrosine hydroxylase and dopa oxidase catalytic activities, and requires copper for function. Mutations in this gene result in oculocutaneous albinism, and nonpathologic polymorphisms result in skin pigmentation variation. The human genome contains a pseudogene similar to the 3' half of this gene. [provided by RefSeq, Oct 2008]
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   This gene encodes a melanosomal enzyme that belongs to the tyrosinase family and plays an important role in the melanin biosynthetic pathway. Defects in this gene are the cause of rufous oculocutaneous albinism and oculocutaneous albinism type III. [provided by RefSeq, Mar 2009]
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                           This gene encodes the human homolog of the mouse p (pink-eyed dilution) gene. The encoded protein is believed to be an integral membrane protein involved in small molecule transport, specifically tyrosine, which is a precursor to melanin synthesis. It is involved in mammalian pigmentation, where it may control skin color variation and act as a determinant of brown or blue eye color. Mutations in this gene result in type 2 oculocutaneous albinism. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jul 2014]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           GeneCardsSummary
## 1                                                                                                                                        VDR (Vitamin D Receptor) is a Protein Coding gene.                                            Diseases associated with VDR include Vitamin D-Dependent Rickets, Type 2A and Rickets.                                            Among its related pathways are Development_Hedgehog and PTH signaling pathways in bone and cartilage development and Tuberculosis.                                            Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and steroid hormone receptor activity.                                            An important paralog of this gene is NR1I2.
## 2                                             CYP27B1 (Cytochrome P450 Family 27 Subfamily B Member 1) is a Protein Coding gene.                                            Diseases associated with CYP27B1 include Vitamin D Hydroxylation-Deficient Rickets, Type 1A and Hypocalcemic Vitamin D-Dependent Rickets.                                            Among its related pathways are Cytochrome P450 - arranged by substrate type and Tuberculosis.                                            Gene Ontology (GO) annotations related to this gene include iron ion binding and oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen.                                            An important paralog of this gene is CYP27A1.
## 3                                                                                                                                                                                                                 PHEX (Phosphate Regulating Endopeptidase Homolog X-Linked) is a Protein Coding gene.                                            Diseases associated with PHEX include Hypophosphatemic Rickets, X-Linked Dominant and Hypophosphatemic Rickets, X-Linked Recessive.                                                                                        Gene Ontology (GO) annotations related to this gene include metalloendopeptidase activity and aminopeptidase activity.                                            An important paralog of this gene is MMEL1.
## 4                                                                                                                                                                                          TYR (Tyrosinase) is a Protein Coding gene.                                            Diseases associated with TYR include Albinism, Oculocutaneous, Type Ia and Albinism, Oculocutaneous, Type Ib.                                            Among its related pathways are (S)-reticuline biosynthesis and Tyrosine metabolism.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity.                                            An important paralog of this gene is TYRP1.
## 5                                                                                                                                               TYRP1 (Tyrosinase Related Protein 1) is a Protein Coding gene.                                            Diseases associated with TYRP1 include Albinism, Oculocutaneous, Type Iii and Skin/Hair/Eye Pigmentation, Variation In, 11.                                            Among its related pathways are Aldosterone synthesis and secretion and Viral mRNA Translation.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity.                                            An important paralog of this gene is DCT.
## 6                                                                                                                                                     OCA2 (OCA2 Melanosomal Transmembrane Protein) is a Protein Coding gene.                                            Diseases associated with OCA2 include Albinism, Oculocutaneous, Type Ii and Skin/Hair/Eye Pigmentation, Variation In, 1.                                            Among its related pathways are Viral mRNA Translation and Metabolism.                                            Gene Ontology (GO) annotations related to this gene include transporter activity and L-tyrosine transmembrane transporter activity.                                            An important paralog of this gene is SLC13A2.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Nuclear receptor for calcitriol, the active form of vitamin D3 which mediates the action of this vitamin on cells (PubMed:28698609, PubMed:16913708, PubMed:15728261, PubMed:10678179). Enters the nucleus upon vitamin D3 binding where it forms heterodimers with the retinoid X receptor/RXR (PubMed:28698609). The VDR-RXR heterodimers bind to specific response elements on DNA and activate the transcription of vitamin D3-responsive target genes (PubMed:28698609). Plays a central role in calcium homeostasis (By similarity).\n                         VDR_HUMAN,P11473\n                         
## 2 A cytochrome P450 monooxygenase involved in vitamin D metabolism and in calcium and phosphorus homeostasis. Catalyzes the rate-limiting step in the activation of vitamin D in the kidney, namely the hydroxylation of 25-hydroxyvitamin D3/calcidiol at the C1alpha-position to form the hormonally active form of vitamin D3, 1alpha,25-dihydroxyvitamin D3/calcitriol that acts via the vitamin D receptor (VDR) (PubMed:10518789, PubMed:9486994, PubMed:22862690, PubMed:10566658, PubMed:12050193). Has 1alpha-hydroxylase activity on vitamin D intermediates of the CYP24A1-mediated inactivation pathway (PubMed:10518789, PubMed:22862690). Converts 24R,25-dihydroxyvitamin D3/secalciferol to 1-alpha,24,25-trihydroxyvitamin D3, an active ligand of VDR. Also active on 25-hydroxyvitamin D2 (PubMed:10518789). Mechanistically, uses molecular oxygen inserting one oxygen atom into a substrate, and reducing the second into a water molecule, with two electrons provided by NADPH via FDXR/adrenodoxin reductase and FDX1/adrenodoxin (PubMed:22862690).\n                         CP27B_HUMAN,O15528\n                         
## 3                                                                                                                                                                                                                                                                                                                                                           Peptidase that cleaves SIBLING (small integrin-binding ligand, N-linked glycoprotein)-derived ASARM peptides, thus regulating their biological activity (PubMed:9593714, PubMed:15664000, PubMed:18162525, PubMed:18597632). Cleaves ASARM peptides between Ser and Glu or Asp residues (PubMed:18597632). Regulates osteogenic cell differentiation and bone mineralization through the cleavage of the MEPE-derived ASARM peptide (PubMed:18597632). Promotes dentin mineralization and renal phosphate reabsorption by cleaving DMP1- and MEPE-derived ASARM peptides (PubMed:18597632, PubMed:18162525). Inhibits the cleavage of MEPE by CTSB/cathepsin B thus preventing MEPE degradation (PubMed:12220505).\n                         PHEX_HUMAN,P78562\n                         
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    This is a copper-containing oxidase that functions in the formation of pigments such as melanins and other polyphenolic compounds. Catalyzes the initial and rate limiting step in the cascade of reactions leading to melanin production from tyrosine. In addition to hydroxylating tyrosine to DOPA (3,4-dihydroxyphenylalanine), also catalyzes the oxidation of DOPA to DOPA-quinone, and possibly the oxidation of DHI (5,6-dihydroxyindole) to indole-5,6 quinone.\n                         TYRO_HUMAN,P14679\n                         
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Plays a role in melanin biosynthesis (PubMed:22556244, PubMed:16704458). Catalyzes the oxidation of 5,6-dihydroxyindole-2-carboxylic acid (DHICA) into indole-5,6-quinone-2-carboxylic acid in the presence of bound Cu(2+) ions, but not in the presence of Zn(2+) (PubMed:28661582). May regulate or influence the type of melanin synthesized (PubMed:22556244, PubMed:16704458). Also to a lower extent, capable of hydroxylating tyrosine and producing melanin (By similarity).\n                         TYRP1_HUMAN,P17643\n                         
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        Could be involved in the transport of tyrosine, the precursor to melanin synthesis, within the melanocyte. Regulates the pH of melanosome and the melanosome maturation. One of the components of the mammalian pigmentary system. Seems to regulate the post-translational processing of tyrosinase, which catalyzes the limiting reaction in melanin synthesis. May serve as a key control point at which ethnic skin color variation is determined. Major determinant of brown and/or blue eye color.\n                         P_HUMAN,Q04671\n

Combine the vitamin searched with the gene from the last two data frames.

vitamins <- merge(vitMinSter,vitMinSterSumms2,
                  by.x='proteinType',
                  by.y='gene')
vitamins2 <- vitamins[!duplicated(vitamins),]
head(vitamins2)

##   proteinType proteinSearched       EnsemblID
## 1       AANAT       melatonin ENSG00000129673
## 2        ANKH         calcium ENSG00000154122
## 3       APOA1        fish-oil ENSG00000118137
## 4        APOB        fish-oil ENSG00000084674
## 5     CACNA1B        omega-3s ENSG00000148408
## 6       CALCA     calcitonine ENSG00000110680
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   EntrezSummary
## 1                                                                                                                                                                                   The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                           This gene encodes a multipass transmembrane protein that is expressed in joints and other tissues and controls pyrophosphate levels in cultured cells. Progressive ankylosis-mediated control of pyrophosphate levels has been suggested as a possible mechanism regulating tissue calcification and susceptibility to arthritis in higher animals. Mutations in this gene have been associated with autosomal dominant craniometaphyseal dysplasia. [provided by RefSeq, Jul 2008]
## 3                                                                                                                     This gene encodes apolipoprotein A-I, which is the major protein component of high density lipoprotein (HDL) in plasma. The encoded preproprotein is proteolytically processed to generate the mature protein, which promotes cholesterol efflux from tissues to the liver for excretion, and is a cofactor for lecithin cholesterolacyltransferase (LCAT), an enzyme responsible for the formation of most plasma cholesteryl esters. This gene is closely linked with two other apolipoprotein genes on chromosome 11. Defects in this gene are associated with HDL deficiencies, including Tangier disease, and with systemic non-neuropathic amyloidosis. Alternative splicing results in multiple transcript variants, at least one of which encodes a preproprotein. [provided by RefSeq, Dec 2015]
## 4 This gene product is the main apolipoprotein of chylomicrons and low density lipoproteins (LDL), and is the ligand for the LDL receptor. It occurs in plasma as two main isoforms, apoB-48 and apoB-100: the former is synthesized exclusively in the gut and the latter in the liver. The intestinal and the hepatic forms of apoB are encoded by a single gene from a single, very long mRNA. The two isoforms share a common N-terminal sequence. The shorter apoB-48 protein is produced after RNA editing of the apoB-100 transcript at residue 2180 (CAA->UAA), resulting in the creation of a stop codon, and early translation termination. Mutations in this gene or its regulatory region cause hypobetalipoproteinemia, normotriglyceridemic hypobetalipoproteinemia, and hypercholesterolemia due to ligand-defective apoB, diseases affecting plasma cholesterol and apoB levels. [provided by RefSeq, Dec 2019]
## 5                                                                                                                                                                                                                                                                                                                                                                                                    The protein encoded by this gene is the pore-forming subunit of an N-type voltage-dependent calcium channel, which controls neurotransmitter release from neurons. The encoded protein forms a complex with alpha-2, beta, and delta subunits to form the high-voltage activated channel. This channel is sensitive to omega-conotoxin-GVIA and omega-agatoxin-IIIA but insensitive to dihydropyridines. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Aug 2011]
## 6                                                                                                                                                                                                                                                                                                                                                  This gene encodes the peptide hormones calcitonin, calcitonin gene-related peptide and katacalcin by tissue-specific alternative RNA splicing of the gene transcripts and cleavage of inactive precursor proteins. Calcitonin is involved in calcium regulation and acts to regulate phosphorus metabolism. Calcitonin gene-related peptide functions as a vasodilator and as an antimicrobial peptide while katacalcin is a calcium-lowering peptide. Multiple transcript variants encoding different isoforms have been found for this gene.[provided by RefSeq, Aug 2014]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                GeneCardsSummary
## 1                                                                                                                                                                                                                    AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene.                                            Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome.                                            Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism.                                            Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.                                            
## 2                                             ANKH (ANKH Inorganic Pyrophosphate Transport Regulator) is a Protein Coding gene.                                            Diseases associated with ANKH include Craniometaphyseal Dysplasia, Autosomal Dominant and Chondrocalcinosis 2.                                            Among its related pathways are Transport of glucose and other sugars, bile salts and organic acids, metal ions and amine compounds and Miscellaneous transport and binding events.                                            Gene Ontology (GO) annotations related to this gene include inorganic phosphate transmembrane transporter activity and inorganic diphosphate transmembrane transporter activity.                                            
## 3                                                                                                                                                                                                             APOA1 (Apolipoprotein A1) is a Protein Coding gene.                                            Diseases associated with APOA1 include Hypoalphalipoproteinemia, Primary, 2 and Amyloidosis, Familial Visceral.                                            Among its related pathways are Lipoprotein metabolism and Folate Metabolism.                                            Gene Ontology (GO) annotations related to this gene include identical protein binding and lipid binding.                                            An important paralog of this gene is APOA4.
## 4                                                                                                                                                                                                                                                                APOB (Apolipoprotein B) is a Protein Coding gene.                                            Diseases associated with APOB include Hypobetalipoproteinemia, Familial, 1 and Hypercholesterolemia, Familial, 2.                                            Among its related pathways are Activated TLR4 signalling and Lipoprotein metabolism.                                            Gene Ontology (GO) annotations related to this gene include binding and heparin binding.                                            
## 5                                                                                    CACNA1B (Calcium Voltage-Gated Channel Subunit Alpha1 B) is a Protein Coding gene.                                            Diseases associated with CACNA1B include Neurodevelopmental Disorder With Seizures And Nonepileptic Hyperkinetic Movements and Undetermined Early-Onset Epileptic Encephalopathy.                                            Among its related pathways are Nicotine addiction and ADP signalling through P2Y purinoceptor 12.                                            Gene Ontology (GO) annotations related to this gene include calcium ion binding and ion channel activity.                                            An important paralog of this gene is CACNA1A.
## 6                                                                                                                                                                                                                                             CALCA (Calcitonin Related Polypeptide Alpha) is a Protein Coding gene.                                            Diseases associated with CALCA include Reflex Sympathetic Dystrophy and Spinal Stenosis.                                            Among its related pathways are Neuroscience and Signaling by GPCR.                                            Gene Ontology (GO) annotations related to this gene include identical protein binding.                                            An important paralog of this gene is CALCB.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n                         SNAT_HUMAN,Q16613\n                         
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       Regulates intra- and extracellular levels of inorganic pyrophosphate (PPi), probably functioning as PPi transporter.\n                         ANKH_HUMAN,Q9HCJ1\n                         
## 3                                                                                                                                                                                                                                                                                                                                                                                                                   Participates in the reverse transport of cholesterol from tissues to the liver for excretion by promoting cholesterol efflux from tissues and by acting as a cofactor for the lecithin cholesterol acyltransferase (LCAT). As part of the SPAP complex, activates spermatozoa motility.\n                         APOA1_HUMAN,P02647\n                         
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                        Apolipoprotein B is a major protein constituent of chylomicrons (apo B-48), LDL (apo B-100) and VLDL (apo B-100). Apo B-100 functions as a recognition signal for the cellular binding and internalization of LDL particles by the apoB/E receptor.\n                         APOB_HUMAN,P04114\n                         
## 5 Voltage-sensitive calcium channels (VSCC) mediate the entry of calcium ions into excitable cells and are also involved in a variety of calcium-dependent processes, including muscle contraction, hormone or neurotransmitter release, gene expression, cell motility, cell division and cell death. The isoform alpha-1B gives rise to N-type calcium currents. N-type calcium channels belong to the 'high-voltage activated' (HVA) group and are specifically blocked by omega-conotoxin-GVIA (AC P01522) (AC P01522) (By similarity). They are however insensitive to dihydropyridines (DHP). Calcium channels containing alpha-1B subunit may play a role in directed migration of immature neurons.\n                         CAC1B_HUMAN,Q00975\n                         
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                              CGRP induces vasodilation. It dilates a variety of vessels including the coronary, cerebral and systemic vasculature. Its abundance in the CNS also points toward a neurotransmitter or neuromodulator role. It also elevates platelet cAMP.\n                         CALCA_HUMAN,P06881\n

Lets now combine this dataframe with the Ensembl ID of the dataAllFCs dataframe.

vitaminsDF <- merge(vitamins2,dataAllFCs,
                    by.x='EnsemblID',
                    by.y='ENSEMBLID')
head(vitaminsDF)

##         EnsemblID proteinType proteinSearched
## 1 ENSG00000004948       CALCR     calcitonine
## 2 ENSG00000007171        NOS2        omega-3s
## 3 ENSG00000010704         HFE            iron
## 4 ENSG00000036828        CASR         calcium
## 5 ENSG00000064835      POU1F1       prolactin
## 6 ENSG00000064989      CALCRL     calcitonine
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         EntrezSummary
## 1                                                                                                                                                       This gene encodes a high affinity receptor for the peptide hormone calcitonin and belongs to a subfamily of seven transmembrane-spanning G protein-coupled receptors. The encoded protein is involved in maintaining calcium homeostasis and in regulating osteoclast-mediated bone resorption. Polymorphisms in this gene have been associated with variations in bone mineral density and onset of osteoporosis. Alternate splicing results in multiple transcript variants. [provided by RefSeq, Sep 2009]
## 2                                                                                                                                                                                                      Nitric oxide is a reactive free radical which acts as a biologic mediator in several processes, including neurotransmission and antimicrobial and antitumoral activities. This gene encodes a nitric oxide synthase which is expressed in liver and is inducible by a combination of lipopolysaccharide and certain cytokines. Three related pseudogenes are located within the Smith-Magenis syndrome region on chromosome 17. [provided by RefSeq, Jul 2008]
## 3                    The protein encoded by this gene is a membrane protein that is similar to MHC class I-type proteins and associates with beta2-microglobulin (beta2M). It is thought that this protein functions to regulate iron absorption by regulating the interaction of the transferrin receptor with transferrin. The iron storage disorder, hereditary haemochromatosis, is a recessive genetic disorder that results from defects in this gene. At least nine alternatively spliced variants have been described for this gene. Additional variants have been found but their full-length nature has not been determined. [provided by RefSeq, Jul 2008]
## 4                                                                                    The protein encoded by this gene is a plasma membrane G protein-coupled receptor that senses small changes in circulating calcium concentration. The encoded protein couples this information to intracellular signaling pathways that modify parathyroid hormone secretion or renal cation handling, and thus this protein plays an essential role in maintaining mineral ion homeostasis. Mutations in this gene are a cause of familial hypocalciuric hypercalcemia, neonatal severe hyperparathyroidism, and autosomal dominant hypocalcemia. [provided by RefSeq, Aug 2017]
## 5                                                                                                                                                                                                                                              This gene encodes a member of the POU family of transcription factors that regulate mammalian development. The protein regulates expression of several genes involved in pituitary development and hormone expression. Mutations in this genes result in combined pituitary hormone deficiency. Multiple transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Jul 2008]
## 6                                             CALCRL (Calcitonin Receptor Like Receptor) is a Protein Coding gene.                                            Diseases associated with CALCRL include Lymphatic Malformation 8 and Primary Angle-Closure Glaucoma.                                            Among its related pathways are Signaling by GPCR and G alpha (s) signalling events.                                            Gene Ontology (GO) annotations related to this gene include G protein-coupled receptor activity and protein transporter activity.                                            An important paralog of this gene is CALCR.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          GeneCardsSummary
## 1                                                                             CALCR (Calcitonin Receptor) is a Protein Coding gene.                                            Diseases associated with CALCR include Osteoporosis and Bone Mineral Density Quantitative Trait Locus 15.                                            Among its related pathways are Signaling by GPCR and G alpha (s) signalling events.                                            Gene Ontology (GO) annotations related to this gene include G protein-coupled receptor activity and transmembrane signaling receptor activity.                                            An important paralog of this gene is CALCRL.
## 2                                                                                                                                             NOS2 (Nitric Oxide Synthase 2) is a Protein Coding gene.                                            Diseases associated with NOS2 include Malaria and Meningioma, Radiation-Induced.                                            Among its related pathways are Tuberculosis and VEGF Signaling.                                            Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity.                                            An important paralog of this gene is NOS1.
## 3                                                                                                           HFE (Homeostatic Iron Regulator) is a Protein Coding gene.                                            Diseases associated with HFE include Hemochromatosis, Type 1 and Transferrin Serum Level Quantitative Trait Locus 2.                                            Among its related pathways are Hfe effect on hepcidin production.                                            Gene Ontology (GO) annotations related to this gene include signaling receptor binding and peptide antigen binding.                                            An important paralog of this gene is HLA-A.
## 4                                                                                                   CASR (Calcium Sensing Receptor) is a Protein Coding gene.                                            Diseases associated with CASR include Hypocalcemia, Autosomal Dominant 1 and Hyperparathyroidism, Neonatal Severe.                                            Among its related pathways are RET signaling and Signaling by GPCR.                                            Gene Ontology (GO) annotations related to this gene include G protein-coupled receptor activity and protein kinase binding.                                            An important paralog of this gene is GPRC6A.
## 5                                             POU1F1 (POU Class 1 Homeobox 1) is a Protein Coding gene.                                            Diseases associated with POU1F1 include Pituitary Hormone Deficiency, Combined, 1 and Isolated Growth Hormone Deficiency, Type Ii.                                            Among its related pathways are Relaxin signaling pathway and Glucocorticoid receptor regulatory network.                                            Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and chromatin binding.                                            An important paralog of this gene is POU3F3.
## 6                                                                                 CALCRL (Calcitonin Receptor Like Receptor) is a Protein Coding gene.                                            Diseases associated with CALCRL include Lymphatic Malformation 8 and Primary Angle-Closure Glaucoma.                                            Among its related pathways are Signaling by GPCR and G alpha (s) signalling events.                                            Gene Ontology (GO) annotations related to this gene include G protein-coupled receptor activity and protein transporter activity.                                            An important paralog of this gene is CALCR.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    UniProtKB_Summary
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  This is a receptor for calcitonin. The activity of this receptor is mediated by G proteins which activate adenylyl cyclase. The calcitonin receptor is thought to couple to the heterotrimeric guanosine triphosphate-binding protein that is sensitive to cholera toxin.\n                         CALCR_HUMAN,P30988\n                         
## 2                                                                                                                                                                                                                        Produces nitric oxide (NO) which is a messenger molecule with diverse functions throughout the body (PubMed:7531687, PubMed:7544004). In macrophages, NO mediates tumoricidal and bactericidal actions. Also has nitrosylase activity and mediates cysteine S-nitrosylation of cytoplasmic target proteins such PTGS2/COX2 (By similarity). As component of the iNOS-S100A8/9 transnitrosylase complex involved in the selective inflammatory stimulus-dependent S-nitrosylation of GAPDH on 'Cys-247' implicated in regulation of the GAIT complex activity and probably multiple targets including ANXA5, EZR, MSN and VIM (PubMed:25417112). Involved in inflammation, enhances the synthesis of proinflammatory mediators such as IL6 and IL8 (PubMed:19688109).\n                         NOS2_HUMAN,P35228\n                         
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    Binds to transferrin receptor (TFR) and reduces its affinity for iron-loaded transferrin.\n                         HFE_HUMAN,Q30201\n                         
## 4 G-protein-coupled receptor that senses changes in the extracellular concentration of calcium ions and plays a key role in maintaining calcium homeostasis (PubMed:7759551, PubMed:8702647, PubMed:8636323, PubMed:8878438, PubMed:17555508, PubMed:19789209, PubMed:21566075, PubMed:22114145, PubMed:23966241, PubMed:25292184, PubMed:25104082, PubMed:26386835, PubMed:25766501, PubMed:22789683). Senses fluctuations in the circulating calcium concentration and modulates the production of parathyroid hormone (PTH) in parathyroid glands (By similarity). The activity of this receptor is mediated by a G-protein that activates a phosphatidylinositol-calcium second messenger system (PubMed:7759551). The G-protein-coupled receptor activity is activated by a co-agonist mechanism: aromatic amino acids, such as Trp or Phe, act concertedly with divalent cations, such as calcium or magnesium, to achieve full receptor activation (PubMed:27434672, PubMed:27386547).\n                         CASR_HUMAN,P41180\n                         
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Transcription factor involved in the specification of the lactotrope, somatotrope, and thyrotrope phenotypes in the developing anterior pituitary. Specifically binds to the consensus sequence 5'-TAAAT-3'. Activates growth hormone and prolactin genes (PubMed:22010633, PubMed:26612202).\n                         PIT1_HUMAN,P28069\n                         
## 6                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Receptor for calcitonin-gene-related peptide (CGRP) together with RAMP1 and receptor for adrenomedullin together with RAMP3 (By similarity). Receptor for adrenomedullin together with RAMP2. The activity of this receptor is mediated by G proteins which activate adenylyl cyclase.\n                         CALRL_HUMAN,Q16602\n                         
##   convalescent healthy1 healthy2 healthy3 healthy4 healthy5 healthy6 healthy7
## 1            0        2        3        0        0        1        2        4
## 2            3        0        4        0        0        1        2        2
## 3           89       98      146      115       93      149      178       97
## 4            2        7        6        4        1        0        2        5
## 5            0        0        0        0        1        0        2        0
## 6           72       27       50       41       28       78      171       61
##   healthy8 healthy9 healthy10 healthy11 healthy12 healthy13 healthy14 healthy15
## 1        1        0         2         0         2         3         0         2
## 2        4        5         3         0         6         2         1         7
## 3       81      121       111        83        64        68       102        80
## 4        5        1         2         4         2         6         4         2
## 5        0        0         0         0         2         0         0         0
## 6       71       70        59        67        73        66        75        79
##   healthy16 healthy17 moderate1 moderate2 moderate3 moderate4 severe1 severe2
## 1         0         0         3         1         1         0       3       0
## 2         1         2         1         1         3         5       6       2
## 3        71        43        46        65        67        33      79      78
## 4         0         2         5         3         5         5       6       2
## 5         0         0         0         0         0         1       0       0
## 6        56        61        37        10        14        32      23      37
##   severe3 severe4 severe5 severe6 severe7 severe8 ICU1 ICU2 ICU3 ICU4
## 1       0       4       0       0       1       1    0    3    3    1
## 2       5       2       1       0       2       1    0   11    4    1
## 3      57     104      94      95      90      60   51   86   71   65
## 4       5       5       4       4       6       2    0    5   16    4
## 5       0       0       1       1       0       0    0    0    3    0
## 6      47     149      58      71      54      21   43   36   36   37
##   healthy_mean moderate_mean severe_mean ICU_mean mod_health_foldChange
## 1    1.2941176          1.25       1.125     1.75             0.9659091
## 2    2.3529412          2.50       2.375     4.00             1.0625000
## 3  100.0000000         52.75      82.125    68.25             0.5275000
## 4    3.1176471          4.50       4.250     6.25             1.4433962
## 5    0.2941176          0.25       0.250     0.75             0.8500000
## 6   66.6470588         23.25      57.500    38.00             0.3488526
##   sevr_health_foldChange ICU_health_foldChange monotonicIncrease
## 1              0.8693182             1.3522727                 0
## 2              1.0093750             1.7000000                 0
## 3              0.8212500             0.6825000                 0
## 4              1.3632075             2.0047170                 0
## 5              0.8500000             2.5500000                 0
## 6              0.8627538             0.5701677                 0
##   monotonicDecrease
## 1                 0
## 2                 0
## 3                 0
## 4                 0
## 5                 0
## 6                 0

We are ready to make are charts from this data to analyze any pattern, relationships, outlier samples, and make hypothesese on. Lets write this data frame out to csv.

write.csv(vitaminsDF,'vitaminsDF.csv',row.names=FALSE)

Lets also attach the header demographic information and tidy the sample types into one column.

vitDF <- gather(vitaminsDF,key='sampleName',value='sampleValue',7:40)

vitDF2 <- merge(HeaderInformation,vitDF,by.x='CN_new',
                by.y='sampleName')

write.csv(vitDF2,'vitaminsTidyDF.csv',row.names=F)

I have created some visualizations that were uploaded to Tableau Public Server to view for these genes as a bar with fold change values, those genes that are only increasing in fold change by severity, and then those that are only decreasing in fold change by severity from least to worst or moderate to severe to ICU. The scatter plot has the age and gender as added dimensions, and all summaries of the genes are displayed when hovering over that feature.

<a href=‘https://public.tableau.com/profile/janis5126#!/vizhome/vitMinSterAllGenesBarchart/vitMinSterAllGenesBarchart?publish=yes’,target=‘blank’>All genes as a bar chart of fold change values then a scatter with raw values as a comparision with age and gender and sample ID:

barchart all Vitamin Genes’ Fold Change Values

A scatter plot by age, gender, and raw sample ID values with the gene and gene summary information:

vitamin Genes’ scatter by age and gender

A bar chart of those genes that are in the vitamin body systems as a bar chart of fold change values by severity of COVID-19 for those genes that are only increasing in fold change from lowest to worst severity.

bar chart monotonically increasing vitamin genes

scatter plot of the monotonically increasing vitamin genes

bar chart of those vitamin genes that are monotonically decreasing

Scatter plot of those vitamin genes monotonically decreasing

scatter plot of those vitamin genes monotonically decreasing

There are many relationships that can be seen and understood by looking at the gene trends within the proteins and those increasing or decreasing by severity of COVID-19 when viewing the interactive plots in Tableau at the links provided before each image above.

After adding the iron genes to our data, it was charted by fold change values and also by raw values, age, gender, and severity of COVID-19 disease.

Scatter of gene values by severity, age, and gender

scatter of genes

The image above is the chart of genes in Tableau that has only the iron actual raw values for its top 3 ranked genes in our samples. We can see that the first gene is much higher over all in expression in healthy samples, and that in all three grades of severity of COVID-19 the iron levels are on the smaller range decreasing with age in both genders and all cases excep the ICU grade for females where it increased with age. The 2nd iron gene is similar except that iron increases with age in females. The last iron gene shows that it is lower in healthy samples and than the other two iron genes, but in ICU grades of COVID-19 it is increased.

non-monotonic gene fold changes-iron

nonMonotonic Genes-Iron

This image above is of those iron genes that aren’t monotonically increasing or decreasing. There fold change values are shown for each gene and all are under expressed in the disease state compared to the healthy state. We created the rules for these fold change values so that if the disease state mean was 0 and the healthy mean was not 0, the fold change would be 1-healthy mean, and if both the disease state and the healthy state means were 0 then the fold change value would be 1, and finally if the healthy mean was 0 but the disease mean was not 0 the value would be the disease state. And if the fold change was not 0, Inf, or NaN, then the value would stay as is from the disease state/healthy state ratio of mean values. We can see from the image of the Tableau chart, that the iron gene was clicked on and selected to ‘keep only’ so that only the iron genes are displayed among those nonmonotonic. HFE and SLC11A2 are shown, and the ICU grade of COVID-19 severity is lower than the severe grade of COVID-19, but the moderate grade is lower than the ICU grade in the HFE gene and higher in the SLC11A2 gene. SLC11A2 is involved in transporter activity and copper ion binding and associated with anemia (not enough red blood cells). While HFE is associated with hemochromatosis (too many red blood cells) and signaling receptor binding as well as peptide antigen binding.

monotonically decreasing fold change values-iron

monotonic increasing fold changes-iron

The image above is of those genes monotonically increasing but with iron selected as the ‘keep only’ gene to be displayed. We can see the description of this gene that is one of the top three ranked iron genes, SLC40A1, as associated with hemochromatosis Type I and Type 4, iron ion transmembrane transporter activity, and its related pathways include mineral absorption and insulin receptor recycling.

monotonically increasing fold change values-iron not included

monotonically increasing fold changes

We already have this chart, because it is the monotonically increasing genes from least to most severe grades of COVID-19, except that iron was added to the data set and coincidentally not in this chart because its three genes are split between the monotonically decreasing and the genes nonmonotonic. Look at the genes that are higher in the least severe case and dramatically lower in the most severe case. Most are hormones, but one is a fish oil gene and a couple of vitamin B12 genes that aren’t dramatically higher in the moderate compared to the ICU grade of COVID-19. Estrogen, progesterone, and testosterone are gender related hormones. The testosterone gene GNRH1 is produced in the hypothalamus which lies in front of the pituitary gland that produces estrogen and progesterone. Synthetic progesterone is consumed when eating legumes and yams. Adrenaline is the sympathetic hormone produced by the adrenal glands and released in times of extreme stress or fight or flight threats to personal safety. If you explore the chart further you will see that TNF is a corticosteroid and vitamin C gene that is dramatically higher in moderate cases than in ICU cases of COVID-19.

monotonicGenes

Janis Corona

8/14/2020