GSE152418: 34 samples, 18 females and 16 males, aged 23-91. There is one convalescent sample, and the COVID-19 samples are spread across the moderate, severe, and ICU grades of symptoms. The earlier files showing how this data was obtained are ‘covid19_GSE152418_76patients_1.Rmd’, with ‘2’ and ‘3’ as updated versions. The file name says 76 patients because I recopied some code from a prior study on GSE151161, which looked at COVID-19 symptoms similar to those of Rheumatoid Arthritis patients and had 76 samples but only 38 patients.
(8/22/2020) Note: most of this document assumed that the convalescent sample was a misclassified convalescent-home patient and that all samples were convalescent-home patients, but this is completely wrong. Convalescent is a medical term for someone who is healing from, or has healed from or survived, a disease and has therefore generated antibodies. This sample is classified as healthy in the ML done mid-document, because technically it is, and its gene expression values fall within the healthy range. This adds a new angle for looking at the data, especially the scatter charts by age, because we now know that the vitamin and hormone genes looked up are not influenced by convalescent homes giving the patients medicine. These are regular patients (presumably ordinary citizens, not incarcerated) who agreed to have their blood drawn and studied. I only caught this because, while exploring other diseases for new projects (Lyme disease and ticks), I sat through a very annoying video whose summary used ‘convalescing blood’ or something similar, which made me look the word up. The double meaning of the word completely obscured what the label meant.
This script is meant to test a topic modeling algorithm, using the lda method in caret (which is linear discriminant analysis, as noted further down), on these wide matrices of genes once we convert them to that shape. Caret will be used for the lda model.
We can use our find25genes function and the other functions that the geneCards2.R script has.
library(caret)
library(dplyr)
library(tidyr)
monotonicIncrease <- read.csv('monotonicIncrease.csv',sep=',',
na.strings=c('',' ','NA'),header=T, stringsAsFactors = F)
monotonicDecrease <- read.csv('monotonicDecrease.csv',sep=',',
na.strings=c('',' ','NA'),header=T, stringsAsFactors = F)
HeaderInformation <- read.csv('HeaderInformation.csv',sep=',',
header=T, na.strings=c('',' ','NA'),
stringsAsFactors = F)
head(HeaderInformation)
## GSM_ID CN_old CN_new age gender diseaseState severity
## 1 GSM4615003 S061_257 healthy1 25 M Healthy Healthy
## 2 GSM4615006 S062_258 healthy2 70 F Healthy Healthy
## 3 GSM4615008 S063_259 healthy3 68 F Healthy Healthy
## 4 GSM4615011 S064_260 healthy4 69 M Healthy Healthy
## 5 GSM4615014 S065_261 healthy5 29 F Healthy Healthy
## 6 GSM4615016 S066_265 healthy6 90 M Healthy Healthy
## geographicLocation cellType
## 1 Atlanta, GA, USA PBMC
## 2 Atlanta, GA, USA PBMC
## 3 Atlanta, GA, USA PBMC
## 4 Atlanta, GA, USA PBMC
## 5 Atlanta, GA, USA PBMC
## 6 Atlanta, GA, USA PBMC
monotonicIncreaseOrder10 <- monotonicIncrease[order(monotonicIncrease$ICU_health_foldChange,decreasing=T)[1:10],]
monotonicDecreaseOrder10 <- monotonicDecrease[order(monotonicDecrease$ICU_health_foldChange, decreasing=T)[1:10],]
colnames(monotonicIncreaseOrder10)
## [1] "ENSEMBLID" "convalescent" "healthy1"
## [4] "healthy2" "healthy3" "healthy4"
## [7] "healthy5" "healthy6" "healthy7"
## [10] "healthy8" "healthy9" "healthy10"
## [13] "healthy11" "healthy12" "healthy13"
## [16] "healthy14" "healthy15" "healthy16"
## [19] "healthy17" "moderate1" "moderate2"
## [22] "moderate3" "moderate4" "severe1"
## [25] "severe2" "severe3" "severe4"
## [28] "severe5" "severe6" "severe7"
## [31] "severe8" "ICU1" "ICU2"
## [34] "ICU3" "ICU4" "healthy_mean"
## [37] "moderate_mean" "severe_mean" "ICU_mean"
## [40] "mod_health_foldChange" "sevr_health_foldChange" "ICU_health_foldChange"
## [43] "monotonicIncrease" "monotonicDecrease"
Let’s keep only the actual samples for each class and set aside the sole convalescent sample.
convalescentDecr <- monotonicDecreaseOrder10$convalescent
convalescentIncr <- monotonicIncreaseOrder10$convalescent
ensemblID <- monotonicIncreaseOrder10$ENSEMBLID
monotonicIncrease2 <- monotonicIncreaseOrder10[,c(3:35)]
monotonicDecrease2 <- monotonicDecreaseOrder10[,c(3:35)]
sampleType <- c(rep('healthy',17),rep('moderate',4),
rep('severe',8),rep('ICU',4))
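As a hedged alternative to typing out the counts in rep(), the class labels could be derived from the column names themselves. This is only a sketch: the `sampleNames` vector is rebuilt by hand from the colnames() output above for illustration, and `derivedType` is a hypothetical name not used elsewhere in this script.

```r
# Rebuild the 33 sample column names shown in the colnames() output (illustration only)
sampleNames <- c(paste0('healthy', 1:17), paste0('moderate', 1:4),
                 paste0('severe', 1:8), paste0('ICU', 1:4))

# Strip the trailing sample number to recover the class label
derivedType <- sub('[0-9]+$', '', sampleNames)

# Count the samples per class
table(derivedType)
```

Deriving the labels this way keeps them in step with the columns if samples are ever added or reordered.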
Transpose the two data frames so that the samples become the rows and the genes, named by the EnsemblID feature, become the column headers.
tIncr <- as.data.frame(t(monotonicIncrease2))
colnames(tIncr) <- ensemblID
tIncr$sample <- sampleType
IncreaseMx <- tIncr[,c(11,1:10)]
set.seed(8989)
inTrain <- createDataPartition(y=IncreaseMx$sample, p=0.7, list=FALSE)
trainingSet <- IncreaseMx[inTrain,]
testingSet <- IncreaseMx[-inTrain,]
dim(trainingSet)
## [1] 24 11
dim(testingSet)
## [1] 9 11
When using all 8088 genes as tokens for lda, it fails with accuracy and kappa values of NA, so the gene set was reduced to the 10 genes with the highest fold change in the ICU/healthy means. I originally thought we were using the latent Dirichlet allocation model in caret, but caret’s lda is linear discriminant analysis, which is why it fails with far more genes than samples. I do want to use the latent Dirichlet allocation model, but it is not in caret; it is in another R package.
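One way to see why linear discriminant analysis struggles with more features than samples is to look at the rank of the within-class centered data, which is what the pooled covariance matrix is built from. The sketch below uses random toy data with the same class sizes as this study (17, 4, 8, 4) and an arbitrary 100 features; it is not the study data, and the variable names are made up for illustration.

```r
set.seed(1)
n <- 33    # samples, same count as the study
p <- 100   # toy number of features, already more than the samples
k <- 4     # classes
x <- matrix(rnorm(n * p), nrow = n)
grp <- factor(rep(c('healthy', 'moderate', 'severe', 'ICU'), times = c(17, 4, 8, 4)))

# Center every feature within its own class
centered <- x
for (g in levels(grp)) {
  idx <- grp == g
  centered[idx, ] <- sweep(x[idx, , drop = FALSE], 2,
                           colMeans(x[idx, , drop = FALSE]))
}

# The pooled covariance is crossprod(centered)/(n - k), so it shares this rank
covRank <- qr(centered)$rank
covRank
```

The rank can be at most n - k, which is far below the number of features, so the p-by-p pooled covariance matrix is singular and cannot be inverted as the method requires.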
IncreaseMx$sample <- as.factor(paste(IncreaseMx$sample))
set.seed(505005)
ldaMod <- train(sample~., method='lda', data=trainingSet,
preProc = c("center", "scale"),
trControl=trainControl(method='boot'))
predlda <- predict(ldaMod, testingSet)
DF_ldaI <- data.frame(predlda, type=testingSet$sample)
DF_ldaI
## predlda type
## 1 healthy healthy
## 2 healthy healthy
## 3 healthy healthy
## 4 healthy healthy
## 5 healthy healthy
## 6 severe moderate
## 7 healthy severe
## 8 ICU severe
## 9 ICU ICU
errored1 <- DF_ldaI[(DF_ldaI$predlda != DF_ldaI$type),]
errored1
## predlda type
## 6 severe moderate
## 7 healthy severe
## 8 ICU severe
rn <- as.numeric(row.names(errored1))
rn
## [1] 6 7 8
ldaError <- testingSet[rn,]
ldaError
## sample ENSG00000133742 ENSG00000206177 ENSG00000158578
## moderate1 moderate 3 0 2
## severe1 severe 3 1 10
## severe7 severe 152 136 2513
## ENSG00000223609 ENSG00000204010 ENSG00000169877 ENSG00000170180
## moderate1 1 3 0 4
## severe1 11 0 4 1
## severe7 330 12 100 3
## ENSG00000143416 ENSG00000112077 ENSG00000076864
## moderate1 2 2 2
## severe1 4 0 5
## severe7 270 14 17
Inc_lda_wrong <- row.names(ldaError)
lda_demg <- HeaderInformation[HeaderInformation$CN_new %in% row.names(ldaError),]
lda_demg
## GSM_ID CN_old CN_new age gender diseaseState
## 13 GSM4614986 S147_nCoV001EUHM-Draw-1 moderate1 75 F COVID-19
## 14 GSM4614987 S149_nCoV002EUHM-Draw-2 severe1 54 M COVID-19
## 26 GSM4614999 S178_nCoV028EUHM-Draw-1 severe7 56 M COVID-19
## severity geographicLocation cellType
## 13 Moderate Atlanta, GA, USA PBMC
## 14 Severe Atlanta, GA, USA PBMC
## 26 Severe Atlanta, GA, USA PBMC
The three samples misclassified by lda in our 10 most expressed genes are two males aged 54 and 56 and one female aged 75.
When using lda on only the 10 genes with the highest fold change in ICU/healthy mean values, taken from the genes that increased in expression with each more severe grade of COVID-19, there was an improvement from 5/9 correct to 6/9, or 67% accuracy.
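The 6/9 figure can be checked directly from the prediction table. A minimal sketch, with the predictions and true labels copied by hand from the DF_ldaI output above into a hypothetical `DF_ldaI_copy` data frame:

```r
# Predictions and true labels copied from the DF_ldaI output above
DF_ldaI_copy <- data.frame(
  predlda = c(rep('healthy', 5), 'severe', 'healthy', 'ICU', 'ICU'),
  type    = c(rep('healthy', 5), 'moderate', 'severe', 'severe', 'ICU'),
  stringsAsFactors = FALSE)

# Overall accuracy is the share of rows where prediction and label agree
nCorrect <- sum(DF_ldaI_copy$predlda == DF_ldaI_copy$type)
overallAccuracy <- nCorrect / nrow(DF_ldaI_copy)
round(overallAccuracy, 2)
```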
set.seed(605040)
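# Note: number=5 is an argument to train() here, not to trainControl(), so the
# bootstrap still uses trainControl()'s default number of resamples.
rf_boot <- train(sample~., method='rf',
na.action=na.pass,
data=(trainingSet), preProc = c("center", "scale"),
trControl=trainControl(method='boot'), number=5)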
predRF_boot <- predict(rf_boot, testingSet)
DF_bootI <- data.frame(predRF_boot, type=testingSet$sample)
head(DF_bootI)
## predRF_boot type
## 1 healthy healthy
## 2 healthy healthy
## 3 healthy healthy
## 4 healthy healthy
## 5 healthy healthy
## 6 healthy moderate
errored2 <- DF_bootI[(DF_bootI$predRF_boot != DF_bootI$type),]
errored2
## predRF_boot type
## 6 healthy moderate
## 7 healthy severe
## 8 ICU severe
## 9 severe ICU
Let’s see which samples created an error in our random forest model.
rn <- as.numeric(row.names(errored2))
rn
## [1] 6 7 8 9
rfError <- testingSet[rn,]
rfError
## sample ENSG00000133742 ENSG00000206177 ENSG00000158578
## moderate1 moderate 3 0 2
## severe1 severe 3 1 10
## severe7 severe 152 136 2513
## ICU2 ICU 4672 2432 27287
## ENSG00000223609 ENSG00000204010 ENSG00000169877 ENSG00000170180
## moderate1 1 3 0 4
## severe1 11 0 4 1
## severe7 330 12 100 3
## ICU2 7961 1468 1924 43
## ENSG00000143416 ENSG00000112077 ENSG00000076864
## moderate1 2 2 2
## severe1 4 0 5
## severe7 270 14 17
## ICU2 2790 105 496
RF_demg <- HeaderInformation[HeaderInformation$CN_new %in% row.names(rfError),]
RF_demg
## GSM_ID CN_old CN_new age gender diseaseState
## 13 GSM4614986 S147_nCoV001EUHM-Draw-1 moderate1 75 F COVID-19
## 14 GSM4614987 S149_nCoV002EUHM-Draw-2 severe1 54 M COVID-19
## 20 GSM4614993 S155_nCOV021EUHM ICU2 60 F COVID-19
## 26 GSM4614999 S178_nCoV028EUHM-Draw-1 severe7 56 M COVID-19
## severity geographicLocation cellType
## 13 Moderate Atlanta, GA, USA PBMC
## 14 Severe Atlanta, GA, USA PBMC
## 20 ICU Atlanta, GA, USA PBMC
## 26 Severe Atlanta, GA, USA PBMC
The random forest model on the 10 most expressed genes got 4/9 wrong in the classification with two females aged 60 and 75, and two males aged 54 and 56.
Inc_rf_wrong <- row.names(rfError)
Let’s try lda with the 10 least expressed genes with monotonic decreases across grades of COVID-19.
tDecr <- as.data.frame(t(monotonicDecrease2))
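# Note: ensemblID was taken from monotonicIncreaseOrder10 above, so the decreasing
# genes carry the increasing genes' IDs in the output below;
# monotonicDecreaseOrder10$ENSEMBLID holds the IDs that belong to these columns.
colnames(tDecr) <- ensemblID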
tDecr$sample <- sampleType
DecreaseMx <- tDecr[,c(11,1:10)]
set.seed(8989)
inTrain <- createDataPartition(y=DecreaseMx$sample, p=0.7, list=FALSE)
trainingSet <- DecreaseMx[inTrain,]
testingSet <- DecreaseMx[-inTrain,]
dim(trainingSet)
## [1] 24 11
dim(testingSet)
## [1] 9 11
When using all 8088 genes as tokens for lda, it fails with accuracy and kappa values NA. So it was reduced to the top 10 highest fold change values in the ICU/healthy means.
DecreaseMx$sample <- as.factor(paste(DecreaseMx$sample))
set.seed(505005)
ldaMod <- train(sample~., method='lda', data=trainingSet,
preProc = c("center", "scale"),
trControl=trainControl(method='boot'))
predlda <- predict(ldaMod, testingSet)
DF_ldaD <- data.frame(predlda, type=testingSet$sample)
DF_ldaD
## predlda type
## 1 healthy healthy
## 2 healthy healthy
## 3 healthy healthy
## 4 healthy healthy
## 5 ICU healthy
## 6 severe moderate
## 7 ICU severe
## 8 ICU severe
## 9 severe ICU
errored3 <- DF_ldaD[(DF_ldaD$predlda != DF_ldaD$type),]
errored3
## predlda type
## 5 ICU healthy
## 6 severe moderate
## 7 ICU severe
## 8 ICU severe
## 9 severe ICU
rn <- as.numeric(row.names(errored3))
rn
## [1] 5 6 7 8 9
ldaError <- testingSet[rn,]
ldaError
## sample ENSG00000133742 ENSG00000206177 ENSG00000158578
## healthy14 healthy 20 68 0
## moderate1 moderate 6807 1052 3
## severe1 severe 104 655 0
## severe7 severe 1131 1817 3
## ICU2 ICU 381 644 3
## ENSG00000223609 ENSG00000204010 ENSG00000169877 ENSG00000170180
## healthy14 0 0 0 168
## moderate1 20 3 4 13112
## severe1 17 1 1 1573
## severe7 116 1 8 5163
## ICU2 35 1 0 2891
## ENSG00000143416 ENSG00000112077 ENSG00000076864
## healthy14 0 1 0
## moderate1 11 0 0
## severe1 4 1 1
## severe7 8 0 0
## ICU2 12 1 2
decr_lda_wrong <- row.names(ldaError)
Let’s use random forest on the decreasing genes data to see if it predicts the grades any better.
set.seed(605040)
# Note: as before, number=5 goes to train() rather than trainControl().
rf_boot <- train(sample~., method='rf',
na.action=na.pass,
data=(trainingSet), preProc = c("center", "scale"),
trControl=trainControl(method='boot'), number=5)
predRF_boot <- predict(rf_boot, testingSet)
DF_bootD <- data.frame(predRF_boot, type=testingSet$sample)
head(DF_bootD)
## predRF_boot type
## 1 healthy healthy
## 2 healthy healthy
## 3 healthy healthy
## 4 healthy healthy
## 5 healthy healthy
## 6 severe moderate
errored4 <- DF_bootD[(DF_bootD$predRF_boot != DF_bootD$type),]
errored4
## predRF_boot type
## 6 severe moderate
## 8 ICU severe
## 9 severe ICU
Let’s see which samples created an error in our random forest model.
rn <- as.numeric(row.names(errored4))
rn
## [1] 6 8 9
rfError <- testingSet[rn,]
rfError
## sample ENSG00000133742 ENSG00000206177 ENSG00000158578
## moderate1 moderate 6807 1052 3
## severe7 severe 1131 1817 3
## ICU2 ICU 381 644 3
## ENSG00000223609 ENSG00000204010 ENSG00000169877 ENSG00000170180
## moderate1 20 3 4 13112
## severe7 116 1 8 5163
## ICU2 35 1 0 2891
## ENSG00000143416 ENSG00000112077 ENSG00000076864
## moderate1 11 0 0
## severe7 8 0 0
## ICU2 12 1 2
decr_rf_wrong <- row.names(rfError)
There was an improvement with the random forest on the decreasing data set’s 10 genes, with 6/9 correct. One sample each of moderate, severe, and ICU was incorrect.
Let’s see the age and gender of those samples with our header information.
RF_demg <- HeaderInformation[HeaderInformation$CN_new %in% row.names(rfError),]
RF_demg
## GSM_ID CN_old CN_new age gender diseaseState
## 13 GSM4614986 S147_nCoV001EUHM-Draw-1 moderate1 75 F COVID-19
## 20 GSM4614993 S155_nCOV021EUHM ICU2 60 F COVID-19
## 26 GSM4614999 S178_nCoV028EUHM-Draw-1 severe7 56 M COVID-19
## severity geographicLocation cellType
## 13 Moderate Atlanta, GA, USA PBMC
## 20 ICU Atlanta, GA, USA PBMC
## 26 Severe Atlanta, GA, USA PBMC
Two females aged 60 and 75 and one male aged 56 were the misclassified samples in the random forest model using the 10 least expressed genes by fold change of ICU/healthy means.
Let’s compare which samples were misclassified in our testing set for the 10 least and most expressed genes in rf and lda.
decr_lda_wrong;decr_rf_wrong; Inc_lda_wrong;Inc_rf_wrong
## [1] "healthy14" "moderate1" "severe1" "severe7" "ICU2"
## [1] "moderate1" "severe7" "ICU2"
## [1] "moderate1" "severe1" "severe7"
## [1] "moderate1" "severe1" "severe7" "ICU2"
Every model misclassified the 1st moderate case and the 7th severe case. The 1st severe case was misclassified by every model except rf on the 10 least expressed genes, and the 2nd ICU case by every model except lda on the 10 most expressed genes. The 14th healthy case was misclassified only by the lda model on the 10 least expressed genes. Let’s see the demographics of those five samples.
HeaderInformation[HeaderInformation$CN_new %in% decr_lda_wrong,]
## GSM_ID CN_old CN_new age gender diseaseState
## 13 GSM4614986 S147_nCoV001EUHM-Draw-1 moderate1 75 F COVID-19
## 14 GSM4614987 S149_nCoV002EUHM-Draw-2 severe1 54 M COVID-19
## 20 GSM4614993 S155_nCOV021EUHM ICU2 60 F COVID-19
## 26 GSM4614999 S178_nCoV028EUHM-Draw-1 severe7 56 M COVID-19
## 31 GSM4615034 S183_263 healthy14 23 F Healthy
## severity geographicLocation cellType
## 13 Moderate Atlanta, GA, USA PBMC
## 14 Severe Atlanta, GA, USA PBMC
## 20 ICU Atlanta, GA, USA PBMC
## 26 Severe Atlanta, GA, USA PBMC
## 31 Healthy Atlanta, GA, USA PBMC
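The overlap between the four misclassification lists printed above can also be tallied rather than read off by eye. A small sketch, with the four vectors retyped from that output into a hypothetical `wrong` list:

```r
# The four misclassification lists, copied from the output above
wrong <- list(
  decr_lda = c('healthy14', 'moderate1', 'severe1', 'severe7', 'ICU2'),
  decr_rf  = c('moderate1', 'severe7', 'ICU2'),
  incr_lda = c('moderate1', 'severe1', 'severe7'),
  incr_rf  = c('moderate1', 'severe1', 'severe7', 'ICU2'))

# How many of the four models missed each sample
missCounts <- sort(table(unlist(wrong)), decreasing = TRUE)
missCounts

# Samples missed by every model
missedByAll <- Reduce(intersect, wrong)
missedByAll
```

With these lists, only moderate1 and severe7 appear in all four.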
This is where precision and recall come into play. For a given class, precision is the share of samples predicted as that class that actually belong to it, and recall is the share of samples actually in that class that were predicted as it. We have four classes to choose from.
precision: true positives/(true positives + false positives); recall: true positives/(true positives + false negatives)
The rf on 10 most:
DF_bootI
## predRF_boot type
## 1 healthy healthy
## 2 healthy healthy
## 3 healthy healthy
## 4 healthy healthy
## 5 healthy healthy
## 6 healthy moderate
## 7 healthy severe
## 8 ICU severe
## 9 severe ICU
#healthy precision and recall and accuracy:
5/(5+2)
## [1] 0.7142857
5/(5+0)
## [1] 1
(5+2)/(5+2+2+0)
## [1] 0.7777778
#moderate precision and recall and accuracy:
#(no sample was predicted moderate, so precision is really 0/0 and is reported as 0)
0/(1+0)
## [1] 0
0/(1)
## [1] 0
(0+8)/(0+8+1+0)
## [1] 0.8888889
#severe precision and recall and accuracy
0/(0+1)
## [1] 0
0/(0+2)
## [1] 0
(0+6)/(0+6+1+2)
## [1] 0.6666667
#ICU precision and recall and accuracy
0/(0+1)
## [1] 0
0/(0+1)
## [1] 0
(0+7)/(0+7+1+1)
## [1] 0.7777778
The precision, recall, and accuracy were calculated manually for the four classes in the random forest model on the top 10 genes with a testing subset of 9 samples. The recall on the healthy class was 100% because all healthy samples were correctly identified as true positives, with no false negatives, meaning no healthy samples were misclassified as another class. The precision on the healthy class was lowered by 2 false positives, where 2 samples were classified as healthy when they weren’t. The accuracy on the moderate class was 89%, but there was only 1 moderate sample and none of the predictions was moderate, so the accuracy was high only because of the true negatives from not classifying any other sample as moderate.
Let’s make a function specific to our data frames to return the precision, recall, and accuracy of these four classes.
precisionRecallAccuracy <- function(df){
colnames(df) <- c('pred','type')
df$pred <- as.character(paste(df$pred))
df$type <- as.character(paste(df$type))
classes <- unique(df$type)
class1a <- as.character(paste(classes[1]))
class2a <- as.character(paste(classes[2]))
class3a <- as.character(paste(classes[3]))
class4a <- as.character(paste(classes[4]))
#correct classes
class1 <- subset(df, df$type==class1a)
class2 <- subset(df, df$type==class2a)
class3 <- subset(df, df$type==class3a)
class4 <- subset(df, df$type==class4a)
#incorrect classes
notClass1 <- subset(df,df$type != class1a)
notClass2 <- subset(df,df$type != class2a)
notClass3 <- subset(df,df$type != class3a)
notClass4 <- subset(df, df$type != class4a)
#true positives (real positives predicted positive)
tp_1 <- sum(class1$pred==class1$type)
tp_2 <- sum(class2$pred==class2$type)
tp_3 <- sum(class3$pred==class3$type)
tp_4 <- sum(class4$pred==class4$type)
#false positives (real negatives predicted positive)
fp_1 <- sum(notClass1$pred==class1a)
fp_2 <- sum(notClass2$pred==class2a)
fp_3 <- sum(notClass3$pred==class3a)
fp_4 <- sum(notClass4$pred==class4a)
#false negatives (real positive predicted negative)
fn_1 <- sum(class1$pred!=class1$type)
fn_2 <- sum(class2$pred!=class2$type)
fn_3 <- sum(class3$pred!=class3$type)
fn_4 <- sum(class4$pred!=class4$type)
#true negatives (real negatives predicted negative)
tn_1 <- sum(notClass1$pred!=class1a)
tn_2 <- sum(notClass2$pred!=class2a)
tn_3 <- sum(notClass3$pred!=class3a)
tn_4 <- sum(notClass4$pred!=class4a)
#precision
p1 <- tp_1/(tp_1+fp_1)
p2 <- tp_2/(tp_2+fp_2)
p3 <- tp_3/(tp_3+fp_3)
p4 <- tp_4/(tp_4+fp_4)
# 0/0 gives NaN when a class is never predicted; report it as 0
p1 <- ifelse(is.nan(p1),0,p1)
p2 <- ifelse(is.nan(p2),0,p2)
p3 <- ifelse(is.nan(p3),0,p3)
p4 <- ifelse(is.nan(p4),0,p4)
#recall
r1 <- tp_1/(tp_1+fn_1)
r2 <- tp_2/(tp_2+fn_2)
r3 <- tp_3/(tp_3+fn_3)
r4 <- tp_4/(tp_4+fn_4)
# 0/0 gives NaN when a class has no samples; report it as 0
r1 <- ifelse(is.nan(r1),0,r1)
r2 <- ifelse(is.nan(r2),0,r2)
r3 <- ifelse(is.nan(r3),0,r3)
r4 <- ifelse(is.nan(r4),0,r4)
#accuracy
ac1 <- (tp_1+tn_1)/(tp_1+tn_1+fp_1+fn_1)
ac2 <- (tp_2+tn_2)/(tp_2+tn_2+fp_2+fn_2)
ac3 <- (tp_3+tn_3)/(tp_3+tn_3+fp_3+fn_3)
ac4 <- (tp_4+tn_4)/(tp_4+tn_4+fp_4+fn_4)
table <- as.data.frame(rbind(c(class1a,p1,r1,ac1),
c(class2a,p2,r2,ac2),
c(class3a,p3,r3,ac3),
c(class4a,p4,r4,ac4)))
colnames(table) <- c('class','precision','recall','accuracy')
return(table)
}
precisionRecallAccuracy(DF_bootI)
## class precision recall accuracy
## 1 healthy 0.714285714285714 1 0.777777777777778
## 2 moderate 0 0 0.888888888888889
## 3 severe 0 0 0.666666666666667
## 4 ICU 0 0 0.777777777777778
Note in the above results how accuracy can be much higher than precision and recall. If there are 9 samples and only one of them belongs to a class, then never predicting that class produces an error of only 1/9 for that class, which is 11% error or 89% accuracy. We saw this for the ‘moderate’ class, which wasn’t predicted at all but had only one test sample to predict.
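The same per-class metrics can also be sketched from a confusion table, which avoids hard-coding four classes. This is an illustrative alternative rather than the function used in this script; `perClassMetrics` and its argument names are hypothetical, and the predictions are retyped from the DF_bootI output above.

```r
perClassMetrics <- function(pred, truth) {
  # Use one shared set of levels so the confusion table is square
  lv <- unique(c(as.character(truth), as.character(pred)))
  cm <- table(pred = factor(pred, levels = lv),
              truth = factor(truth, levels = lv))
  tp <- diag(cm)               # predicted class k and truly class k
  fp <- rowSums(cm) - tp       # predicted class k but truly another class
  fn <- colSums(cm) - tp       # truly class k but predicted another class
  tn <- sum(cm) - tp - fp - fn
  # Report 0 instead of NaN when a denominator is zero
  precision <- ifelse(tp + fp == 0, 0, tp / (tp + fp))
  recall <- ifelse(tp + fn == 0, 0, tp / (tp + fn))
  data.frame(class = lv, precision = precision, recall = recall,
             accuracy = (tp + tn) / sum(cm), row.names = NULL)
}

# Random forest predictions on the 10 most expressed genes, copied from DF_bootI
pred  <- c(rep('healthy', 7), 'ICU', 'severe')
truth <- c(rep('healthy', 5), 'moderate', 'severe', 'severe', 'ICU')
rfMetrics <- perClassMetrics(pred, truth)
rfMetrics
```

Unlike the rbind() of character vectors in the function above, this keeps the metric columns numeric, so they can be rounded or averaged afterwards.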
The rf on 10 least:
DF_bootD
## predRF_boot type
## 1 healthy healthy
## 2 healthy healthy
## 3 healthy healthy
## 4 healthy healthy
## 5 healthy healthy
## 6 severe moderate
## 7 severe severe
## 8 ICU severe
## 9 severe ICU
precisionRecallAccuracy(DF_bootD)
## class precision recall accuracy
## 1 healthy 1 1 1
## 2 moderate 0 0 0.888888888888889
## 3 severe 0.333333333333333 0.5 0.666666666666667
## 4 ICU 0 0 0.777777777777778
In the above results, only 1 of the 3 samples predicted as ‘severe’ was actually ‘severe’, so the precision was 33%, and only 1 of the 2 samples that were actually ‘severe’ was predicted correctly, so the recall was 50%.
The lda on 10 most:
DF_ldaI
## predlda type
## 1 healthy healthy
## 2 healthy healthy
## 3 healthy healthy
## 4 healthy healthy
## 5 healthy healthy
## 6 severe moderate
## 7 healthy severe
## 8 ICU severe
## 9 ICU ICU
precisionRecallAccuracy(DF_ldaI)
## class precision recall accuracy
## 1 healthy 0.833333333333333 1 0.888888888888889
## 2 moderate 0 0 0.888888888888889
## 3 severe 0 0 0.666666666666667
## 4 ICU 0.5 1 0.888888888888889
The lda on 10 least:
DF_ldaD
## predlda type
## 1 healthy healthy
## 2 healthy healthy
## 3 healthy healthy
## 4 healthy healthy
## 5 ICU healthy
## 6 severe moderate
## 7 ICU severe
## 8 ICU severe
## 9 severe ICU
precisionRecallAccuracy(DF_ldaD)
## class precision recall accuracy
## 1 healthy 1 0.8 0.888888888888889
## 2 moderate 0 0 0.888888888888889
## 3 severe 0 0 0.555555555555556
## 4 ICU 0 0 0.555555555555556
We know that we need to eliminate some of these genes from our most and least expressed gene set of 10 genes. We will do that later and test our prediction accuracy.
This script was designed to see if the lda model would allow a very wide and short data frame or matrix to be used like a tokenized natural language processing data frame, but for genes. Unfortunately, I used the linear discriminant analysis (lda) model of caret, which fails because of the curse of dimensionality when there are more features than samples. Right from the start, our 8089X34 matrix failed the lda modeling with missing accuracy and kappa values. I didn’t tune the model at all or use any hyperparameters other than the center and scale options of the preprocessing argument in caret. It could possibly work better if the model were tuned, but a matrix this wide will never work in linear discriminant analysis, which would have to divide 8000+ features from one data frame into 4 groups with linear boundaries.
Latent Dirichlet allocation is a natural language processing (NLP) algorithm, and that kind of lda model could treat our numeric gene data as integer counts of ‘words’, known as tokens, instead of weighted counts like Term Frequency-Inverse Document Frequency. https://www.tidytextmining.com/topicmodeling.html is a good source for testing out LDA. I have used it before in other scripts, but it has been a while, and I thought caret had the model based on previous work and information on caret. When I wanted to use linear discriminant analysis I couldn’t find it and used latent Dirichlet allocation; now that I want latent Dirichlet allocation, it’s not in the package I thought it was.
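A rough sketch of what the latent Dirichlet allocation route might look like, assuming the topicmodels package (the one used in the tidytextmining chapter linked above) is installed. The count matrix here is random toy data standing in for a samples-by-genes matrix, not the study data, and the choice of k = 2 topics is arbitrary.

```r
set.seed(42)
# Toy stand-in for a samples-by-genes matrix of integer counts (NOT the study data);
# adding 1 guarantees that no sample has an all-zero row
counts <- matrix(rpois(6 * 20, lambda = 5) + 1L, nrow = 6,
                 dimnames = list(paste0('sample', 1:6), paste0('gene', 1:20)))

if (requireNamespace('topicmodels', quietly = TRUE)) {
  # Each sample plays the role of a document, each gene the role of a term
  ldaFit <- topicmodels::LDA(counts, k = 2, control = list(seed = 42))
  # The gamma slot holds the per-sample topic proportions (rows sum to 1)
  sampleTopics <- ldaFit@gamma
  print(round(sampleTopics, 3))
}
```

Here the topics are unsupervised groupings, so whether they line up with the healthy, moderate, severe, and ICU grades would still have to be checked against the labels.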
Instead, we explored the 10 most and least expressed genes that are also increasing or decreasing monotonically from the least severe to the most severe grade of COVID-19. The machine learning classification was about the same as in the work immediately preceding this script, which used the 20 most and least expressed genes by fold change only. Also, this script didn’t use the functions in our geneFunctions2.R script to merge these genes with the HUGO Gene Nomenclature Committee (HGNC) gene symbols we are more familiar with; instead it kept the Ensembl gene IDs.
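For reference, here is a sketch of how the fold change and the monotonic flags could be computed from the class means. The two rows are taken from the Tableau export tables shown further down in this document. The definition of monotonic used here, strictly ordered across the moderate, severe, and ICU means, is an assumption, since the earlier scripts that built those columns are not shown here.

```r
# One increasing and one decreasing gene, with means copied from the tables below
means <- data.frame(
  Ensemblid     = c('ENSG00000076864', 'ENSG00000282122'),
  Healthy.Mean  = c(1.647059, 45.35294),
  Moderate.Mean = c(32, 2929.5),
  Severe.Mean   = c(51, 877.625),
  ICU.mean      = c(157, 443),
  stringsAsFactors = FALSE)

# Fold change is the ratio of a disease-grade mean to the healthy mean
means$ICU.health.foldChange <- means$ICU.mean / means$Healthy.Mean

# Assumed definition: strictly ordered means across the three disease grades
means$monotonicIncrease <- means$Moderate.Mean < means$Severe.Mean &
  means$Severe.Mean < means$ICU.mean
means$monotonicDecrease <- means$Moderate.Mean > means$Severe.Mean &
  means$Severe.Mean > means$ICU.mean
means
```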
We should look at this data in Tableau to see if some genes pop out as useful or not, consider the samples that threw off the classifications, and possibly tune or test different hyperparameters of our ML models to see if we can improve. But resampling and tuning on so few samples could lead to overfitting, and an overfit model won’t predict the classes well.
The overall accuracy of the model could be improved. The healthy samples naturally had better precision and recall, as they had the most samples of all the classes. A few samples in particular, two females and a male all aged 56 or older, were misclassified repeatedly; we could try leaving those samples out to see if they are outliers or data that skews the mean values and ultimately the fold change values.
Topics in Latent Dirichlet Allocation are defined through a separate library. The same wide matrices could also be used with a grid search and the tokenized-word approaches used in recommender systems and sentiment reviews, but with random forest, neural nets, and gradient boosted machines. These caret algorithms ran very fast compared to how they have behaved in my past work. I recommend trying Latent Dirichlet Allocation to model the classes as topics, or using the tokenized formats of sentiment analysis and recommender systems with those machine learning algorithms. My next extension of this study will explore those areas, possibly with Python accessed from R through reticulate.
I made a couple of Tableau charts and uploaded them to the Tableau Public Server to share.
The first chart compares the mean values across all three COVID-19 grades, highlighting the fold change (the ratio of disease mean to healthy mean), for genes that are monotonically decreasing with fold change values of at least 30.
monotonically Decreasing Genes
The next chart shows the monotonically increasing genes with fold change values of at least 30.
monotonically increasing genes
Each link will take you to the chart, where you can hover for details. What is missing is each Ensembl ID’s HGNC gene symbol, with annotations for what those genes do. Let’s import the chart data that was downloaded earlier and saved to this document’s folder.
mDecr <- read.csv('monotonicDecrease_Full_Data_data.csv',sep=',',
header=T, na.strings=c('',' ','NA'),
stringsAsFactors = F)
head(mDecr)
## ï..Moderate.Mean Severe.Mean ICU.mean Ensemblid Healthy.Mean
## 1 2929.50 877.625 443.00 ENSG00000282122 45.3529400
## 2 11.75 7.625 3.50 ENSG00000188403 0.2352941
## 3 1644.00 68.000 41.25 ENSG00000253818 13.0588200
## 4 143.50 143.000 27.25 ENSG00000253709 1.4705880
## 5 1490.00 238.000 209.00 ENSG00000211644 24.2352900
## 6 4823.00 1991.750 1405.75 ENSG00000211950 52.6470600
## ICU.health.foldChange mod.health.foldChange sevr.health.foldChange
## 1 9.767834 64.59339 19.351010
## 2 14.875000 49.93750 32.406250
## 3 3.158784 125.89190 5.207207
## 4 18.530000 97.58000 97.240000
## 5 8.623786 61.48058 9.820388
## 6 26.701400 91.61006 37.832120
mIncr <- read.csv('monotonicIncrease_Full_Data_data.csv',sep=',',
header=T, na.strings=c('',' ','NA'),
stringsAsFactors = F)
head(mIncr)
## ï..Moderate.Mean Severe.Mean ICU.mean Ensemblid Healthy.Mean
## 1 32.00 51.000 157.00 ENSG00000076864 1.647059
## 2 261.25 338.750 1049.75 ENSG00000143416 8.411765
## 3 300.00 572.125 1989.25 ENSG00000211625 37.294118
## 4 5.00 597.625 600.50 ENSG00000087116 7.352941
## 5 8.25 39.250 157.25 ENSG00000138772 4.941176
## 6 99.25 539.125 1003.00 ENSG00000282651 28.176471
## ICU.health.foldChange mod.health.foldChange sevr.health.foldChange
## 1 95.32143 19.428571 30.964286
## 2 124.79545 31.057692 40.270979
## 3 53.33951 8.044164 15.340891
## 4 81.66800 0.680000 81.277000
## 5 31.82440 1.669643 7.943452
## 6 35.59708 3.522443 19.133873
Let’s change the 1st column name to ‘Moderate.Mean’ in both data frames to remove the stray characters introduced when the Tableau chart data was downloaded.
colnames(mDecr)[1] <- 'Moderate.Mean'
colnames(mIncr)[1] <- 'Moderate.Mean'
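The stray ‘ï..’ prefix looks like a UTF-8 byte order mark (BOM) at the start of the exported file, although that is an assumption because the raw file is not shown here. If so, reading with fileEncoding set to 'UTF-8-BOM' should avoid the manual rename. The sketch below writes a small temporary file with a BOM to show the idea; the file and its contents are made up for illustration.

```r
# Write a tiny CSV that starts with the UTF-8 byte order mark (illustration only)
tmp <- tempfile(fileext = '.csv')
con <- file(tmp, 'wb')
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw('Moderate Mean,Severe Mean\n2929.5,877.625\n'), con)
close(con)

# Reading with the BOM-aware encoding drops the mark from the first header
withBom <- read.csv(tmp, fileEncoding = 'UTF-8-BOM', stringsAsFactors = FALSE)
colnames(withBom)
```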
Now let’s source our functions to grab the gene symbols from genecards.org and the annotations.
source('geneCards2.R')
## Warning: package 'rvest' was built under R version 3.6.3
## Loading required package: xml2
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
Let’s make a list of the Ensembl genes from both tables. They are not the same genes, because a gene that is monotonically increasing in fold change from the least to the most severe cases of COVID-19 cannot also be monotonically decreasing.
decrList <- as.character(mDecr$Ensemblid)
incrList <- as.character(mIncr$Ensemblid)
geneList <- as.character(c(decrList,incrList))
geneList
## [1] "ENSG00000282122" "ENSG00000188403" "ENSG00000253818" "ENSG00000253709"
## [5] "ENSG00000211644" "ENSG00000211950" "ENSG00000165949" "ENSG00000076864"
## [9] "ENSG00000143416" "ENSG00000211625" "ENSG00000087116" "ENSG00000138772"
## [13] "ENSG00000282651" "ENSG00000069535" "ENSG00000170439" "ENSG00000004939"
for (i in geneList){
find25genes(i)
}
files <- list.files('./gene scrapes')[1:16]
files
for (i in files){
file <- paste('./gene scrapes',i, sep='/')
t <- read.csv(file,sep=',',header=F)
write.table(t,file='monotonic1genes.csv',sep=',',append=T,col.names=F,row.names=F)
}
monotonic1genes <- read.delim('monotonic1genes.csv',header=F,sep=',',na.strings=c('',' ','NA'))
monotonic1genes$V2 <- toupper(monotonic1genes$V2)
colnames(monotonic1genes) <- c('gene','EnsemblGene','DateSourced')
monotonic1genes <- monotonic1genes[,-3]
head(monotonic1genes)
## gene EnsemblGene
## 1 SLC4A1 ENSG00000004939
## 2 MAOB ENSG00000069535
## 3 RAP1GAP ENSG00000076864
## 4 ADAMTS2 ENSG00000087116
## 5 ANXA3 ENSG00000138772
## 6 SELENBP1 ENSG00000143416
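Because the loop above appends to monotonic1genes.csv, running it a second time would duplicate every row. A hedged alternative is to read the scraped files into a list and bind them in memory. The sketch below runs on two toy files in a temporary folder; the gene and Ensembl ID pairs come from the output above, but the dates are placeholders and the folder name is made up.

```r
# Two toy scrape files in a temporary folder (dates are placeholders)
scrapeDir <- file.path(tempdir(), 'gene scrapes demo')
dir.create(scrapeDir, showWarnings = FALSE)
write.table(data.frame('SLC4A1', 'ensg00000004939', '2020-01-01'),
            file.path(scrapeDir, 'a.csv'), sep = ',',
            col.names = FALSE, row.names = FALSE)
write.table(data.frame('MAOB', 'ensg00000069535', '2020-01-01'),
            file.path(scrapeDir, 'b.csv'), sep = ',',
            col.names = FALSE, row.names = FALSE)

# Read every file and stack the rows without writing an intermediate CSV
scrapeFiles <- list.files(scrapeDir, full.names = TRUE)
pieces <- lapply(scrapeFiles, read.csv, header = FALSE, stringsAsFactors = FALSE)
merged <- do.call(rbind, pieces)
colnames(merged) <- c('gene', 'EnsemblGene', 'DateSourced')
merged$EnsemblGene <- toupper(merged$EnsemblGene)
merged <- merged[, c('gene', 'EnsemblGene')]
merged
```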
We will extract the gene summaries now.
for (i in monotonic1genes$gene){
getSummaries(i,'PBMC')
}
getGeneSummaries('PBMC')
monotonicSumms <- read.csv("proteinGeneSummaries_pbmc.csv",sep=',',
header=T, na.strings=c('',' ','NA'),
stringsAsFactors = F)
monotonicSumms <- monotonicSumms[,2:5]
head(monotonicSumms)
## gene
## 1 SLC4A1
## 2 MAOB
## 3 RAP1GAP
## 4 ADAMTS2
## 5 ANXA3
## 6 SELENBP1
## EntrezSummary
## 1 The protein encoded by this gene is part of the anion exchanger (AE) family and is expressed in the erythrocyte plasma membrane, where it functions as a chloride/bicarbonate exchanger involved in carbon dioxide transport from tissues to lungs. The protein comprises two domains that are structurally and functionally distinct. The N-terminal 40kDa domain is located in the cytoplasm and acts as an attachment site for the red cell skeleton by binding ankyrin. The glycosylated C-terminal membrane-associated domain contains 12-14 membrane spanning segments and carries out the stilbene disulphonate-sensitive exchange transport of anions. The cytoplasmic tail at the extreme C-terminus of the membrane domain binds carbonic anhydrase II. The encoded protein associates with the red cell membrane protein glycophorin A and this association promotes the correct folding and translocation of the exchanger. This protein is predominantly dimeric but forms tetramers in the presence of ankyrin. Many mutations in this gene are known in man, and these mutations can lead to two types of disease: destabilization of red cell membrane leading to hereditary spherocytosis, and defective kidney acid secretion leading to distal renal tubular acidosis. Other mutations that do not give rise to disease result in novel blood group antigens, which form the Diego blood group system. Southeast Asian ovalocytosis (SAO, Melanesian ovalocytosis) results from the heterozygous presence of a deletion in the encoded protein and is common in areas where Plasmodium falciparum malaria is endemic. One null mutation in this gene is known, resulting in very severe anemia and nephrocalcinosis. [provided by RefSeq, Jul 2008]
## 2 The protein encoded by this gene belongs to the flavin monoamine oxidase family. It is a enzyme located in the mitochondrial outer membrane. It catalyzes the oxidative deamination of biogenic and xenobiotic amines and plays an important role in the metabolism of neuroactive and vasoactive amines in the central nervous sysytem and peripheral tissues. This protein preferentially degrades benzylamine and phenylethylamine. [provided by RefSeq, Jul 2008]
## 3 This gene encodes a type of GTPase-activating-protein (GAP) that down-regulates the activity of the ras-related RAP1 protein. RAP1 acts as a molecular switch by cycling between an inactive GDP-bound form and an active GTP-bound form. The product of this gene, RAP1GAP, promotes the hydrolysis of bound GTP and hence returns RAP1 to the inactive state whereas other proteins, guanine nucleotide exchange factors (GEFs), act as RAP1 activators by facilitating the conversion of RAP1 from the GDP- to the GTP-bound form. In general, ras subfamily proteins, such as RAP1, play key roles in receptor-linked signaling pathways that control cell growth and differentiation. RAP1 plays a role in diverse processes such as cell proliferation, adhesion, differentiation, and embryogenesis. Alternative splicing results in multiple transcript variants encoding distinct proteins. [provided by RefSeq, Aug 2011]
## 4 This gene encodes a member of the ADAMTS (a disintegrin and metalloproteinase with thrombospondin motifs) protein family. Members of the family share several distinct protein modules, including a propeptide region, a metalloproteinase domain, a disintegrin-like domain, and a thrombospondin type 1 (TS) motif. Individual members of this family differ in the number of C-terminal TS motifs, and some have unique C-terminal domains. The encoded preproprotein is proteolytically processed to generate the mature procollagen N-proteinase. This proteinase excises the N-propeptide of the fibrillar procollagens types I-III and type V. Mutations in this gene cause Ehlers-Danlos syndrome type VIIC, a recessively inherited connective-tissue disorder. Alternative splicing results in multiple transcript variants, at least one of which encodes an isoform that is proteolytically processed. [provided by RefSeq, Feb 2016]
## 5 This gene encodes a member of the annexin family. Members of this calcium-dependent phospholipid-binding protein family play a role in the regulation of cellular growth and in signal transduction pathways. This protein functions in the inhibition of phopholipase A2 and cleavage of inositol 1,2-cyclic phosphate to form inositol 1-phosphate. This protein may also play a role in anti-coagulation. [provided by RefSeq, Jul 2008]
## 6 This gene encodes a member of the selenium-binding protein family. Selenium is an essential nutrient that exhibits potent anticarcinogenic properties, and deficiency of selenium may cause certain neurologic diseases. The effects of selenium in preventing cancer and neurologic diseases may be mediated by selenium-binding proteins, and decreased expression of this gene may be associated with several types of cancer. The encoded protein may play a selenium-dependent role in ubiquitination/deubiquitination-mediated protein degradation. Alternatively spliced transcript variants encoding multiple isoforms have been observed for this gene. [provided by RefSeq, Apr 2012]
## GeneCardsSummary
## 1 SLC4A1 (Solute Carrier Family 4 Member 1 (Diego Blood Group)) is a Protein Coding gene. Diseases associated with SLC4A1 include Renal Tubular Acidosis, Distal, Autosomal Dominant and Cryohydrocytosis. Among its related pathways are Transport of glucose and other sugars, bile salts and organic acids, metal ions and amine compounds and Neuroscience. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and transporter activity. An important paralog of this gene is SLC4A2.
## 2 MAOB (Monoamine Oxidase B) is a Protein Coding gene. Diseases associated with MAOB include Norrie Disease and Pathological Gambling. Among its related pathways are Tyrosine metabolism and Activated PKN1 stimulates transcription of AR (androgen receptor) regulated genes KLK2 and KLK3. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and electron transfer activity. An important paralog of this gene is MAOA.
## 3 RAP1GAP (RAP1 GTPase Activating Protein) is a Protein Coding gene. Diseases associated with RAP1GAP include Tuberous Sclerosis and Bleeding Disorder, Platelet-Type, 18. Among its related pathways are Ras signaling pathway and Development Angiotensin activation of ERK. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and GTPase activator activity. An important paralog of this gene is RAP1GAP2.
## 4 ADAMTS2 (ADAM Metallopeptidase With Thrombospondin Type 1 Motif 2) is a Protein Coding gene. Diseases associated with ADAMTS2 include Ehlers-Danlos Syndrome, Dermatosparaxis Type and Ehlers-Danlos Syndrome. Among its related pathways are Degradation of the extracellular matrix and Metabolism of proteins. Gene Ontology (GO) annotations related to this gene include peptidase activity and metallopeptidase activity. An important paralog of this gene is ADAMTS3.
## 5 ANXA3 (Annexin A3) is a Protein Coding gene. Diseases associated with ANXA3 include Ovarian Cancer and Prostate Cancer. Among its related pathways are Prostaglandin Synthesis and Regulation. Gene Ontology (GO) annotations related to this gene include calcium ion binding and calcium-dependent phospholipid binding. An important paralog of this gene is ANXA4.
## 6 SELENBP1 (Selenium Binding Protein 1) is a Protein Coding gene. Diseases associated with SELENBP1 include Extraoral Halitosis Due To Methanethiol Oxidase Deficiency and Posteroinferior Myocardial Infarction. Among its related pathways are IL1 and megakaryocytes in obesity and Metabolism. Gene Ontology (GO) annotations related to this gene include selenium binding.
## UniProtKB_Summary
## 1 Functions both as a transporter that mediates electroneutral anion exchange across the cell membrane and as a structural protein. Major integral membrane glycoprotein of the erythrocyte membrane; required for normal flexibility and stability of the erythrocyte membrane and for normal erythrocyte shape via the interactions of its cytoplasmic domain with cytoskeletal proteins, glycolytic enzymes, and hemoglobin. Functions as a transporter that mediates the 1:1 exchange of inorganic anions across the erythrocyte membrane. Mediates chloride-bicarbonate exchange in the kidney, and is required for normal acidification of the urine.\n B3AT_HUMAN,P02730\n
## 2 Catalyzes the oxidative deamination of biogenic and xenobiotic amines and has important functions in the metabolism of neuroactive and vasoactive amines in the central nervous system and peripheral tissues. MAOB preferentially degrades benzylamine and phenylethylamine.\n AOFB_HUMAN,P27338\n
## 3 GTPase activator for the nuclear Ras-related regulatory protein RAP-1A (KREV-1), converting it to the putatively inactive GDP-bound state.\n RPGP1_HUMAN,P47736\n
## 4 Cleaves the propeptides of type I and II collagen prior to fibril assembly (By similarity). Does not act on type III collagen (By similarity). Cleaves lysyl oxidase LOX at a site downstream of its propeptide cleavage site to produce a short LOX form with reduced collagen-binding activity (PubMed:31152061).\n ATS2_HUMAN,O95450\n
## 5 Inhibitor of phospholipase A2, also possesses anti-coagulant properties. Also cleaves the cyclic bond of inositol 1,2-cyclic phosphate to form inositol 1-phosphate.\n ANXA3_HUMAN,P12429\n
## 6 Catalyzes the oxidation of methanethiol, an organosulfur compound known to be produced in substantial amounts by gut bacteria (PubMed:29255262). Selenium-binding protein which may be involved in the sensing of reactive xenobiotics in the cytoplasm. May be involved in intra-Golgi protein transport (By similarity).\n SBP1_HUMAN,Q13228\n
Merge the two retrieved data frames by gene so that the summaries can be added to our numeric fold change data.
montcData <- merge(monotonic1genes,monotonicSumms,by.x='gene',
by.y='gene')
head(montcData)
## gene EnsemblGene
## 1 ADAMTS2 ENSG00000087116
## 2 ANXA3 ENSG00000138772
## 3 IFI27 ENSG00000165949
## 4 IGHV1-14 ENSG00000253709
## 5 IGHV1-24 ENSG00000211950
## 6 IGHV1OR15-9 ENSG00000188403
## EntrezSummary
## 1 This gene encodes a member of the ADAMTS (a disintegrin and metalloproteinase with thrombospondin motifs) protein family. Members of the family share several distinct protein modules, including a propeptide region, a metalloproteinase domain, a disintegrin-like domain, and a thrombospondin type 1 (TS) motif. Individual members of this family differ in the number of C-terminal TS motifs, and some have unique C-terminal domains. The encoded preproprotein is proteolytically processed to generate the mature procollagen N-proteinase. This proteinase excises the N-propeptide of the fibrillar procollagens types I-III and type V. Mutations in this gene cause Ehlers-Danlos syndrome type VIIC, a recessively inherited connective-tissue disorder. Alternative splicing results in multiple transcript variants, at least one of which encodes an isoform that is proteolytically processed. [provided by RefSeq, Feb 2016]
## 2 This gene encodes a member of the annexin family. Members of this calcium-dependent phospholipid-binding protein family play a role in the regulation of cellular growth and in signal transduction pathways. This protein functions in the inhibition of phopholipase A2 and cleavage of inositol 1,2-cyclic phosphate to form inositol 1-phosphate. This protein may also play a role in anti-coagulation. [provided by RefSeq, Jul 2008]
## 3 IFI27 (Interferon Alpha Inducible Protein 27) is a Protein Coding gene. Diseases associated with IFI27 include Hepatitis C Virus and Oral Leukoplakia. Among its related pathways are Interferon gamma signaling and Innate Immune System. Gene Ontology (GO) annotations related to this gene include RNA polymerase II activating transcription factor binding and lamin binding. An important paralog of this gene is IFI27L2.
## 4 IGHV1-14 (Immunoglobulin Heavy Variable 1-14 (Pseudogene)) is a Pseudogene.
## 5 IGHV1-24 (Immunoglobulin Heavy Variable 1-24) is a Protein Coding gene. An important paralog of this gene is IGHV1-69-2.
## 6 IGHV1OR15-9 (Immunoglobulin Heavy Variable 1/OR15-9 (Non-Functional)) is a Pseudogene. An important paralog of this gene is IGHV1OR21-1.
## GeneCardsSummary
## 1 ADAMTS2 (ADAM Metallopeptidase With Thrombospondin Type 1 Motif 2) is a Protein Coding gene. Diseases associated with ADAMTS2 include Ehlers-Danlos Syndrome, Dermatosparaxis Type and Ehlers-Danlos Syndrome. Among its related pathways are Degradation of the extracellular matrix and Metabolism of proteins. Gene Ontology (GO) annotations related to this gene include peptidase activity and metallopeptidase activity. An important paralog of this gene is ADAMTS3.
## 2 ANXA3 (Annexin A3) is a Protein Coding gene. Diseases associated with ANXA3 include Ovarian Cancer and Prostate Cancer. Among its related pathways are Prostaglandin Synthesis and Regulation. Gene Ontology (GO) annotations related to this gene include calcium ion binding and calcium-dependent phospholipid binding. An important paralog of this gene is ANXA4.
## 3 IFI27 (Interferon Alpha Inducible Protein 27) is a Protein Coding gene. Diseases associated with IFI27 include Hepatitis C Virus and Oral Leukoplakia. Among its related pathways are Interferon gamma signaling and Innate Immune System. Gene Ontology (GO) annotations related to this gene include RNA polymerase II activating transcription factor binding and lamin binding. An important paralog of this gene is IFI27L2.
## 4 IGHV1-14 (Immunoglobulin Heavy Variable 1-14 (Pseudogene)) is a Pseudogene.
## 5 IGHV1-24 (Immunoglobulin Heavy Variable 1-24) is a Protein Coding gene. An important paralog of this gene is IGHV1-69-2.
## 6 IGHV1OR15-9 (Immunoglobulin Heavy Variable 1/OR15-9 (Non-Functional)) is a Pseudogene. An important paralog of this gene is IGHV1OR21-1.
## UniProtKB_Summary
## 1 Cleaves the propeptides of type I and II collagen prior to fibril assembly (By similarity). Does not act on type III collagen (By similarity). Cleaves lysyl oxidase LOX at a site downstream of its propeptide cleavage site to produce a short LOX form with reduced collagen-binding activity (PubMed:31152061).\n ATS2_HUMAN,O95450\n
## 2 Inhibitor of phospholipase A2, also possesses anti-coagulant properties. Also cleaves the cyclic bond of inositol 1,2-cyclic phosphate to form inositol 1-phosphate.\n ANXA3_HUMAN,P12429\n
## 3 Probable adapter protein involved in different biological processes (PubMed:22427340, PubMed:27194766). Part of the signaling pathways that lead to apoptosis (PubMed:18330707, PubMed:27673746, PubMed:24970806). Involved in type-I interferon-induced apoptosis characterized by a rapid and robust release of cytochrome C from the mitochondria and activation of BAX and caspases 2, 3, 6, 8 and 9 (PubMed:18330707, PubMed:27673746). Also functions in TNFSF10-induced apoptosis (PubMed:24970806). May also have a function in the nucleus, where it may be involved in the interferon-induced negative regulation of the transcriptional activity of NR4A1, NR4A2 and NR4A3 through the enhancement of XPO1-mediated nuclear export of these nuclear receptors (PubMed:22427340). May thereby play a role in the vascular response to injury (By similarity). In the innate immune response, has an antiviral activity towards hepatitis C virus/HCV (PubMed:27194766, PubMed:27777077). May prevent the replication of the virus by recruiting both the hepatitis C virus non-structural protein 5A/NS5A and the ubiquitination machinery via SKP2, promoting the ubiquitin-mediated proteasomal degradation of NS5A (PubMed:27194766, PubMed:27777077).\n IFI27_HUMAN,P40305\n
## 4 no summary
## 5 V region of the variable domain of immunoglobulin heavy chains that participates in the antigen recognition (PubMed:24600447). Immunoglobulins, also known as antibodies, are membrane-bound or secreted glycoproteins produced by B lymphocytes. In the recognition phase of humoral immunity, the membrane-bound immunoglobulins serve as receptors which, upon binding of a specific antigen, trigger the clonal expansion and differentiation of B lymphocytes into immunoglobulins-secreting plasma cells. Secreted immunoglobulins mediate the effector phase of humoral immunity, which results in the elimination of bound antigens (PubMed:22158414, PubMed:20176268). The antigen binding site is formed by the variable domain of one heavy chain, together with that of its associated light chain. Thus, each immunoglobulin has two antigen binding sites with remarkable affinity for a particular antigen. The variable domains are assembled by a process called V-(D)-J rearrangement and can then be subjected to somatic hypermutations which, after exposure to antigen and selection, allow affinity maturation for a particular antigen (PubMed:20176268, PubMed:17576170).\n HV124_HUMAN,A0A0C4DH33\n
## 6 no summary
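By default `merge()` keeps only rows whose key appears in both data frames, so a gene missing from either side silently disappears. A minimal sketch of a check for that follows; the data frames `fold_toy` and `summ_toy` and the gene names in them are made up for illustration and are not part of the analysis.

```r
# Hypothetical stand-ins for the two data frames being merged.
fold_toy <- data.frame(gene = c("GENE_A", "GENE_B", "GENE_C"),
                       EnsemblGene = c("ID_1", "ID_2", "ID_3"),
                       stringsAsFactors = FALSE)
summ_toy <- data.frame(gene = c("GENE_A", "GENE_B"),
                       EntrezSummary = c("summary A", "summary B"),
                       stringsAsFactors = FALSE)

# Default merge is an inner join: GENE_C has no summary row, so it is dropped.
inner <- merge(fold_toy, summ_toy, by = "gene")

# all.x = TRUE keeps every gene from the first data frame and fills in NA.
left <- merge(fold_toy, summ_toy, by = "gene", all.x = TRUE)

# Genes lost by the inner join.
dropped <- setdiff(fold_toy$gene, inner$gene)
print(dropped)
```

Comparing `nrow()` before and after a merge is a cheap way to notice dropped genes.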
Now combine this with the numeric fold change data.
montcData2 <- rbind(mIncr,mDecr)
montcData3 <- merge(montcData,montcData2, by.y='Ensemblid',
by.x='EnsemblGene')
head(montcData3)
## EnsemblGene gene
## 1 ENSG00000004939 SLC4A1
## 2 ENSG00000069535 MAOB
## 3 ENSG00000076864 RAP1GAP
## 4 ENSG00000087116 ADAMTS2
## 5 ENSG00000138772 ANXA3
## 6 ENSG00000143416 SELENBP1
## EntrezSummary
## 1 The protein encoded by this gene is part of the anion exchanger (AE) family and is expressed in the erythrocyte plasma membrane, where it functions as a chloride/bicarbonate exchanger involved in carbon dioxide transport from tissues to lungs. The protein comprises two domains that are structurally and functionally distinct. The N-terminal 40kDa domain is located in the cytoplasm and acts as an attachment site for the red cell skeleton by binding ankyrin. The glycosylated C-terminal membrane-associated domain contains 12-14 membrane spanning segments and carries out the stilbene disulphonate-sensitive exchange transport of anions. The cytoplasmic tail at the extreme C-terminus of the membrane domain binds carbonic anhydrase II. The encoded protein associates with the red cell membrane protein glycophorin A and this association promotes the correct folding and translocation of the exchanger. This protein is predominantly dimeric but forms tetramers in the presence of ankyrin. Many mutations in this gene are known in man, and these mutations can lead to two types of disease: destabilization of red cell membrane leading to hereditary spherocytosis, and defective kidney acid secretion leading to distal renal tubular acidosis. Other mutations that do not give rise to disease result in novel blood group antigens, which form the Diego blood group system. Southeast Asian ovalocytosis (SAO, Melanesian ovalocytosis) results from the heterozygous presence of a deletion in the encoded protein and is common in areas where Plasmodium falciparum malaria is endemic. One null mutation in this gene is known, resulting in very severe anemia and nephrocalcinosis. [provided by RefSeq, Jul 2008]
## 2 The protein encoded by this gene belongs to the flavin monoamine oxidase family. It is a enzyme located in the mitochondrial outer membrane. It catalyzes the oxidative deamination of biogenic and xenobiotic amines and plays an important role in the metabolism of neuroactive and vasoactive amines in the central nervous sysytem and peripheral tissues. This protein preferentially degrades benzylamine and phenylethylamine. [provided by RefSeq, Jul 2008]
## 3 This gene encodes a type of GTPase-activating-protein (GAP) that down-regulates the activity of the ras-related RAP1 protein. RAP1 acts as a molecular switch by cycling between an inactive GDP-bound form and an active GTP-bound form. The product of this gene, RAP1GAP, promotes the hydrolysis of bound GTP and hence returns RAP1 to the inactive state whereas other proteins, guanine nucleotide exchange factors (GEFs), act as RAP1 activators by facilitating the conversion of RAP1 from the GDP- to the GTP-bound form. In general, ras subfamily proteins, such as RAP1, play key roles in receptor-linked signaling pathways that control cell growth and differentiation. RAP1 plays a role in diverse processes such as cell proliferation, adhesion, differentiation, and embryogenesis. Alternative splicing results in multiple transcript variants encoding distinct proteins. [provided by RefSeq, Aug 2011]
## 4 This gene encodes a member of the ADAMTS (a disintegrin and metalloproteinase with thrombospondin motifs) protein family. Members of the family share several distinct protein modules, including a propeptide region, a metalloproteinase domain, a disintegrin-like domain, and a thrombospondin type 1 (TS) motif. Individual members of this family differ in the number of C-terminal TS motifs, and some have unique C-terminal domains. The encoded preproprotein is proteolytically processed to generate the mature procollagen N-proteinase. This proteinase excises the N-propeptide of the fibrillar procollagens types I-III and type V. Mutations in this gene cause Ehlers-Danlos syndrome type VIIC, a recessively inherited connective-tissue disorder. Alternative splicing results in multiple transcript variants, at least one of which encodes an isoform that is proteolytically processed. [provided by RefSeq, Feb 2016]
## 5 This gene encodes a member of the annexin family. Members of this calcium-dependent phospholipid-binding protein family play a role in the regulation of cellular growth and in signal transduction pathways. This protein functions in the inhibition of phopholipase A2 and cleavage of inositol 1,2-cyclic phosphate to form inositol 1-phosphate. This protein may also play a role in anti-coagulation. [provided by RefSeq, Jul 2008]
## 6 This gene encodes a member of the selenium-binding protein family. Selenium is an essential nutrient that exhibits potent anticarcinogenic properties, and deficiency of selenium may cause certain neurologic diseases. The effects of selenium in preventing cancer and neurologic diseases may be mediated by selenium-binding proteins, and decreased expression of this gene may be associated with several types of cancer. The encoded protein may play a selenium-dependent role in ubiquitination/deubiquitination-mediated protein degradation. Alternatively spliced transcript variants encoding multiple isoforms have been observed for this gene. [provided by RefSeq, Apr 2012]
## GeneCardsSummary
## 1 SLC4A1 (Solute Carrier Family 4 Member 1 (Diego Blood Group)) is a Protein Coding gene. Diseases associated with SLC4A1 include Renal Tubular Acidosis, Distal, Autosomal Dominant and Cryohydrocytosis. Among its related pathways are Transport of glucose and other sugars, bile salts and organic acids, metal ions and amine compounds and Neuroscience. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and transporter activity. An important paralog of this gene is SLC4A2.
## 2 MAOB (Monoamine Oxidase B) is a Protein Coding gene. Diseases associated with MAOB include Norrie Disease and Pathological Gambling. Among its related pathways are Tyrosine metabolism and Activated PKN1 stimulates transcription of AR (androgen receptor) regulated genes KLK2 and KLK3. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and electron transfer activity. An important paralog of this gene is MAOA.
## 3 RAP1GAP (RAP1 GTPase Activating Protein) is a Protein Coding gene. Diseases associated with RAP1GAP include Tuberous Sclerosis and Bleeding Disorder, Platelet-Type, 18. Among its related pathways are Ras signaling pathway and Development Angiotensin activation of ERK. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and GTPase activator activity. An important paralog of this gene is RAP1GAP2.
## 4 ADAMTS2 (ADAM Metallopeptidase With Thrombospondin Type 1 Motif 2) is a Protein Coding gene. Diseases associated with ADAMTS2 include Ehlers-Danlos Syndrome, Dermatosparaxis Type and Ehlers-Danlos Syndrome. Among its related pathways are Degradation of the extracellular matrix and Metabolism of proteins. Gene Ontology (GO) annotations related to this gene include peptidase activity and metallopeptidase activity. An important paralog of this gene is ADAMTS3.
## 5 ANXA3 (Annexin A3) is a Protein Coding gene. Diseases associated with ANXA3 include Ovarian Cancer and Prostate Cancer. Among its related pathways are Prostaglandin Synthesis and Regulation. Gene Ontology (GO) annotations related to this gene include calcium ion binding and calcium-dependent phospholipid binding. An important paralog of this gene is ANXA4.
## 6 SELENBP1 (Selenium Binding Protein 1) is a Protein Coding gene. Diseases associated with SELENBP1 include Extraoral Halitosis Due To Methanethiol Oxidase Deficiency and Posteroinferior Myocardial Infarction. Among its related pathways are IL1 and megakaryocytes in obesity and Metabolism. Gene Ontology (GO) annotations related to this gene include selenium binding.
## UniProtKB_Summary
## 1 Functions both as a transporter that mediates electroneutral anion exchange across the cell membrane and as a structural protein. Major integral membrane glycoprotein of the erythrocyte membrane; required for normal flexibility and stability of the erythrocyte membrane and for normal erythrocyte shape via the interactions of its cytoplasmic domain with cytoskeletal proteins, glycolytic enzymes, and hemoglobin. Functions as a transporter that mediates the 1:1 exchange of inorganic anions across the erythrocyte membrane. Mediates chloride-bicarbonate exchange in the kidney, and is required for normal acidification of the urine.\n B3AT_HUMAN,P02730\n
## 2 Catalyzes the oxidative deamination of biogenic and xenobiotic amines and has important functions in the metabolism of neuroactive and vasoactive amines in the central nervous system and peripheral tissues. MAOB preferentially degrades benzylamine and phenylethylamine.\n AOFB_HUMAN,P27338\n
## 3 GTPase activator for the nuclear Ras-related regulatory protein RAP-1A (KREV-1), converting it to the putatively inactive GDP-bound state.\n RPGP1_HUMAN,P47736\n
## 4 Cleaves the propeptides of type I and II collagen prior to fibril assembly (By similarity). Does not act on type III collagen (By similarity). Cleaves lysyl oxidase LOX at a site downstream of its propeptide cleavage site to produce a short LOX form with reduced collagen-binding activity (PubMed:31152061).\n ATS2_HUMAN,O95450\n
## 5 Inhibitor of phospholipase A2, also possesses anti-coagulant properties. Also cleaves the cyclic bond of inositol 1,2-cyclic phosphate to form inositol 1-phosphate.\n ANXA3_HUMAN,P12429\n
## 6 Catalyzes the oxidation of methanethiol, an organosulfur compound known to be produced in substantial amounts by gut bacteria (PubMed:29255262). Selenium-binding protein which may be involved in the sensing of reactive xenobiotics in the cytoplasm. May be involved in intra-Golgi protein transport (By similarity).\n SBP1_HUMAN,Q13228\n
## Moderate.Mean Severe.Mean ICU.mean Healthy.Mean ICU.health.foldChange
## 1 193.75 244.750 491.50 6.529412 75.27477
## 2 5.50 61.375 209.75 4.705882 44.57187
## 3 32.00 51.000 157.00 1.647059 95.32143
## 4 5.00 597.625 600.50 7.352941 81.66800
## 5 8.25 39.250 157.25 4.941176 31.82440
## 6 261.25 338.750 1049.75 8.411765 124.79545
## mod.health.foldChange sevr.health.foldChange
## 1 29.673423 37.484234
## 2 1.168750 13.042188
## 3 19.428571 30.964286
## 4 0.680000 81.277000
## 5 1.669643 7.943452
## 6 31.057692 40.270979
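The fold change columns printed above are consistent with each group mean divided by the healthy mean. The sketch below recomputes them from the six printed rows as a sanity check; the data frame name `means` and the shortened column names are illustrative only.

```r
# Group means copied from the six rows printed above.
means <- data.frame(
  gene     = c("SLC4A1", "MAOB", "RAP1GAP", "ADAMTS2", "ANXA3", "SELENBP1"),
  moderate = c(193.75, 5.50, 32.00, 5.00, 8.25, 261.25),
  severe   = c(244.750, 61.375, 51.000, 597.625, 39.250, 338.750),
  icu      = c(491.50, 209.75, 157.00, 600.50, 157.25, 1049.75),
  healthy  = c(6.529412, 4.705882, 1.647059, 7.352941, 4.941176, 8.411765),
  stringsAsFactors = FALSE
)

# Fold change relative to healthy = group mean / healthy mean.
means$mod.fc  <- means$moderate / means$healthy
means$sevr.fc <- means$severe / means$healthy
means$ICU.fc  <- means$icu / means$healthy

print(means[, c("gene", "mod.fc", "sevr.fc", "ICU.fc")])
```

For example, SLC4A1 has an ICU mean of 491.50 and a healthy mean of 6.529412, giving the printed ICU fold change of about 75.27.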
Let's write this out to CSV and create the same charts, this time adding the annotations and gene symbols.
write.csv(montcData3,'monotonicIncDecrSumms.csv',row.names=FALSE)
Now that the new data carries the summary annotations and the gene symbol instead of the Ensembl ID, I have made the Tableau charts and uploaded them to Tableau Public Server.
The first chart shows the monotonically decreasing genes, selected for a moderate fold change greater than 20. Hovering over a bar shows the mean value and fold change for that disease state (the fold change also appears in the labels), along with the Entrez gene summary for the gene. There are two images of the chart: one with no bar hovered over, and one with a bar hovered over to show the details for that gene and disease state.
The updated second chart, under a new name, shows the gene symbol and Entrez gene summary annotation for each gene whose fold change increases monotonically from the least to the most severe COVID-19 cases, selected for an ICU fold change greater than 30.
Figure: monotonically increasing genes, ICU fold change greater than 30.
Figure: monotonically increasing genes, ICU fold change greater than 30, hovered over for details.
Next, let's see whether machine learning, using just the random forest model and three each of the top increasing and decreasing genes, can classify COVID-19 severity.
Let's add some conditional features to this data.
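# percentBW_MSI = 1 when fold change falls steeply with severity:
# Moderate > 1.5 x Severe and Severe > 1.5 x ICU.
montcData3$percentBW_MSI <- ifelse((montcData3$mod.health.foldChange>1.5*
                                      montcData3$sevr.health.foldChange &
                                      montcData3$sevr.health.foldChange>1.5*
                                      montcData3$ICU.health.foldChange),1,0)
# percentBW_ISM = 1 when fold change more than doubles at each step:
# Moderate < 0.5 x Severe and Severe < 0.5 x ICU.
montcData3$percentBW_ISM <- ifelse((montcData3$mod.health.foldChange<.5*
                                      montcData3$sevr.health.foldChange &
                                      montcData3$sevr.health.foldChange<.5*
                                      montcData3$ICU.health.foldChange),1,0)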
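To see how the two flags behave, the sketch below applies the same conditions to the six rows of fold change values printed earlier. The data frame `fc` and its short column names are illustrative, not objects from the analysis.

```r
# Fold change values copied from the six rows printed earlier.
fc <- data.frame(
  gene = c("SLC4A1", "MAOB", "RAP1GAP", "ADAMTS2", "ANXA3", "SELENBP1"),
  mod  = c(29.673423, 1.168750, 19.428571, 0.680000, 1.669643, 31.057692),
  sevr = c(37.484234, 13.042188, 30.964286, 81.277000, 7.943452, 40.270979),
  ICU  = c(75.27477, 44.57187, 95.32143, 81.66800, 31.82440, 124.79545),
  stringsAsFactors = FALSE
)

# Steeply increasing: each step at least doubles the previous fold change.
fc$ISM <- as.integer(fc$mod < 0.5 * fc$sevr & fc$sevr < 0.5 * fc$ICU)

# Steeply decreasing: each step is at least 1.5 times the next one.
fc$MSI <- as.integer(fc$mod > 1.5 * fc$sevr & fc$sevr > 1.5 * fc$ICU)

print(fc[, c("gene", "ISM", "MSI")])
```

Of these six genes only MAOB and ANXA3 pass the increasing test. ADAMTS2 fails because its severe and ICU fold changes are nearly equal (81.28 versus 81.67), so the second step does not double.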
The flags now mark our three steepest decreasing and three steepest increasing genes; let's extract them.
least3 <- montcData3$gene[montcData3$percentBW_MSI==1]
most3 <- montcData3$gene[montcData3$percentBW_ISM==1]
Our ML genes are:
ML_genes <- c(as.character(least3), as.character(most3))
ML_genes
## [1] "IGHV1OR15-9" "IGLV1-41" "IGHV7-4-1" "MAOB" "ANXA3"
## [6] "METTL7B"
ML_DF_leastMost <- subset(montcData3,montcData3$gene %in% ML_genes)
ML_ensmbl_LstMst <- ML_DF_leastMost[,1:2]
ML_ensmbl_LstMst
## EnsemblGene gene
## 2 ENSG00000069535 MAOB
## 5 ENSG00000138772 ANXA3
## 8 ENSG00000170439 METTL7B
## 9 ENSG00000188403 IGHV1OR15-9
## 14 ENSG00000253818 IGLV1-41
## 15 ENSG00000282122 IGHV7-4-1
Let's import the original data. When these genes were extracted in Tableau using its filters, the filtering principles were not the same as those used to create our 10 most and least monotonic genes by fold change.
originalDF <- read.delim('GSE152418_p20047_Study1_RawCounts.txt', sep='\t',
                         header=TRUE, na.strings=c('',' ','NA'),
                         stringsAsFactors=FALSE)
colnames(originalDF)
## [1] "ENSEMBLID" "S145_nCOV001_C"
## [3] "S147_nCoV001EUHM.Draw.1" "S149_nCoV002EUHM.Draw.2"
## [5] "S150_nCoV003EUHM.Draw.1" "S151_nCoV004EUHM.Draw.1"
## [7] "S152_nCoV006EUHM.Draw.1" "S153_nCoV007EUHM.Draw.1"
## [9] "S154_nCoV0010EUHM.Draw.1" "S155_nCOV021EUHM"
## [11] "S156_nCOV024EUHM.Draw.1" "S157_nCOV0029EUHM"
## [13] "S175_nCoV024EUHM.Draw.2" "S176_nCoV025EUHM.Draw.1"
## [15] "S177_nCoV025EUHM.Draw.2" "S178_nCoV028EUHM.Draw.1"
## [17] "S179_nCoV033EUHM.Draw.1" "S180_nCoV034EUHM.Draw.1"
## [19] "S061_257" "S062_258"
## [21] "S063_259" "S064_260"
## [23] "S065_261" "S066_265"
## [25] "S067_270" "S068_272"
## [27] "S069_273" "S070_279"
## [29] "S071_280" "S181_255"
## [31] "S182_SHXA10" "S183_263"
## [33] "S184_SHXA18" "S185_266"
## [35] "S186_SHXA14"
Let's look again at the header information, which holds the demographics of each sample.
HeaderInformation
## GSM_ID CN_old CN_new age gender diseaseState
## 1 GSM4615003 S061_257 healthy1 25 M Healthy
## 2 GSM4615006 S062_258 healthy2 70 F Healthy
## 3 GSM4615008 S063_259 healthy3 68 F Healthy
## 4 GSM4615011 S064_260 healthy4 69 M Healthy
## 5 GSM4615014 S065_261 healthy5 29 F Healthy
## 6 GSM4615016 S066_265 healthy6 90 M Healthy
## 7 GSM4615019 S067_270 healthy7 85 F Healthy
## 8 GSM4615022 S068_272 healthy8 28 M Healthy
## 9 GSM4615025 S069_273 healthy9 26 F Healthy
## 10 GSM4615027 S070_279 healthy10 38 M Healthy
## 11 GSM4615030 S071_280 healthy11 84 F Healthy
## 12 GSM4614985 S145_nCOV001_C convalescent 80 M Convalescent
## 13 GSM4614986 S147_nCoV001EUHM-Draw-1 moderate1 75 F COVID-19
## 14 GSM4614987 S149_nCoV002EUHM-Draw-2 severe1 54 M COVID-19
## 15 GSM4614988 S150_nCoV003EUHM-Draw-1 severe2 75 F COVID-19
## 16 GSM4614989 S151_nCoV004EUHM-Draw-1 ICU1 59 M COVID-19
## 17 GSM4614990 S152_nCoV006EUHM-Draw-1 severe3 59 M COVID-19
## 18 GSM4614991 S153_nCoV007EUHM-Draw-1 moderate2 53 F COVID-19
## 19 GSM4614992 S154_nCoV0010EUHM-Draw-1 severe4 64 F COVID-19
## 20 GSM4614993 S155_nCOV021EUHM ICU2 60 F COVID-19
## 21 GSM4614994 S156_nCOV024EUHM-Draw-1 severe5 48 M COVID-19
## 22 GSM4614995 S157_nCOV0029EUHM moderate3 47 F COVID-19
## 23 GSM4614996 S175_nCoV024EUHM-Draw-2 ICU3 48 M COVID-19
## 24 GSM4614997 S176_nCoV025EUHM-Draw-1 severe6 56 F COVID-19
## 25 GSM4614998 S177_nCoV025EUHM-Draw-2 ICU4 56 F COVID-19
## 26 GSM4614999 S178_nCoV028EUHM-Draw-1 severe7 56 M COVID-19
## 27 GSM4615000 S179_nCoV033EUHM-Draw-1 severe8 76 M COVID-19
## 28 GSM4615001 S180_nCoV034EUHM-Draw-1 moderate4 46 F COVID-19
## 29 GSM4615032 S181_255 healthy12 27 M Healthy
## 30 GSM4615033 S182_SHXA10 healthy13 50 F Healthy
## 31 GSM4615034 S183_263 healthy14 23 F Healthy
## 32 GSM4615035 S184_SHXA18 healthy15 55 M Healthy
## 33 GSM4615036 S185_266 healthy16 91 F Healthy
## 34 GSM4615037 S186_SHXA14 healthy17 57 M Healthy
## severity geographicLocation cellType
## 1 Healthy Atlanta, GA, USA PBMC
## 2 Healthy Atlanta, GA, USA PBMC
## 3 Healthy Atlanta, GA, USA PBMC
## 4 Healthy Atlanta, GA, USA PBMC
## 5 Healthy Atlanta, GA, USA PBMC
## 6 Healthy Atlanta, GA, USA PBMC
## 7 Healthy Atlanta, GA, USA PBMC
## 8 Healthy Atlanta, GA, USA PBMC
## 9 Healthy Atlanta, GA, USA PBMC
## 10 Healthy Atlanta, GA, USA PBMC
## 11 Healthy Atlanta, GA, USA PBMC
## 12 Convalescent Atlanta, GA, USA PBMC
## 13 Moderate Atlanta, GA, USA PBMC
## 14 Severe Atlanta, GA, USA PBMC
## 15 Severe Atlanta, GA, USA PBMC
## 16 ICU Atlanta, GA, USA PBMC
## 17 Severe Atlanta, GA, USA PBMC
## 18 Moderate Atlanta, GA, USA PBMC
## 19 Severe Atlanta, GA, USA PBMC
## 20 ICU Atlanta, GA, USA PBMC
## 21 Severe Atlanta, GA, USA PBMC
## 22 Moderate Atlanta, GA, USA PBMC
## 23 ICU Atlanta, GA, USA PBMC
## 24 Severe Atlanta, GA, USA PBMC
## 25 ICU Atlanta, GA, USA PBMC
## 26 Severe Atlanta, GA, USA PBMC
## 27 Severe Atlanta, GA, USA PBMC
## 28 Moderate Atlanta, GA, USA PBMC
## 29 Healthy Atlanta, GA, USA PBMC
## 30 Healthy Atlanta, GA, USA PBMC
## 31 Healthy Atlanta, GA, USA PBMC
## 32 Healthy Atlanta, GA, USA PBMC
## 33 Healthy Atlanta, GA, USA PBMC
## 34 Healthy Atlanta, GA, USA PBMC
cn <- as.data.frame(colnames(originalDF)[2:35])
colnames(cn) <- 'originalNames'
cn$originalNames <- gsub('[.]','-',cn$originalNames,perl=T)
cn2 <- HeaderInformation[,2:3]
newColumnNames <- merge(cn,cn2,by.x='originalNames',by.y='CN_old')
Once merged, change the names back to the dotted form so they can be merged with originalDF to attach the new names.
newColumnNames$originalNames <- gsub('-','.',
newColumnNames$originalNames,
perl=T)
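The back-and-forth with `gsub` is needed because `read.delim`, with its default `check.names = TRUE`, passes column names through `make.names`, which turns the hyphens in the sample IDs into dots. A small sketch of that round trip, using three sample IDs from the header table (the variable names are illustrative):

```r
# Sample IDs as they appear in the CN_old column of HeaderInformation.
cn_old <- c("S147_nCoV001EUHM-Draw-1", "S155_nCOV021EUHM", "S061_257")

# read.delim() sanitises headers with make.names() by default,
# so hyphens become dots in the column names of the imported table.
as_read <- make.names(cn_old)
print(as_read)

# Dots -> hyphens, to match CN_old for the merge ...
to_merge <- gsub("[.]", "-", as_read)

# ... and hyphens -> dots again, to match the imported column names.
back <- gsub("-", ".", to_merge)
```

Passing `check.names = FALSE` to `read.delim` would keep the hyphens and make both substitutions unnecessary, at the cost of column names that need backticks when referenced with `$`.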
DF1 <- merge(ML_ensmbl_LstMst,originalDF,by.x='EnsemblGene',
by.y='ENSEMBLID')
DF2 <- DF1[,-1]
row.names(DF2) <- DF2$gene
DF3 <- DF2[,-1]
DF4 <- as.data.frame(t(DF3))
DF4$sample1 <- row.names(DF4)
DF5 <- merge(DF4,newColumnNames, by.y='originalNames',by.x='sample1')
row.names(DF5) <- DF5$sample1
ML_dataframe3s <- DF5[,-1]
head(ML_dataframe3s)
## MAOB ANXA3 METTL7B IGHV1OR15-9 IGLV1-41 IGHV7-4-1 CN_new
## S061_257 3 14 1 0 8 4 healthy1
## S062_258 8 5 4 0 0 1 healthy2
## S063_259 11 5 8 1 24 0 healthy3
## S064_260 6 2 3 0 2 2 healthy4
## S065_261 8 3 24 0 70 606 healthy5
## S066_265 3 8 31 0 1 29 healthy6
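The reshaping above turns a genes-by-samples table into a samples-by-genes table and then attaches the class label for each sample. A minimal sketch of that pattern on toy data follows. The names `toyGenes`, `toyLabels`, and `toyML` and the values in them are made up for illustration and are not part of the original script.

```r
# Toy genes-by-samples table (made-up values, not from GSE152418)
toyGenes <- data.frame(gene = c('MAOB', 'ANXA3'),
                       S1 = c(3, 14),
                       S2 = c(8, 5),
                       stringsAsFactors = FALSE)

# Toy lookup of sample name to class label
toyLabels <- data.frame(sample = c('S1', 'S2'),
                        CN_new = c('healthy1', 'moderate1'),
                        stringsAsFactors = FALSE)

# Move gene names into the row names, then transpose so samples become rows
wide <- toyGenes[, -1]
row.names(wide) <- toyGenes$gene
bySample <- as.data.frame(t(wide))

# Keep the sample name as a column so it can be the merge key
bySample$sample <- row.names(bySample)
toyML <- merge(bySample, toyLabels, by = 'sample')

# Restore sample names as row names and drop the key column
row.names(toyML) <- toyML$sample
toyML <- toyML[, -1]
toyML
```

Each row is now one sample, with one column per gene and the class label last, which is the layout caret's formula interface expects.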
aliases <- ML_dataframe3s$CN_new
ML_dataframe3s$CN_new <- gsub('[0-9]','',ML_dataframe3s$CN_new)
Now let’s see how well this new set of genes predicts our COVID-19 classes. Let’s sort out the convalescent class first.
ML_dataframe3s$CN_new <- as.factor(paste(ML_dataframe3s$CN_new))
row.names(ML_dataframe3s) <- NULL
set.seed(9875)
inTrain <- createDataPartition(y=ML_dataframe3s$CN_new, p=0.7, list=FALSE)
trainingSet <- ML_dataframe3s[inTrain,]
testingSet <- ML_dataframe3s[-inTrain,]
dim(trainingSet)
## [1] 25 7
dim(testingSet)
## [1] 9 7
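The 25 and 9 rows above come from sampling within each class rather than from the 34 rows as a whole. The sketch below is a base-R approximation of that idea, not caret's own code. `stratifiedSplit` is a hypothetical helper, and it assumes each class contributes ceiling(p * n) of its rows to training. The class counts are the ones in the severity column printed earlier.

```r
# Hypothetical helper: a base-R approximation of a stratified split
stratifiedSplit <- function(y, p = 0.7) {
  idxByClass <- split(seq_along(y), y)
  picked <- lapply(idxByClass, function(idx) {
    k <- ceiling(p * length(idx))
    # Index through sample.int() so a class with a single row is handled correctly
    idx[sample.int(length(idx), k)]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Class counts taken from the severity column: 17, 1, 4, 8, and 4 samples
severity <- rep(c('healthy', 'convalescent', 'moderate', 'severe', 'ICU'),
                times = c(17, 1, 4, 8, 4))

set.seed(9875)
inTrainSketch <- stratifiedSplit(severity, p = 0.7)

length(inTrainSketch)                     # 25 training rows
length(severity) - length(inTrainSketch)  # 9 testing rows
table(severity[inTrainSketch])
```

Under that assumption the single convalescent sample always lands in the training set, so the testing set never contains that class.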
Let’s first test how well lda predicts the convalescent sample when the model is trained on all of the samples, including that one.
convalescent <- subset(ML_dataframe3s, ML_dataframe3s$CN_new=='convalescent')
set.seed(505005)
ldaMod <- train(CN_new~., method='lda', data=ML_dataframe3s,
trControl=trainControl(method='boot'))
predlda <- predict(ldaMod, convalescent)
DF_ldaConvalescent <- data.frame(predlda, type=convalescent$CN_new)
DF_ldaConvalescent
## predlda type
## 1 healthy convalescent
precisionRecallAccuracy(DF_ldaConvalescent)
## class precision recall accuracy
## 1 convalescent 0 0 0
## 2 NA 0 0 1
## 3 NA 0 0 1
## 4 NA 0 0 1
The LDA model had only 1 of the 34 samples to learn the convalescent class from, and it predicted that sample to be healthy. Maybe the sample was misclassified. The table above shows the accuracy measures. Let’s try the random forest model too.
set.seed(505005)
rfMod <- train(CN_new~., method='rf', data=ML_dataframe3s,
trControl=trainControl(method='boot'))
predrf <- predict(rfMod, convalescent)
DF_rfConvalescent <- data.frame(predrf, type=convalescent$CN_new)
DF_rfConvalescent
## predrf type
## 1 convalescent convalescent
Interestingly, the random forest model predicted the convalescent sample to be its own class of ‘convalescent’, even though it also had only one sample of ‘convalescent’ to train on. This is likely because the sample being predicted is the exact sample the model trained on, so its values are identical.
precisionRecallAccuracy(DF_rfConvalescent)
## class precision recall accuracy
## 1 convalescent 1 1 1
## 2 NA 0 0 1
## 3 NA 0 0 1
## 4 NA 0 0 1
The above measures show that the precision, recall, and accuracy were all 100% for the convalescent class. What if we transform the testing sample and test the rf model again? Let’s exponentiate it, replacing each value x with e^x.
convalescent2 <- convalescent
convalescent2[,1:6] <- exp(convalescent2[,1:6])
set.seed(505005)
rfMod <- train(CN_new~., method='rf', data=ML_dataframe3s,
trControl=trainControl(method='boot'))
predrf <- predict(rfMod, convalescent2)
DF_rfConvalescent2 <- data.frame(predrf, type=convalescent2$CN_new)
DF_rfConvalescent2
## predrf type
## 1 moderate convalescent
When we replaced each value in the convalescent sample with its exponential, e^x, the random forest classified the sample as moderate. We will assume the sample is healthy or moderate and not its own class of convalescent.
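The chunk that transforms the convalescent sample applies exp() to the six gene columns, which raises e to the power of each value rather than multiplying each value by e. The sketch below contrasts the two on made-up counts (not values from the dataset). Multiplying by a constant keeps a perfect linear relationship with the original values, while exponentiating does not and can push values far beyond anything the model saw in training.

```r
counts <- c(0, 1, 2, 5, 10)    # made-up values for illustration only

scaled <- counts * exp(1)      # multiplying by e: a linear rescaling
exponentiated <- exp(counts)   # e^x: a nonlinear transform

cor(counts, scaled)            # perfectly correlated with the original
cor(counts, exponentiated)     # no longer perfectly correlated
max(exponentiated)             # e^10 is already above 22000
```

This is one plausible reason the prediction changed: the transformed sample no longer resembles the row the forest memorized.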
If we remove the convalescent sample from the training data for the random forest model, will it predict moderate, healthy, or a more severe form? Let’s also change the class to moderate in the convalescent testing set.
MLDF <- subset(ML_dataframe3s, ML_dataframe3s$CN_new!='convalescent')
MLDF$CN_new <- as.character(paste(MLDF$CN_new))
MLDF$CN_new <- as.factor(MLDF$CN_new)
DFc <- convalescent[,1:6]
DFc$CN_new <- as.factor(paste('moderate'))
set.seed(505005)
rfMod <- train(CN_new~., method='rf', data=MLDF,
trControl=trainControl(method='boot'))
predrf <- predict(rfMod, DFc)
DF_rfConvalescent3 <- data.frame(predrf, type=DFc$CN_new)
DF_rfConvalescent3
## predrf type
## 1 healthy moderate
The random forest model predicted the convalescent sample to be healthy. The type column says moderate because I had to change the class label or the model threw an exception. I also had to reduce the original 5 factor levels in the training subset to 4, because a factor remembers the omitted class of convalescent even after its rows are removed. So in the previous chunk the outcome feature was converted to character and then back to factor to drop the remembered level of ‘convalescent.’
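The remembered factor level can be seen on a small example. The sketch below uses a made-up `severity` vector, not the real data, and shows that subsetting keeps the unused level, that the character round trip drops it, and that base R's droplevels() does the same in one step.

```r
# Made-up labels for illustration
severity <- factor(c('healthy', 'convalescent', 'moderate',
                     'severe', 'ICU', 'healthy'))

# Subsetting removes the rows but not the level
kept <- severity[severity != 'convalescent']
nlevels(kept)                       # still 5
'convalescent' %in% levels(kept)    # TRUE

# The approach used above: factor -> character -> factor
viaCharacter <- as.factor(as.character(kept))
nlevels(viaCharacter)               # 4

# An equivalent one-step alternative
viaDroplevels <- droplevels(kept)
nlevels(viaDroplevels)              # 4
```

Either way, the model is then trained on four classes instead of five.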
We can do the same thing with all of the data by relabeling the convalescent sample as healthy and checking whether the random forest model classifies it as healthy.
MLDF2 <- ML_dataframe3s
MLDF2[MLDF2$CN_new=='convalescent',7] <- 'healthy'
MLDF2$CN_new <- as.character(paste(MLDF2$CN_new))
MLDF2$CN_new <- as.factor(MLDF2$CN_new)
DFc <- convalescent[,1:6]
DFc$CN_new <- as.factor(paste('healthy'))
set.seed(505005)
rfMod <- train(CN_new~., method='rf', data=MLDF2,
trControl=trainControl(method='boot'))
predrf <- predict(rfMod, DFc)
DF_rfConvalescent4 <- data.frame(predrf, type=DFc$CN_new)
DF_rfConvalescent4
## predrf type
## 1 healthy healthy
So we see that including the convalescent sample as a healthy sample didn’t stop the random forest model from predicting it as healthy. Let’s go ahead and keep this dataset, MLDF2, with the convalescent sample labeled as healthy.
grep('convalescent',ML_dataframe3s$CN_new)
## [1] 12
We know that the original convalescent sample is observation (row) 12 in our data. Let’s test these new genes on classifying our four classes and see if there is an improvement over our other models.
set.seed(9875)
inTrain <- createDataPartition(y=MLDF2$CN_new, p=0.7, list=FALSE)
trainingSet <- MLDF2[inTrain,]
testingSet <- MLDF2[-inTrain,]
dim(trainingSet)
## [1] 25 7
dim(testingSet)
## [1] 9 7
set.seed(505005)
rfMod <- train(CN_new~., method='rf', data=trainingSet,
trControl=trainControl(method='boot'))
predrf <- predict(rfMod, testingSet)
DF_RF <- data.frame(predrf, type=testingSet$CN_new)
DF_RF
## predrf type
## 1 healthy healthy
## 2 healthy healthy
## 3 healthy healthy
## 4 healthy healthy
## 5 healthy moderate
## 6 ICU severe
## 7 severe ICU
## 8 ICU severe
## 9 healthy healthy
precisionRecallAccuracy(DF_RF)
## class precision recall accuracy
## 1 healthy 0.833333333333333 1 0.888888888888889
## 2 moderate 0 0 0.888888888888889
## 3 severe 0 0 0.666666666666667
## 4 ICU 0 0 0.666666666666667
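The table above is consistent with one-vs-rest counting, where each class in turn is treated as the positive class and everything else as negative. The sketch below is a hypothetical stand-in, `perClassMetrics`, and not the precisionRecallAccuracy function from the sourced script, whose code is not shown here. It assumes precision and recall are reported as 0 when their denominator is 0, and it is run on the nine random forest predictions printed above.

```r
# Hypothetical stand-in for the script's own metric function
perClassMetrics <- function(pred, truth) {
  classes <- union(unique(truth), unique(pred))
  rows <- lapply(classes, function(cl) {
    tp <- sum(pred == cl & truth == cl)   # predicted cl, truly cl
    fp <- sum(pred == cl & truth != cl)   # predicted cl, truly something else
    fn <- sum(pred != cl & truth == cl)   # missed a true cl
    tn <- sum(pred != cl & truth != cl)   # correctly not cl
    data.frame(class = cl,
               precision = if (tp + fp == 0) 0 else tp / (tp + fp),
               recall = if (tp + fn == 0) 0 else tp / (tp + fn),
               accuracy = (tp + tn) / length(truth),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# The nine random forest predictions and true labels from DF_RF
pred <- c('healthy', 'healthy', 'healthy', 'healthy', 'healthy',
          'ICU', 'severe', 'ICU', 'healthy')
truth <- c('healthy', 'healthy', 'healthy', 'healthy', 'moderate',
           'severe', 'ICU', 'severe', 'healthy')

rfMetrics <- perClassMetrics(pred, truth)
rfMetrics
```

Note that per-class accuracy stays high for moderate even though no moderate sample was found, because the true negatives dominate. Precision and recall are the more telling columns here.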
Nope! Not an improvement using these genes. Let’s see how lda and knn do.
set.seed(505005)
LDAMod <- train(CN_new~., method='lda', data=trainingSet,
trControl=trainControl(method='boot'))
predLDA <- predict(LDAMod, testingSet)
DF_LDA <- data.frame(predLDA, type=testingSet$CN_new)
DF_LDA
## predLDA type
## 1 healthy healthy
## 2 healthy healthy
## 3 healthy healthy
## 4 healthy healthy
## 5 healthy moderate
## 6 ICU severe
## 7 severe ICU
## 8 severe severe
## 9 healthy healthy
precisionRecallAccuracy(DF_LDA)
## class precision recall accuracy
## 1 healthy 0.833333333333333 1 0.888888888888889
## 2 moderate 0 0 0.888888888888889
## 3 severe 0.5 0.5 0.777777777777778
## 4 ICU 0 0 0.777777777777778
The LDA model scored slightly better than the random forest in precision, recall, and accuracy for the severe class, and the same for the healthy class. Now for the knn model.
set.seed(505005)
KNNMod <- train(CN_new~., method='knn', data=trainingSet,
trControl=trainControl(method='boot'))
predKNN <- predict(KNNMod, testingSet)
DF_KNN <- data.frame(predKNN, type=testingSet$CN_new)
DF_KNN
## predKNN type
## 1 healthy healthy
## 2 healthy healthy
## 3 moderate healthy
## 4 healthy healthy
## 5 severe moderate
## 6 ICU severe
## 7 severe ICU
## 8 healthy severe
## 9 healthy healthy
precisionRecallAccuracy(DF_KNN)
## class precision recall accuracy
## 1 healthy 0.8 0.8 0.777777777777778
## 2 moderate 0 0 0.777777777777778
## 3 severe 0 0 0.555555555555556
## 4 ICU 0 0 0.777777777777778
The KNN model didn’t score any better with these genes; it did slightly worse than the random forest and lda models. Let’s see if gbm and rpart make a difference. gbm is gradient boosting machines, glm, which we could also try, is generalized linear models, and rpart is recursive partitioning, which builds a single decision tree rather than the ensemble of many trees that a random forest builds.
The rpart model:
set.seed(505005)
RPARTMod <- train(CN_new~., method='rpart', data=trainingSet,
trControl=trainControl(method='cv'))
predRPART <- predict(RPARTMod, testingSet)
DF_RPART <- data.frame(predRPART, type=testingSet$CN_new)
DF_RPART
The rpart model produced an ‘undefined columns’ error with these settings.
The gbm model:
set.seed(505005)
GBMMod <- train(CN_new~., method='gbm', data=trainingSet,
trControl=trainControl(method='cv'))
predGBM <- predict(GBMMod, testingSet)
DF_GBM <- data.frame(predGBM, type=testingSet$CN_new)
DF_GBM
The gbm model produced a different error: ‘Something is wrong; all the Accuracy metric values are missing’.
The glm model:
set.seed(505005)
GLMMod <- train(CN_new~., method='glm', data=trainingSet,
trControl=trainControl(method='boot'))
predGLM <- predict(GLMMod, testingSet)
DF_GLM <- data.frame(predGLM, type=testingSet$CN_new)
DF_GLM
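The GLM model produced the same error as the gbm model, with missing Accuracy and Kappa metric values. One likely cause is that caret’s glm method fits a binomial model for factor outcomes, which handles only two classes, while we have four.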
Let’s move on to human body systems like vitamin D and melanin, and see how their genes behave in the moderate, severe, and ICU cases compared to the healthy cases by way of our already calculated fold change values. We can use find25genes and the other functions in our geneCards2.R script.
I added the above function to our geneCards2.R sourced script file of functions. This should also reset the folder ‘gene scrapes’ that the supplementary files are in. Re-source the script whenever you don’t need the supplementary files from a previous run and want fresh files.
source('geneCards2.R')
find25genes('vitamin D')
find25genes('melanin')
getProteinGenes('vitamin D')
getProteinGenes('melanin')
vitD <- read.csv( "Top25vitamin-ds.csv")
melanin <- read.csv("Top25melanins.csv")
vitD3 <- vitD[1:3,1:2]
mel3 <- melanin[1:3,1:2]
integumentary <- rbind(vitD3,mel3)
head(integumentary)
## proteinType proteinSearched
## 1 VDR vitamin-d
## 2 CYP27B1 vitamin-d
## 3 PHEX vitamin-d
## 4 TYR melanin
## 5 TYRP1 melanin
## 6 OCA2 melanin
for (i in vitD3$proteinType){
getSummaries2(i,'vitamin D')
}
for (i in mel3$proteinType){
getSummaries2(i,'melanin')
}
getGeneSummaries('vitamin D')
getGeneSummaries('melanin')
vitD3summs <- read.csv("proteinGeneSummaries_vitamin-d.csv")
mel3summs <- read.csv("proteinGeneSummaries_melanin.csv")
vitDmel_summs <- rbind(vitD3summs,mel3summs)
head(vitDmel_summs)
## proteinSearched gene EnsemblID
## 1 vitamin-d VDR ENSG00000111424
## 2 vitamin-d CYP27B1 ENSG00000111012
## 3 vitamin-d PHEX ENSG00000102174
## 4 melanin TYR ENSG00000077498
## 5 melanin TYRP1 ENSG00000107165
## 6 melanin OCA2 ENSG00000104044
## EntrezSummary
## 1 This gene encodes vitamin D3 receptor, which is a member of the nuclear hormone receptor superfamily of ligand-inducible transcription factors. This receptor also functions as a receptor for the secondary bile acid, lithocholic acid. Downstream targets of vitamin D3 receptor are principally involved in mineral metabolism, though this receptor regulates a variety of other metabolic pathways, such as those involved in immune response and cancer. Mutations in this gene are associated with type II vitamin D-resistant rickets. A single nucleotide polymorphism in the initiation codon results in an alternate translation start site three codons downstream. Alternatively spliced transcript variants encoding different isoforms have been described for this gene. A recent study provided evidence for translational readthrough in this gene, and expression of an additional C-terminally extended isoform via the use of an alternative in-frame translation termination codon. [provided by RefSeq, Jun 2018]
## 2 This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins are monooxygenases which catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids. The protein encoded by this gene localizes to the inner mitochondrial membrane where it hydroxylates 25-hydroxyvitamin D3 at the 1alpha position. This reaction synthesizes 1alpha,25-dihydroxyvitamin D3, the active form of vitamin D3, which binds to the vitamin D receptor and regulates calcium metabolism. Thus this enzyme regulates the level of biologically active vitamin D and plays an important role in calcium homeostasis. Mutations in this gene can result in vitamin D-dependent rickets type I. [provided by RefSeq, Jul 2008]
## 3 The protein encoded by this gene is a transmembrane endopeptidase that belongs to the type II integral membrane zinc-dependent endopeptidase family. The protein is thought to be involved in bone and dentin mineralization and renal phosphate reabsorption. Mutations in this gene cause X-linked hypophosphatemic rickets. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Sep 2013]
## 4 The enzyme encoded by this gene catalyzes the first 2 steps, and at least 1 subsequent step, in the conversion of tyrosine to melanin. The enzyme has both tyrosine hydroxylase and dopa oxidase catalytic activities, and requires copper for function. Mutations in this gene result in oculocutaneous albinism, and nonpathologic polymorphisms result in skin pigmentation variation. The human genome contains a pseudogene similar to the 3' half of this gene. [provided by RefSeq, Oct 2008]
## 5 This gene encodes a melanosomal enzyme that belongs to the tyrosinase family and plays an important role in the melanin biosynthetic pathway. Defects in this gene are the cause of rufous oculocutaneous albinism and oculocutaneous albinism type III. [provided by RefSeq, Mar 2009]
## 6 This gene encodes the human homolog of the mouse p (pink-eyed dilution) gene. The encoded protein is believed to be an integral membrane protein involved in small molecule transport, specifically tyrosine, which is a precursor to melanin synthesis. It is involved in mammalian pigmentation, where it may control skin color variation and act as a determinant of brown or blue eye color. Mutations in this gene result in type 2 oculocutaneous albinism. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jul 2014]
## GeneCardsSummary
## 1 VDR (Vitamin D Receptor) is a Protein Coding gene. Diseases associated with VDR include Vitamin D-Dependent Rickets, Type 2A and Rickets. Among its related pathways are Development_Hedgehog and PTH signaling pathways in bone and cartilage development and Tuberculosis. Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and steroid hormone receptor activity. An important paralog of this gene is NR1I2.
## 2 CYP27B1 (Cytochrome P450 Family 27 Subfamily B Member 1) is a Protein Coding gene. Diseases associated with CYP27B1 include Vitamin D Hydroxylation-Deficient Rickets, Type 1A and Hypocalcemic Vitamin D-Dependent Rickets. Among its related pathways are Cytochrome P450 - arranged by substrate type and Tuberculosis. Gene Ontology (GO) annotations related to this gene include iron ion binding and oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen. An important paralog of this gene is CYP27A1.
## 3 PHEX (Phosphate Regulating Endopeptidase Homolog X-Linked) is a Protein Coding gene. Diseases associated with PHEX include Hypophosphatemic Rickets, X-Linked Dominant and Hypophosphatemic Rickets, X-Linked Recessive. Gene Ontology (GO) annotations related to this gene include metalloendopeptidase activity and aminopeptidase activity. An important paralog of this gene is MMEL1.
## 4 TYR (Tyrosinase) is a Protein Coding gene. Diseases associated with TYR include Albinism, Oculocutaneous, Type Ia and Albinism, Oculocutaneous, Type Ib. Among its related pathways are (S)-reticuline biosynthesis and Tyrosine metabolism. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity. An important paralog of this gene is TYRP1.
## 5 TYRP1 (Tyrosinase Related Protein 1) is a Protein Coding gene. Diseases associated with TYRP1 include Albinism, Oculocutaneous, Type Iii and Skin/Hair/Eye Pigmentation, Variation In, 11. Among its related pathways are Aldosterone synthesis and secretion and Viral mRNA Translation. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity. An important paralog of this gene is DCT.
## 6 OCA2 (OCA2 Melanosomal Transmembrane Protein) is a Protein Coding gene. Diseases associated with OCA2 include Albinism, Oculocutaneous, Type Ii and Skin/Hair/Eye Pigmentation, Variation In, 1. Among its related pathways are Viral mRNA Translation and Metabolism. Gene Ontology (GO) annotations related to this gene include transporter activity and L-tyrosine transmembrane transporter activity. An important paralog of this gene is SLC13A2.
## UniProtKB_Summary
## 1 Nuclear receptor for calcitriol, the active form of vitamin D3 which mediates the action of this vitamin on cells (PubMed:28698609, PubMed:16913708, PubMed:15728261, PubMed:10678179). Enters the nucleus upon vitamin D3 binding where it forms heterodimers with the retinoid X receptor/RXR (PubMed:28698609). The VDR-RXR heterodimers bind to specific response elements on DNA and activate the transcription of vitamin D3-responsive target genes (PubMed:28698609). Plays a central role in calcium homeostasis (By similarity).\n VDR_HUMAN,P11473\n
## 2 A cytochrome P450 monooxygenase involved in vitamin D metabolism and in calcium and phosphorus homeostasis. Catalyzes the rate-limiting step in the activation of vitamin D in the kidney, namely the hydroxylation of 25-hydroxyvitamin D3/calcidiol at the C1alpha-position to form the hormonally active form of vitamin D3, 1alpha,25-dihydroxyvitamin D3/calcitriol that acts via the vitamin D receptor (VDR) (PubMed:10518789, PubMed:9486994, PubMed:22862690, PubMed:10566658, PubMed:12050193). Has 1alpha-hydroxylase activity on vitamin D intermediates of the CYP24A1-mediated inactivation pathway (PubMed:10518789, PubMed:22862690). Converts 24R,25-dihydroxyvitamin D3/secalciferol to 1-alpha,24,25-trihydroxyvitamin D3, an active ligand of VDR. Also active on 25-hydroxyvitamin D2 (PubMed:10518789). Mechanistically, uses molecular oxygen inserting one oxygen atom into a substrate, and reducing the second into a water molecule, with two electrons provided by NADPH via FDXR/adrenodoxin reductase and FDX1/adrenodoxin (PubMed:22862690).\n CP27B_HUMAN,O15528\n
## 3 Peptidase that cleaves SIBLING (small integrin-binding ligand, N-linked glycoprotein)-derived ASARM peptides, thus regulating their biological activity (PubMed:9593714, PubMed:15664000, PubMed:18162525, PubMed:18597632). Cleaves ASARM peptides between Ser and Glu or Asp residues (PubMed:18597632). Regulates osteogenic cell differentiation and bone mineralization through the cleavage of the MEPE-derived ASARM peptide (PubMed:18597632). Promotes dentin mineralization and renal phosphate reabsorption by cleaving DMP1- and MEPE-derived ASARM peptides (PubMed:18597632, PubMed:18162525). Inhibits the cleavage of MEPE by CTSB/cathepsin B thus preventing MEPE degradation (PubMed:12220505).\n PHEX_HUMAN,P78562\n
## 4 This is a copper-containing oxidase that functions in the formation of pigments such as melanins and other polyphenolic compounds. Catalyzes the initial and rate limiting step in the cascade of reactions leading to melanin production from tyrosine. In addition to hydroxylating tyrosine to DOPA (3,4-dihydroxyphenylalanine), also catalyzes the oxidation of DOPA to DOPA-quinone, and possibly the oxidation of DHI (5,6-dihydroxyindole) to indole-5,6 quinone.\n TYRO_HUMAN,P14679\n
## 5 Plays a role in melanin biosynthesis (PubMed:22556244, PubMed:16704458). Catalyzes the oxidation of 5,6-dihydroxyindole-2-carboxylic acid (DHICA) into indole-5,6-quinone-2-carboxylic acid in the presence of bound Cu(2+) ions, but not in the presence of Zn(2+) (PubMed:28661582). May regulate or influence the type of melanin synthesized (PubMed:22556244, PubMed:16704458). Also to a lower extent, capable of hydroxylating tyrosine and producing melanin (By similarity).\n TYRP1_HUMAN,P17643\n
## 6 Could be involved in the transport of tyrosine, the precursor to melanin synthesis, within the melanocyte. Regulates the pH of melanosome and the melanosome maturation. One of the components of the mammalian pigmentary system. Seems to regulate the post-translational processing of tyrosinase, which catalyzes the limiting reaction in melanin synthesis. May serve as a key control point at which ethnic skin color variation is determined. Major determinant of brown and/or blue eye color.\n P_HUMAN,Q04671\n
## todaysDate
## 1 Sat Aug 15 18:15:31 2020
## 2 Sat Aug 15 18:15:32 2020
## 3 Sat Aug 15 18:15:33 2020
## 4 Sat Aug 15 18:15:34 2020
## 5 Sat Aug 15 18:15:35 2020
## 6 Sat Aug 15 18:15:36 2020
vitD_melanin <- vitDmel_summs[,c(1:6)]
vitD_melanin
## proteinSearched gene EnsemblID
## 1 vitamin-d VDR ENSG00000111424
## 2 vitamin-d CYP27B1 ENSG00000111012
## 3 vitamin-d PHEX ENSG00000102174
## 4 melanin TYR ENSG00000077498
## 5 melanin TYRP1 ENSG00000107165
## 6 melanin OCA2 ENSG00000104044
## EntrezSummary
## 1 This gene encodes vitamin D3 receptor, which is a member of the nuclear hormone receptor superfamily of ligand-inducible transcription factors. This receptor also functions as a receptor for the secondary bile acid, lithocholic acid. Downstream targets of vitamin D3 receptor are principally involved in mineral metabolism, though this receptor regulates a variety of other metabolic pathways, such as those involved in immune response and cancer. Mutations in this gene are associated with type II vitamin D-resistant rickets. A single nucleotide polymorphism in the initiation codon results in an alternate translation start site three codons downstream. Alternatively spliced transcript variants encoding different isoforms have been described for this gene. A recent study provided evidence for translational readthrough in this gene, and expression of an additional C-terminally extended isoform via the use of an alternative in-frame translation termination codon. [provided by RefSeq, Jun 2018]
## 2 This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins are monooxygenases which catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids. The protein encoded by this gene localizes to the inner mitochondrial membrane where it hydroxylates 25-hydroxyvitamin D3 at the 1alpha position. This reaction synthesizes 1alpha,25-dihydroxyvitamin D3, the active form of vitamin D3, which binds to the vitamin D receptor and regulates calcium metabolism. Thus this enzyme regulates the level of biologically active vitamin D and plays an important role in calcium homeostasis. Mutations in this gene can result in vitamin D-dependent rickets type I. [provided by RefSeq, Jul 2008]
## 3 The protein encoded by this gene is a transmembrane endopeptidase that belongs to the type II integral membrane zinc-dependent endopeptidase family. The protein is thought to be involved in bone and dentin mineralization and renal phosphate reabsorption. Mutations in this gene cause X-linked hypophosphatemic rickets. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Sep 2013]
## 4 The enzyme encoded by this gene catalyzes the first 2 steps, and at least 1 subsequent step, in the conversion of tyrosine to melanin. The enzyme has both tyrosine hydroxylase and dopa oxidase catalytic activities, and requires copper for function. Mutations in this gene result in oculocutaneous albinism, and nonpathologic polymorphisms result in skin pigmentation variation. The human genome contains a pseudogene similar to the 3' half of this gene. [provided by RefSeq, Oct 2008]
## 5 This gene encodes a melanosomal enzyme that belongs to the tyrosinase family and plays an important role in the melanin biosynthetic pathway. Defects in this gene are the cause of rufous oculocutaneous albinism and oculocutaneous albinism type III. [provided by RefSeq, Mar 2009]
## 6 This gene encodes the human homolog of the mouse p (pink-eyed dilution) gene. The encoded protein is believed to be an integral membrane protein involved in small molecule transport, specifically tyrosine, which is a precursor to melanin synthesis. It is involved in mammalian pigmentation, where it may control skin color variation and act as a determinant of brown or blue eye color. Mutations in this gene result in type 2 oculocutaneous albinism. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jul 2014]
## GeneCardsSummary
## 1 VDR (Vitamin D Receptor) is a Protein Coding gene. Diseases associated with VDR include Vitamin D-Dependent Rickets, Type 2A and Rickets. Among its related pathways are Development_Hedgehog and PTH signaling pathways in bone and cartilage development and Tuberculosis. Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and steroid hormone receptor activity. An important paralog of this gene is NR1I2.
## 2 CYP27B1 (Cytochrome P450 Family 27 Subfamily B Member 1) is a Protein Coding gene. Diseases associated with CYP27B1 include Vitamin D Hydroxylation-Deficient Rickets, Type 1A and Hypocalcemic Vitamin D-Dependent Rickets. Among its related pathways are Cytochrome P450 - arranged by substrate type and Tuberculosis. Gene Ontology (GO) annotations related to this gene include iron ion binding and oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen. An important paralog of this gene is CYP27A1.
## 3 PHEX (Phosphate Regulating Endopeptidase Homolog X-Linked) is a Protein Coding gene. Diseases associated with PHEX include Hypophosphatemic Rickets, X-Linked Dominant and Hypophosphatemic Rickets, X-Linked Recessive. Gene Ontology (GO) annotations related to this gene include metalloendopeptidase activity and aminopeptidase activity. An important paralog of this gene is MMEL1.
## 4 TYR (Tyrosinase) is a Protein Coding gene. Diseases associated with TYR include Albinism, Oculocutaneous, Type Ia and Albinism, Oculocutaneous, Type Ib. Among its related pathways are (S)-reticuline biosynthesis and Tyrosine metabolism. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity. An important paralog of this gene is TYRP1.
## 5 TYRP1 (Tyrosinase Related Protein 1) is a Protein Coding gene. Diseases associated with TYRP1 include Albinism, Oculocutaneous, Type Iii and Skin/Hair/Eye Pigmentation, Variation In, 11. Among its related pathways are Aldosterone synthesis and secretion and Viral mRNA Translation. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity. An important paralog of this gene is DCT.
## 6 OCA2 (OCA2 Melanosomal Transmembrane Protein) is a Protein Coding gene. Diseases associated with OCA2 include Albinism, Oculocutaneous, Type Ii and Skin/Hair/Eye Pigmentation, Variation In, 1. Among its related pathways are Viral mRNA Translation and Metabolism. Gene Ontology (GO) annotations related to this gene include transporter activity and L-tyrosine transmembrane transporter activity. An important paralog of this gene is SLC13A2.
## UniProtKB_Summary
## 1 Nuclear receptor for calcitriol, the active form of vitamin D3 which mediates the action of this vitamin on cells (PubMed:28698609, PubMed:16913708, PubMed:15728261, PubMed:10678179). Enters the nucleus upon vitamin D3 binding where it forms heterodimers with the retinoid X receptor/RXR (PubMed:28698609). The VDR-RXR heterodimers bind to specific response elements on DNA and activate the transcription of vitamin D3-responsive target genes (PubMed:28698609). Plays a central role in calcium homeostasis (By similarity).\n VDR_HUMAN,P11473\n
## 2 A cytochrome P450 monooxygenase involved in vitamin D metabolism and in calcium and phosphorus homeostasis. Catalyzes the rate-limiting step in the activation of vitamin D in the kidney, namely the hydroxylation of 25-hydroxyvitamin D3/calcidiol at the C1alpha-position to form the hormonally active form of vitamin D3, 1alpha,25-dihydroxyvitamin D3/calcitriol that acts via the vitamin D receptor (VDR) (PubMed:10518789, PubMed:9486994, PubMed:22862690, PubMed:10566658, PubMed:12050193). Has 1alpha-hydroxylase activity on vitamin D intermediates of the CYP24A1-mediated inactivation pathway (PubMed:10518789, PubMed:22862690). Converts 24R,25-dihydroxyvitamin D3/secalciferol to 1-alpha,24,25-trihydroxyvitamin D3, an active ligand of VDR. Also active on 25-hydroxyvitamin D2 (PubMed:10518789). Mechanistically, uses molecular oxygen inserting one oxygen atom into a substrate, and reducing the second into a water molecule, with two electrons provided by NADPH via FDXR/adrenodoxin reductase and FDX1/adrenodoxin (PubMed:22862690).\n CP27B_HUMAN,O15528\n
## 3 Peptidase that cleaves SIBLING (small integrin-binding ligand, N-linked glycoprotein)-derived ASARM peptides, thus regulating their biological activity (PubMed:9593714, PubMed:15664000, PubMed:18162525, PubMed:18597632). Cleaves ASARM peptides between Ser and Glu or Asp residues (PubMed:18597632). Regulates osteogenic cell differentiation and bone mineralization through the cleavage of the MEPE-derived ASARM peptide (PubMed:18597632). Promotes dentin mineralization and renal phosphate reabsorption by cleaving DMP1- and MEPE-derived ASARM peptides (PubMed:18597632, PubMed:18162525). Inhibits the cleavage of MEPE by CTSB/cathepsin B thus preventing MEPE degradation (PubMed:12220505).\n PHEX_HUMAN,P78562\n
## 4 This is a copper-containing oxidase that functions in the formation of pigments such as melanins and other polyphenolic compounds. Catalyzes the initial and rate limiting step in the cascade of reactions leading to melanin production from tyrosine. In addition to hydroxylating tyrosine to DOPA (3,4-dihydroxyphenylalanine), also catalyzes the oxidation of DOPA to DOPA-quinone, and possibly the oxidation of DHI (5,6-dihydroxyindole) to indole-5,6 quinone.\n TYRO_HUMAN,P14679\n
## 5 Plays a role in melanin biosynthesis (PubMed:22556244, PubMed:16704458). Catalyzes the oxidation of 5,6-dihydroxyindole-2-carboxylic acid (DHICA) into indole-5,6-quinone-2-carboxylic acid in the presence of bound Cu(2+) ions, but not in the presence of Zn(2+) (PubMed:28661582). May regulate or influence the type of melanin synthesized (PubMed:22556244, PubMed:16704458). Also to a lower extent, capable of hydroxylating tyrosine and producing melanin (By similarity).\n TYRP1_HUMAN,P17643\n
## 6 Could be involved in the transport of tyrosine, the precursor to melanin synthesis, within the melanocyte. Regulates the pH of melanosome and the melanosome maturation. One of the components of the mammalian pigmentary system. Seems to regulate the post-translational processing of tyrosinase, which catalyzes the limiting reaction in melanin synthesis. May serve as a key control point at which ethnic skin color variation is determined. Major determinant of brown and/or blue eye color.\n P_HUMAN,Q04671\n
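# Load the per-sample values and fold changes, then attach them to the
# vitamin D and melanin gene annotations by Ensembl ID
origFCs <- read.csv('DATA_FCs_GSE152418.csv', sep=',',
                    na.strings=c('',' ','NA'),
                    stringsAsFactors = FALSE)
sun <- merge(vitD_melanin, origFCs, by.x='EnsemblID',
             by.y='ENSEMBLID')
sun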
## EnsemblID proteinSearched gene
## 1 ENSG00000077498 melanin TYR
## 2 ENSG00000102174 vitamin-d PHEX
## 3 ENSG00000104044 melanin OCA2
## 4 ENSG00000107165 melanin TYRP1
## 5 ENSG00000111012 vitamin-d CYP27B1
## 6 ENSG00000111424 vitamin-d VDR
## EntrezSummary
## 1 The enzyme encoded by this gene catalyzes the first 2 steps, and at least 1 subsequent step, in the conversion of tyrosine to melanin. The enzyme has both tyrosine hydroxylase and dopa oxidase catalytic activities, and requires copper for function. Mutations in this gene result in oculocutaneous albinism, and nonpathologic polymorphisms result in skin pigmentation variation. The human genome contains a pseudogene similar to the 3' half of this gene. [provided by RefSeq, Oct 2008]
## 2 The protein encoded by this gene is a transmembrane endopeptidase that belongs to the type II integral membrane zinc-dependent endopeptidase family. The protein is thought to be involved in bone and dentin mineralization and renal phosphate reabsorption. Mutations in this gene cause X-linked hypophosphatemic rickets. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Sep 2013]
## 3 This gene encodes the human homolog of the mouse p (pink-eyed dilution) gene. The encoded protein is believed to be an integral membrane protein involved in small molecule transport, specifically tyrosine, which is a precursor to melanin synthesis. It is involved in mammalian pigmentation, where it may control skin color variation and act as a determinant of brown or blue eye color. Mutations in this gene result in type 2 oculocutaneous albinism. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jul 2014]
## 4 This gene encodes a melanosomal enzyme that belongs to the tyrosinase family and plays an important role in the melanin biosynthetic pathway. Defects in this gene are the cause of rufous oculocutaneous albinism and oculocutaneous albinism type III. [provided by RefSeq, Mar 2009]
## 5 This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins are monooxygenases which catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids. The protein encoded by this gene localizes to the inner mitochondrial membrane where it hydroxylates 25-hydroxyvitamin D3 at the 1alpha position. This reaction synthesizes 1alpha,25-dihydroxyvitamin D3, the active form of vitamin D3, which binds to the vitamin D receptor and regulates calcium metabolism. Thus this enzyme regulates the level of biologically active vitamin D and plays an important role in calcium homeostasis. Mutations in this gene can result in vitamin D-dependent rickets type I. [provided by RefSeq, Jul 2008]
## 6 This gene encodes vitamin D3 receptor, which is a member of the nuclear hormone receptor superfamily of ligand-inducible transcription factors. This receptor also functions as a receptor for the secondary bile acid, lithocholic acid. Downstream targets of vitamin D3 receptor are principally involved in mineral metabolism, though this receptor regulates a variety of other metabolic pathways, such as those involved in immune response and cancer. Mutations in this gene are associated with type II vitamin D-resistant rickets. A single nucleotide polymorphism in the initiation codon results in an alternate translation start site three codons downstream. Alternatively spliced transcript variants encoding different isoforms have been described for this gene. A recent study provided evidence for translational readthrough in this gene, and expression of an additional C-terminally extended isoform via the use of an alternative in-frame translation termination codon. [provided by RefSeq, Jun 2018]
## GeneCardsSummary
## 1 TYR (Tyrosinase) is a Protein Coding gene. Diseases associated with TYR include Albinism, Oculocutaneous, Type Ia and Albinism, Oculocutaneous, Type Ib. Among its related pathways are (S)-reticuline biosynthesis and Tyrosine metabolism. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity. An important paralog of this gene is TYRP1.
## 2 PHEX (Phosphate Regulating Endopeptidase Homolog X-Linked) is a Protein Coding gene. Diseases associated with PHEX include Hypophosphatemic Rickets, X-Linked Dominant and Hypophosphatemic Rickets, X-Linked Recessive. Gene Ontology (GO) annotations related to this gene include metalloendopeptidase activity and aminopeptidase activity. An important paralog of this gene is MMEL1.
## 3 OCA2 (OCA2 Melanosomal Transmembrane Protein) is a Protein Coding gene. Diseases associated with OCA2 include Albinism, Oculocutaneous, Type Ii and Skin/Hair/Eye Pigmentation, Variation In, 1. Among its related pathways are Viral mRNA Translation and Metabolism. Gene Ontology (GO) annotations related to this gene include transporter activity and L-tyrosine transmembrane transporter activity. An important paralog of this gene is SLC13A2.
## 4 TYRP1 (Tyrosinase Related Protein 1) is a Protein Coding gene. Diseases associated with TYRP1 include Albinism, Oculocutaneous, Type Iii and Skin/Hair/Eye Pigmentation, Variation In, 11. Among its related pathways are Aldosterone synthesis and secretion and Viral mRNA Translation. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity. An important paralog of this gene is DCT.
## 5 CYP27B1 (Cytochrome P450 Family 27 Subfamily B Member 1) is a Protein Coding gene. Diseases associated with CYP27B1 include Vitamin D Hydroxylation-Deficient Rickets, Type 1A and Hypocalcemic Vitamin D-Dependent Rickets. Among its related pathways are Cytochrome P450 - arranged by substrate type and Tuberculosis. Gene Ontology (GO) annotations related to this gene include iron ion binding and oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen. An important paralog of this gene is CYP27A1.
## 6 VDR (Vitamin D Receptor) is a Protein Coding gene. Diseases associated with VDR include Vitamin D-Dependent Rickets, Type 2A and Rickets. Among its related pathways are Development_Hedgehog and PTH signaling pathways in bone and cartilage development and Tuberculosis. Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and steroid hormone receptor activity. An important paralog of this gene is NR1I2.
## UniProtKB_Summary
## 1 This is a copper-containing oxidase that functions in the formation of pigments such as melanins and other polyphenolic compounds. Catalyzes the initial and rate limiting step in the cascade of reactions leading to melanin production from tyrosine. In addition to hydroxylating tyrosine to DOPA (3,4-dihydroxyphenylalanine), also catalyzes the oxidation of DOPA to DOPA-quinone, and possibly the oxidation of DHI (5,6-dihydroxyindole) to indole-5,6 quinone.\n TYRO_HUMAN,P14679\n
## 2 Peptidase that cleaves SIBLING (small integrin-binding ligand, N-linked glycoprotein)-derived ASARM peptides, thus regulating their biological activity (PubMed:9593714, PubMed:15664000, PubMed:18162525, PubMed:18597632). Cleaves ASARM peptides between Ser and Glu or Asp residues (PubMed:18597632). Regulates osteogenic cell differentiation and bone mineralization through the cleavage of the MEPE-derived ASARM peptide (PubMed:18597632). Promotes dentin mineralization and renal phosphate reabsorption by cleaving DMP1- and MEPE-derived ASARM peptides (PubMed:18597632, PubMed:18162525). Inhibits the cleavage of MEPE by CTSB/cathepsin B thus preventing MEPE degradation (PubMed:12220505).\n PHEX_HUMAN,P78562\n
## 3 Could be involved in the transport of tyrosine, the precursor to melanin synthesis, within the melanocyte. Regulates the pH of melanosome and the melanosome maturation. One of the components of the mammalian pigmentary system. Seems to regulate the post-translational processing of tyrosinase, which catalyzes the limiting reaction in melanin synthesis. May serve as a key control point at which ethnic skin color variation is determined. Major determinant of brown and/or blue eye color.\n P_HUMAN,Q04671\n
## 4 Plays a role in melanin biosynthesis (PubMed:22556244, PubMed:16704458). Catalyzes the oxidation of 5,6-dihydroxyindole-2-carboxylic acid (DHICA) into indole-5,6-quinone-2-carboxylic acid in the presence of bound Cu(2+) ions, but not in the presence of Zn(2+) (PubMed:28661582). May regulate or influence the type of melanin synthesized (PubMed:22556244, PubMed:16704458). Also to a lower extent, capable of hydroxylating tyrosine and producing melanin (By similarity).\n TYRP1_HUMAN,P17643\n
## 5 A cytochrome P450 monooxygenase involved in vitamin D metabolism and in calcium and phosphorus homeostasis. Catalyzes the rate-limiting step in the activation of vitamin D in the kidney, namely the hydroxylation of 25-hydroxyvitamin D3/calcidiol at the C1alpha-position to form the hormonally active form of vitamin D3, 1alpha,25-dihydroxyvitamin D3/calcitriol that acts via the vitamin D receptor (VDR) (PubMed:10518789, PubMed:9486994, PubMed:22862690, PubMed:10566658, PubMed:12050193). Has 1alpha-hydroxylase activity on vitamin D intermediates of the CYP24A1-mediated inactivation pathway (PubMed:10518789, PubMed:22862690). Converts 24R,25-dihydroxyvitamin D3/secalciferol to 1-alpha,24,25-trihydroxyvitamin D3, an active ligand of VDR. Also active on 25-hydroxyvitamin D2 (PubMed:10518789). Mechanistically, uses molecular oxygen inserting one oxygen atom into a substrate, and reducing the second into a water molecule, with two electrons provided by NADPH via FDXR/adrenodoxin reductase and FDX1/adrenodoxin (PubMed:22862690).\n CP27B_HUMAN,O15528\n
## 6 Nuclear receptor for calcitriol, the active form of vitamin D3 which mediates the action of this vitamin on cells (PubMed:28698609, PubMed:16913708, PubMed:15728261, PubMed:10678179). Enters the nucleus upon vitamin D3 binding where it forms heterodimers with the retinoid X receptor/RXR (PubMed:28698609). The VDR-RXR heterodimers bind to specific response elements on DNA and activate the transcription of vitamin D3-responsive target genes (PubMed:28698609). Plays a central role in calcium homeostasis (By similarity).\n VDR_HUMAN,P11473\n
## convalescent healthy1 healthy2 healthy3 healthy4 healthy5 healthy6 healthy7
## 1 0 0 0 0 0 0 0 0
## 2 24 17 25 27 9 44 88 15
## 3 2 0 3 1 0 2 0 2
## 4 0 0 0 0 1 0 0 0
## 5 11 13 16 2 9 14 11 4
## 6 539 592 437 303 365 658 846 471
## healthy8 healthy9 healthy10 healthy11 healthy12 healthy13 healthy14 healthy15
## 1 0 0 0 0 0 0 0 2
## 2 29 38 23 248 10 10 43 40
## 3 4 3 2 0 3 3 2 1
## 4 0 0 0 0 0 0 1 0
## 5 1 8 5 9 4 5 11 12
## 6 502 422 458 488 373 361 334 368
## healthy16 healthy17 moderate1 moderate2 moderate3 moderate4 severe1 severe2
## 1 0 0 0 0 0 0 1 0
## 2 5 25 14 6 8 10 11 6
## 3 4 4 1 0 4 1 4 2
## 4 0 0 0 1 1 2 1 1
## 5 4 6 5 6 12 15 3 19
## 6 318 330 188 421 469 360 415 400
## severe3 severe4 severe5 severe6 severe7 severe8 ICU1 ICU2 ICU3 ICU4
## 1 0 0 0 1 0 1 0 2 0 0
## 2 2 3 47 12 7 8 13 9 30 9
## 3 1 1 0 2 2 27 0 5 3 1
## 4 0 0 0 1 1 0 0 1 4 0
## 5 13 0 4 7 9 27 11 15 1 7
## 6 335 623 670 493 398 376 708 639 701 424
## healthy_mean moderate_mean severe_mean ICU_mean mod_health_foldChange
## 1 0.1176471 0.0 0.375 0.50 0.8823529
## 2 40.9411800 9.5 12.000 15.25 0.2320402
## 3 2.0000000 1.5 4.875 2.25 0.7500000
## 4 0.1176471 1.0 0.500 1.25 8.5000000
## 5 7.8823530 9.5 10.250 8.50 1.2052240
## 6 448.5882000 359.5 463.750 618.00 0.8014031
## sevr_health_foldChange ICU_health_foldChange
## 1 3.1875000 4.2500000
## 2 0.2931034 0.3724856
## 3 2.4375000 1.1250000
## 4 4.2500000 10.6250000
## 5 1.3003730 1.0783580
## 6 1.0337990 1.3776550
Great, let's write this out to CSV and use it to put together some quick Tableau charts.
write.csv(sun,'sunGenes.csv',row.names=F)
Each of the following images is from the same chart comparing sun gene values across the COVID-19 cases, and VDR does increase with the severity of the case. The top gene was successively excluded within Tableau so that the lower-valued genes become visible, and hovering gives the gene name and summary as well as the labeled fold change values (the disease mean divided by the healthy mean) for each respective disease class.
image 1 sun Genes with VDR at the top
image 2 sun Genes with VDR excluded and PHEX as the highest healthy mean value showing
image 3 sun Genes with PHEX and VDR excluded
image 4 sun Genes with PHEX, VDR, and CYP27B1 excluded
Let's look at the data in another way. I want to gather the sample columns and combine them with the demographic information. But first, note that two of the top 3 vitamin D genes ranked by genecards.org, PHEX and VDR, increase in fold change from moderate to severe to ICU cases of COVID-19, as does one melanin gene, TYR. PHEX stays below 1 in all three classes, so it rises with severity while remaining lower than the healthy mean.
sun[,c(2:4,45:47)]
## proteinSearched gene
## 1 melanin TYR
## 2 vitamin-d PHEX
## 3 melanin OCA2
## 4 melanin TYRP1
## 5 vitamin-d CYP27B1
## 6 vitamin-d VDR
## EntrezSummary
## 1 The enzyme encoded by this gene catalyzes the first 2 steps, and at least 1 subsequent step, in the conversion of tyrosine to melanin. The enzyme has both tyrosine hydroxylase and dopa oxidase catalytic activities, and requires copper for function. Mutations in this gene result in oculocutaneous albinism, and nonpathologic polymorphisms result in skin pigmentation variation. The human genome contains a pseudogene similar to the 3' half of this gene. [provided by RefSeq, Oct 2008]
## 2 The protein encoded by this gene is a transmembrane endopeptidase that belongs to the type II integral membrane zinc-dependent endopeptidase family. The protein is thought to be involved in bone and dentin mineralization and renal phosphate reabsorption. Mutations in this gene cause X-linked hypophosphatemic rickets. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Sep 2013]
## 3 This gene encodes the human homolog of the mouse p (pink-eyed dilution) gene. The encoded protein is believed to be an integral membrane protein involved in small molecule transport, specifically tyrosine, which is a precursor to melanin synthesis. It is involved in mammalian pigmentation, where it may control skin color variation and act as a determinant of brown or blue eye color. Mutations in this gene result in type 2 oculocutaneous albinism. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jul 2014]
## 4 This gene encodes a melanosomal enzyme that belongs to the tyrosinase family and plays an important role in the melanin biosynthetic pathway. Defects in this gene are the cause of rufous oculocutaneous albinism and oculocutaneous albinism type III. [provided by RefSeq, Mar 2009]
## 5 This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins are monooxygenases which catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids. The protein encoded by this gene localizes to the inner mitochondrial membrane where it hydroxylates 25-hydroxyvitamin D3 at the 1alpha position. This reaction synthesizes 1alpha,25-dihydroxyvitamin D3, the active form of vitamin D3, which binds to the vitamin D receptor and regulates calcium metabolism. Thus this enzyme regulates the level of biologically active vitamin D and plays an important role in calcium homeostasis. Mutations in this gene can result in vitamin D-dependent rickets type I. [provided by RefSeq, Jul 2008]
## 6 This gene encodes vitamin D3 receptor, which is a member of the nuclear hormone receptor superfamily of ligand-inducible transcription factors. This receptor also functions as a receptor for the secondary bile acid, lithocholic acid. Downstream targets of vitamin D3 receptor are principally involved in mineral metabolism, though this receptor regulates a variety of other metabolic pathways, such as those involved in immune response and cancer. Mutations in this gene are associated with type II vitamin D-resistant rickets. A single nucleotide polymorphism in the initiation codon results in an alternate translation start site three codons downstream. Alternatively spliced transcript variants encoding different isoforms have been described for this gene. A recent study provided evidence for translational readthrough in this gene, and expression of an additional C-terminally extended isoform via the use of an alternative in-frame translation termination codon. [provided by RefSeq, Jun 2018]
## mod_health_foldChange sevr_health_foldChange ICU_health_foldChange
## 1 0.8823529 3.1875000 4.2500000
## 2 0.2320402 0.2931034 0.3724856
## 3 0.7500000 2.4375000 1.1250000
## 4 8.5000000 4.2500000 10.6250000
## 5 1.2052240 1.3003730 1.0783580
## 6 0.8014031 1.0337990 1.3776550
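As a quick check of the statement about PHEX, VDR, and TYR, the sketch below re-enters the six fold change rows printed above into a small data frame and flags the genes whose fold change rises at every step from moderate to severe to ICU. The data frame name `fcCheck` is hypothetical, and the values are copied by hand from the printed table rather than read from the CSV.

```r
# Fold changes copied by hand from the printed output above (illustrative only)
fcCheck <- data.frame(
  gene = c('TYR', 'PHEX', 'OCA2', 'TYRP1', 'CYP27B1', 'VDR'),
  mod_health_foldChange  = c(0.8823529, 0.2320402, 0.7500000, 8.5000000, 1.2052240, 0.8014031),
  sevr_health_foldChange = c(3.1875000, 0.2931034, 2.4375000, 4.2500000, 1.3003730, 1.0337990),
  ICU_health_foldChange  = c(4.2500000, 0.3724856, 1.1250000, 10.6250000, 1.0783580, 1.3776550),
  stringsAsFactors = FALSE
)

# TRUE when the fold change rises at each step: moderate < severe < ICU
fcCheck$monotonicIncrease <-
  fcCheck$mod_health_foldChange < fcCheck$sevr_health_foldChange &
  fcCheck$sevr_health_foldChange < fcCheck$ICU_health_foldChange

# Genes that increase with severity
fcCheck$gene[fcCheck$monotonicIncrease]
```

With these six rows the flagged genes are TYR, PHEX, and VDR, which agrees with the text above.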
library(tidyr)
# Reshape the 34 sample columns (7 to 40) into sampleType/geneValue pairs
sunGathered <- gather(sun, key='sampleType',value='geneValue',7:40)
head(sunGathered)
## EnsemblID proteinSearched gene
## 1 ENSG00000077498 melanin TYR
## 2 ENSG00000102174 vitamin-d PHEX
## 3 ENSG00000104044 melanin OCA2
## 4 ENSG00000107165 melanin TYRP1
## 5 ENSG00000111012 vitamin-d CYP27B1
## 6 ENSG00000111424 vitamin-d VDR
## EntrezSummary
## 1 The enzyme encoded by this gene catalyzes the first 2 steps, and at least 1 subsequent step, in the conversion of tyrosine to melanin. The enzyme has both tyrosine hydroxylase and dopa oxidase catalytic activities, and requires copper for function. Mutations in this gene result in oculocutaneous albinism, and nonpathologic polymorphisms result in skin pigmentation variation. The human genome contains a pseudogene similar to the 3' half of this gene. [provided by RefSeq, Oct 2008]
## 2 The protein encoded by this gene is a transmembrane endopeptidase that belongs to the type II integral membrane zinc-dependent endopeptidase family. The protein is thought to be involved in bone and dentin mineralization and renal phosphate reabsorption. Mutations in this gene cause X-linked hypophosphatemic rickets. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Sep 2013]
## 3 This gene encodes the human homolog of the mouse p (pink-eyed dilution) gene. The encoded protein is believed to be an integral membrane protein involved in small molecule transport, specifically tyrosine, which is a precursor to melanin synthesis. It is involved in mammalian pigmentation, where it may control skin color variation and act as a determinant of brown or blue eye color. Mutations in this gene result in type 2 oculocutaneous albinism. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jul 2014]
## 4 This gene encodes a melanosomal enzyme that belongs to the tyrosinase family and plays an important role in the melanin biosynthetic pathway. Defects in this gene are the cause of rufous oculocutaneous albinism and oculocutaneous albinism type III. [provided by RefSeq, Mar 2009]
## 5 This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins are monooxygenases which catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids. The protein encoded by this gene localizes to the inner mitochondrial membrane where it hydroxylates 25-hydroxyvitamin D3 at the 1alpha position. This reaction synthesizes 1alpha,25-dihydroxyvitamin D3, the active form of vitamin D3, which binds to the vitamin D receptor and regulates calcium metabolism. Thus this enzyme regulates the level of biologically active vitamin D and plays an important role in calcium homeostasis. Mutations in this gene can result in vitamin D-dependent rickets type I. [provided by RefSeq, Jul 2008]
## 6 This gene encodes vitamin D3 receptor, which is a member of the nuclear hormone receptor superfamily of ligand-inducible transcription factors. This receptor also functions as a receptor for the secondary bile acid, lithocholic acid. Downstream targets of vitamin D3 receptor are principally involved in mineral metabolism, though this receptor regulates a variety of other metabolic pathways, such as those involved in immune response and cancer. Mutations in this gene are associated with type II vitamin D-resistant rickets. A single nucleotide polymorphism in the initiation codon results in an alternate translation start site three codons downstream. Alternatively spliced transcript variants encoding different isoforms have been described for this gene. A recent study provided evidence for translational readthrough in this gene, and expression of an additional C-terminally extended isoform via the use of an alternative in-frame translation termination codon. [provided by RefSeq, Jun 2018]
## GeneCardsSummary
## 1 TYR (Tyrosinase) is a Protein Coding gene. Diseases associated with TYR include Albinism, Oculocutaneous, Type Ia and Albinism, Oculocutaneous, Type Ib. Among its related pathways are (S)-reticuline biosynthesis and Tyrosine metabolism. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity. An important paralog of this gene is TYRP1.
## 2 PHEX (Phosphate Regulating Endopeptidase Homolog X-Linked) is a Protein Coding gene. Diseases associated with PHEX include Hypophosphatemic Rickets, X-Linked Dominant and Hypophosphatemic Rickets, X-Linked Recessive. Gene Ontology (GO) annotations related to this gene include metalloendopeptidase activity and aminopeptidase activity. An important paralog of this gene is MMEL1.
## 3 OCA2 (OCA2 Melanosomal Transmembrane Protein) is a Protein Coding gene. Diseases associated with OCA2 include Albinism, Oculocutaneous, Type Ii and Skin/Hair/Eye Pigmentation, Variation In, 1. Among its related pathways are Viral mRNA Translation and Metabolism. Gene Ontology (GO) annotations related to this gene include transporter activity and L-tyrosine transmembrane transporter activity. An important paralog of this gene is SLC13A2.
## 4 TYRP1 (Tyrosinase Related Protein 1) is a Protein Coding gene. Diseases associated with TYRP1 include Albinism, Oculocutaneous, Type Iii and Skin/Hair/Eye Pigmentation, Variation In, 11. Among its related pathways are Aldosterone synthesis and secretion and Viral mRNA Translation. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity. An important paralog of this gene is DCT.
## 5 CYP27B1 (Cytochrome P450 Family 27 Subfamily B Member 1) is a Protein Coding gene. Diseases associated with CYP27B1 include Vitamin D Hydroxylation-Deficient Rickets, Type 1A and Hypocalcemic Vitamin D-Dependent Rickets. Among its related pathways are Cytochrome P450 - arranged by substrate type and Tuberculosis. Gene Ontology (GO) annotations related to this gene include iron ion binding and oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen. An important paralog of this gene is CYP27A1.
## 6 VDR (Vitamin D Receptor) is a Protein Coding gene. Diseases associated with VDR include Vitamin D-Dependent Rickets, Type 2A and Rickets. Among its related pathways are Development_Hedgehog and PTH signaling pathways in bone and cartilage development and Tuberculosis. Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and steroid hormone receptor activity. An important paralog of this gene is NR1I2.
## UniProtKB_Summary
## 1 This is a copper-containing oxidase that functions in the formation of pigments such as melanins and other polyphenolic compounds. Catalyzes the initial and rate limiting step in the cascade of reactions leading to melanin production from tyrosine. In addition to hydroxylating tyrosine to DOPA (3,4-dihydroxyphenylalanine), also catalyzes the oxidation of DOPA to DOPA-quinone, and possibly the oxidation of DHI (5,6-dihydroxyindole) to indole-5,6 quinone.\n TYRO_HUMAN,P14679\n
## 2 Peptidase that cleaves SIBLING (small integrin-binding ligand, N-linked glycoprotein)-derived ASARM peptides, thus regulating their biological activity (PubMed:9593714, PubMed:15664000, PubMed:18162525, PubMed:18597632). Cleaves ASARM peptides between Ser and Glu or Asp residues (PubMed:18597632). Regulates osteogenic cell differentiation and bone mineralization through the cleavage of the MEPE-derived ASARM peptide (PubMed:18597632). Promotes dentin mineralization and renal phosphate reabsorption by cleaving DMP1- and MEPE-derived ASARM peptides (PubMed:18597632, PubMed:18162525). Inhibits the cleavage of MEPE by CTSB/cathepsin B thus preventing MEPE degradation (PubMed:12220505).\n PHEX_HUMAN,P78562\n
## 3 Could be involved in the transport of tyrosine, the precursor to melanin synthesis, within the melanocyte. Regulates the pH of melanosome and the melanosome maturation. One of the components of the mammalian pigmentary system. Seems to regulate the post-translational processing of tyrosinase, which catalyzes the limiting reaction in melanin synthesis. May serve as a key control point at which ethnic skin color variation is determined. Major determinant of brown and/or blue eye color.\n P_HUMAN,Q04671\n
## 4 Plays a role in melanin biosynthesis (PubMed:22556244, PubMed:16704458). Catalyzes the oxidation of 5,6-dihydroxyindole-2-carboxylic acid (DHICA) into indole-5,6-quinone-2-carboxylic acid in the presence of bound Cu(2+) ions, but not in the presence of Zn(2+) (PubMed:28661582). May regulate or influence the type of melanin synthesized (PubMed:22556244, PubMed:16704458). Also to a lower extent, capable of hydroxylating tyrosine and producing melanin (By similarity).\n TYRP1_HUMAN,P17643\n
## 5 A cytochrome P450 monooxygenase involved in vitamin D metabolism and in calcium and phosphorus homeostasis. Catalyzes the rate-limiting step in the activation of vitamin D in the kidney, namely the hydroxylation of 25-hydroxyvitamin D3/calcidiol at the C1alpha-position to form the hormonally active form of vitamin D3, 1alpha,25-dihydroxyvitamin D3/calcitriol that acts via the vitamin D receptor (VDR) (PubMed:10518789, PubMed:9486994, PubMed:22862690, PubMed:10566658, PubMed:12050193). Has 1alpha-hydroxylase activity on vitamin D intermediates of the CYP24A1-mediated inactivation pathway (PubMed:10518789, PubMed:22862690). Converts 24R,25-dihydroxyvitamin D3/secalciferol to 1-alpha,24,25-trihydroxyvitamin D3, an active ligand of VDR. Also active on 25-hydroxyvitamin D2 (PubMed:10518789). Mechanistically, uses molecular oxygen inserting one oxygen atom into a substrate, and reducing the second into a water molecule, with two electrons provided by NADPH via FDXR/adrenodoxin reductase and FDX1/adrenodoxin (PubMed:22862690).\n CP27B_HUMAN,O15528\n
## 6 Nuclear receptor for calcitriol, the active form of vitamin D3 which mediates the action of this vitamin on cells (PubMed:28698609, PubMed:16913708, PubMed:15728261, PubMed:10678179). Enters the nucleus upon vitamin D3 binding where it forms heterodimers with the retinoid X receptor/RXR (PubMed:28698609). The VDR-RXR heterodimers bind to specific response elements on DNA and activate the transcription of vitamin D3-responsive target genes (PubMed:28698609). Plays a central role in calcium homeostasis (By similarity).\n VDR_HUMAN,P11473\n
## healthy_mean moderate_mean severe_mean ICU_mean mod_health_foldChange
## 1 0.1176471 0.0 0.375 0.50 0.8823529
## 2 40.9411800 9.5 12.000 15.25 0.2320402
## 3 2.0000000 1.5 4.875 2.25 0.7500000
## 4 0.1176471 1.0 0.500 1.25 8.5000000
## 5 7.8823530 9.5 10.250 8.50 1.2052240
## 6 448.5882000 359.5 463.750 618.00 0.8014031
## sevr_health_foldChange ICU_health_foldChange sampleType geneValue
## 1 3.1875000 4.2500000 convalescent 0
## 2 0.2931034 0.3724856 convalescent 24
## 3 2.4375000 1.1250000 convalescent 2
## 4 4.2500000 10.6250000 convalescent 0
## 5 1.3003730 1.0783580 convalescent 11
## 6 1.0337990 1.3776550 convalescent 539
# Join the sample header information to the long-format gene values by sample name
sunData <- merge(HeaderInformation,sunGathered,by.x='CN_new',
                 by.y='sampleType')
write.csv(sunData,'sunData.csv',row.names=FALSE)
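The reshape and join above can be sketched on a toy table. This is only an illustration under assumptions: the `toy` and `toyHeader` data frames and the `age` and `gender` columns are hypothetical stand-ins (the columns of `HeaderInformation` other than `CN_new` are not shown in this section), the demographic values are made up, and the gene values are a few numbers copied from the printed output. `pivot_longer()` is shown next to `gather()` because it is the newer tidyr function for the same wide-to-long step.

```r
library(tidyr)

# Toy wide table: one row per gene, one column per sample
# (values copied from the printed output for PHEX and VDR)
toy <- data.frame(gene = c('PHEX', 'VDR'),
                  convalescent = c(24, 539),
                  healthy1 = c(17, 592),
                  moderate1 = c(14, 188),
                  stringsAsFactors = FALSE)

# Wide to long: the sample column names become values of 'sampleType'
toyLong <- gather(toy, key = 'sampleType', value = 'geneValue', 2:4)

# The same reshape with the newer tidyr interface (row order differs)
toyLong2 <- pivot_longer(toy, cols = 2:4,
                         names_to = 'sampleType', values_to = 'geneValue')

# Hypothetical demographic table keyed by sample name (made-up values)
toyHeader <- data.frame(CN_new = c('convalescent', 'healthy1', 'moderate1'),
                        age = c(50, 40, 60),
                        gender = c('F', 'M', 'F'),
                        stringsAsFactors = FALSE)

# Each long row picks up the demographics of its sample
toyData <- merge(toyHeader, toyLong, by.x = 'CN_new', by.y = 'sampleType')
toyData
```

The long format gives one row per gene and sample, which is the shape Tableau needs to plot gene value against age with gender as a color dimension.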
An interesting chart showing the raw gene expression values for the three melanin and three vitamin D genes top ranked by genecards.org was created and shared on Tableau Public.
We can see that some genes decrease with age and others increase, and using gender as the color dimension also shows differences in gene expression by gender. Take a look at the Tableau link above and at the images of this chart below.
Demographics of Sun Genes
In the first image above, there is one outlier for the melanin gene OCA2: a 76-year-old patient labeled severe8 has much more of this gene expressed (27) than the other severe cases (0 to 4). But this gene's values tend to fluctuate on a case-by-case basis across the severe samples.
Demographics of Sun Genes with Entrez gene summary
Demographics of Sun Genes Female outlier Vitamin D
Selecting a scatter point displays the hover details, which give the severity of the case and the corresponding sample in that group. The last image shows that CYP27B1 increases with age for both females and males, except for a female in her 60s identified as severe4 in the raw values of this gene. A 54-year-old male labeled severe1 also does not follow the increase with age. Otherwise, CYP27B1 increases with age in both genders within the severe and ICU cases, while decreasing with age in the moderate cases. In healthy cases there is no recognizable pattern, as it varies with age and gender.
Let's not stop at the sun genes. As a massage therapist with more than 14 years of experience who recently studied for and passed the MBLEx (Massage and Bodywork Licensing Examination), I can tell you there are many fascinating aspects of the body systems, and of mineral as well as vitamin dependencies, that lead to disease in some people. Relearning the endocrine system and the hormones related to the pineal, hypothalamus, pituitary, adrenals, thyroid, and pancreas made it clear that many other vitamins, steroids, and hormones should be looked at in studying these different cases of COVID-19.
We will look at Vitamin C, which is thought to work alongside Vitamin D in calcium and bone health; glucagon, which raises glucose in the blood, and insulin, which lowers it, both hormones of the pancreas; dopamine, low levels of which are linked to Parkinson's disease; melatonin, which regulates sleep and is produced by the pineal gland near the pituitary and hypothalamus in the brain; estrogen, prolactin, and progesterone, which are regulated through the pituitary gland in females; testosterone, produced mainly in the testes in males; and corticosteroids and adrenaline, released by the adrenals during the sympathetic response to danger. We will also look at the supplements people are commonly told to take in addition to Vitamin C and Vitamin D, such as fish oil or omega 3s, vitamin B12, zinc, and the mineral magnesium, as well as calcitonin, a thyroid hormone that lowers the level of calcium in the blood.
Let's use the data we created earlier, in which the fold change values were modified wherever the division gave Inf, NaN, or 0, that is for z/0, 0/0, and 0/z, where z is any nonzero number. In those respective cases the value was replaced with the disease mean, 1, or the healthy mean.
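The replacement rule just described can be written as a small function. This is a minimal sketch of the rule as stated in the text, not the earlier script itself (which is not shown in this section); the function name `adjustedFoldChange` and its arguments are hypothetical, and it assumes the means are non-negative with no missing values.

```r
# Hypothetical helper: fold change of disease mean over healthy mean,
# with the three special cases replaced as described in the text
adjustedFoldChange <- function(diseaseMean, healthyMean) {
  fc <- diseaseMean / healthyMean

  zOverZero    <- diseaseMean != 0 & healthyMean == 0  # z/0 gives Inf
  zeroOverZero <- diseaseMean == 0 & healthyMean == 0  # 0/0 gives NaN
  zeroOverZ    <- diseaseMean == 0 & healthyMean != 0  # 0/z gives 0

  fc[zOverZero]    <- diseaseMean[zOverZero]   # Inf becomes the disease mean
  fc[zeroOverZero] <- 1                        # NaN becomes 1
  fc[zeroOverZ]    <- healthyMean[zeroOverZ]   # 0 becomes the healthy mean
  fc
}

# One ordinary ratio followed by the three special cases
adjustedFoldChange(diseaseMean = c(9.5, 2, 0, 0),
                   healthyMean = c(19, 0, 0, 4))
# returns 0.5 2.0 1.0 4.0
```

The flags are computed from the two means rather than from the ratio itself, so each replacement is applied independently of the others.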
dataAllFCs <- read.csv('DATA_FCs_GSE152418-monotonicGenes.csv', sep=',', header=TRUE,
                       na.strings=c('',' ','NA'),
                       stringsAsFactors = FALSE)
colnames(dataAllFCs)
## [1] "ENSEMBLID" "convalescent" "healthy1"
## [4] "healthy2" "healthy3" "healthy4"
## [7] "healthy5" "healthy6" "healthy7"
## [10] "healthy8" "healthy9" "healthy10"
## [13] "healthy11" "healthy12" "healthy13"
## [16] "healthy14" "healthy15" "healthy16"
## [19] "healthy17" "moderate1" "moderate2"
## [22] "moderate3" "moderate4" "severe1"
## [25] "severe2" "severe3" "severe4"
## [28] "severe5" "severe6" "severe7"
## [31] "severe8" "ICU1" "ICU2"
## [34] "ICU3" "ICU4" "healthy_mean"
## [37] "moderate_mean" "severe_mean" "ICU_mean"
## [40] "mod_health_foldChange" "sevr_health_foldChange" "ICU_health_foldChange"
## [43] "monotonicIncrease" "monotonicDecrease"
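The replacement rule described above for the fold changes can be sketched as follows. This is only a minimal illustration on made-up means, not the code that produced the csv file; `safeFoldChange` is a hypothetical helper name, and I am assuming the fold change is the disease mean divided by the healthy mean, as the column values suggest.

```r
# Hypothetical helper (not from the original scripts): fold change of a
# disease mean over the healthy mean, with the three edge cases replaced
# as described in the text.
safeFoldChange <- function(diseaseMean, healthyMean) {
  fc <- diseaseMean / healthyMean
  # z/0 gives Inf: use the disease mean instead
  infCase <- is.infinite(fc)
  fc[infCase] <- diseaseMean[infCase]
  # 0/0 gives NaN: use 1, meaning no change
  fc[is.nan(fc)] <- 1
  # 0/z gives 0: use the healthy mean instead
  zeroCase <- diseaseMean == 0 & healthyMean != 0
  fc[zeroCase] <- healthyMean[zeroCase]
  fc
}

# Toy means covering the ordinary case and the three edge cases
toy <- data.frame(healthy_mean  = c(100, 0, 0, 4),
                  moderate_mean = c(52.75, 3, 0, 0))
toy$mod_health_foldChange <- safeFoldChange(toy$moderate_mean,
                                            toy$healthy_mean)
toy
```

The first toy row is an ordinary ratio; the other three rows each hit one of the replaced cases.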
That is a lot of genes to combine with our find25genes function and summarise, so let’s first get just the vitamins, minerals, and hormones we are interested in, combine those into one data set, and then add the HGNC gene symbols and the summaries to plot and discuss our findings.
source('geneCards2.R')
find25genes('vitamin D')
find25genes('melanin')
find25genes('vitamin C')
find25genes('glucose')
find25genes('insulin')
find25genes('glucagon')
find25genes('dopamine')
find25genes('estrogen')
find25genes('progesterone')
find25genes('prolactin')
find25genes('testosterone')
find25genes('calcium')
find25genes('melatonin')
find25genes('vitamin B12')
find25genes('zinc')
find25genes('magnesium')
find25genes('fish oil')
find25genes('omega 3s')
find25genes('adrenaline')
find25genes('corticosteroids')
find25genes('calcitonine')
find25genes('iron')
getProteinGenes('vitamin D')
getProteinGenes('melanin')
getProteinGenes('vitamin C')
getProteinGenes('glucose')
getProteinGenes('insulin')
getProteinGenes('glucagon')
getProteinGenes('dopamine')
getProteinGenes('estrogen')
getProteinGenes('progesterone')
getProteinGenes('prolactin')
getProteinGenes('testosterone')
getProteinGenes('calcium')
getProteinGenes('melatonin')
getProteinGenes('vitamin B12')
getProteinGenes('zinc')
getProteinGenes('magnesium')
getProteinGenes('fish oil')
getProteinGenes('omega 3s')
getProteinGenes('adrenaline')
getProteinGenes('corticosteroids')
getProteinGenes('calcitonine')
getProteinGenes('iron')
vitD <- read.csv('Top25vitamin-ds.csv')
melanin <- read.csv('Top25melanins.csv')
vitC <- read.csv('Top25vitamin-cs.csv')
glucose <- read.csv('Top25glucoses.csv')
insulin <- read.csv('Top25insulins.csv')
glucagon <- read.csv('Top25glucagons.csv')
dopamine <- read.csv('Top25dopamines.csv')
estrogen <- read.csv('Top25estrogens.csv')
progesterone <- read.csv('Top25progesterones.csv')
prolactin <- read.csv('Top25prolactins.csv')
testosterone <- read.csv('Top25testosterones.csv')
calcium <- read.csv('Top25calciums.csv')
melatonin <- read.csv('Top25melatonins.csv')
vitB12 <- read.csv('Top25vitamin-b12s.csv')
zinc <- read.csv('Top25zincs.csv')
magnesium <- read.csv('Top25magnesiums.csv')
fishOil <- read.csv('Top25fish-oils.csv')
omega3s <- read.csv('Top25omega-3ss.csv')
adrenaline <- read.csv('Top25adrenalines.csv')
corticosteroid <- read.csv('Top25corticosteroidss.csv')
calcitonine <- read.csv('Top25calcitonines.csv')
iron <- read.csv('Top25irons.csv')
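The file names read above appear to follow one pattern: the search term in lower case, spaces turned into hyphens, and a trailing 's'. Assuming that pattern holds for every term, the names can be built from the search terms instead of typed by hand. `top25File` is a hypothetical helper of mine and is not part of geneCards2.R.

```r
# A few of the search terms used above, for illustration
terms <- c('vitamin D', 'melanin', 'vitamin B12', 'omega 3s')

# Hypothetical helper: file name pattern inferred from the read.csv()
# calls above (lower case, spaces to hyphens, trailing 's')
top25File <- function(term) {
  paste0('Top25', gsub(' ', '-', tolower(term)), 's.csv')
}

files <- vapply(terms, top25File, character(1))
files

# With the csv files on disk, all tables could be read into one named list:
# top25 <- lapply(files, read.csv)
```

A named list like this would make the later steps easier to loop over, but the one-variable-per-term approach above gives the same tables.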
Let’s take only the top 3 genes from each vitamin, mineral, or hormone data frame.
vitMinSter <- rbind(vitD[1:3,1:2],melanin[1:3,1:2],
vitC[1:3,1:2],glucose[1:3,1:2],
insulin[1:3,1:2],glucagon[1:3,1:2],
dopamine[1:3,1:2],estrogen[1:3,1:2],
progesterone[1:3,1:2],prolactin[1:3,1:2],
testosterone[1:3,1:2],calcium[1:3,1:2],
calcitonine[1:3,1:2],melatonin[1:3,1:2],
vitB12[1:3,1:2],zinc[1:3,1:2],magnesium[1:3,1:2],
fishOil[1:3,1:2],omega3s[1:3,1:2],
adrenaline[1:3,1:2],iron[1:3,1:2],
corticosteroid[1:3,1:2])
head(vitMinSter)
## proteinType proteinSearched
## 1 VDR vitamin-d
## 2 CYP27B1 vitamin-d
## 3 PHEX vitamin-d
## 4 TYR melanin
## 5 TYRP1 melanin
## 6 OCA2 melanin
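The same top 3 selection can also be written over a list of tables, which is a rough sketch of an alternative rather than what was run here. The two toy tables below reuse the gene names from the output above; the fourth row in each is a made-up placeholder.

```r
# Toy stand-ins for two of the Top25 tables (same column names as above)
vitD_toy <- data.frame(
  proteinType = c('VDR', 'CYP27B1', 'PHEX', 'placeholderA'),
  proteinSearched = 'vitamin-d', stringsAsFactors = FALSE)
melanin_toy <- data.frame(
  proteinType = c('TYR', 'TYRP1', 'OCA2', 'placeholderB'),
  proteinSearched = 'melanin', stringsAsFactors = FALSE)

tables <- list(vitD_toy, melanin_toy)

# head() keeps at most 3 rows, so a table with fewer than 3 genes does not
# produce the NA rows that indexing with [1:3, ] would
top3 <- do.call(rbind, lapply(tables, function(df) head(df[, 1:2], 3)))
top3
```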
Some of the genes associated with one vitamin are also associated with another. We will keep them this way for the visualizations and charts. A link analysis could be made of the genes that are associated with more than one vitamin or mineral; it is not done here, but it would be worth doing.
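A starting point for the link analysis mentioned above is to find which genes show up under more than one search term. The sketch below uses a toy table with made-up gene names, since I have not checked which real genes are shared.

```r
# Toy gene-to-search-term table; the gene names are made up
links <- data.frame(
  proteinType = c('GENE_A', 'GENE_B', 'GENE_A', 'GENE_C'),
  proteinSearched = c('calcium', 'calcium', 'vitamin-d', 'zinc'),
  stringsAsFactors = FALSE
)

# Count the distinct search terms each gene appears under
termsPerGene <- tapply(links$proteinSearched, links$proteinType,
                       function(x) length(unique(x)))
sharedGenes <- names(termsPerGene)[termsPerGene > 1]
sharedGenes

# The rows for the shared genes form a simple edge list (gene, term)
links[links$proteinType %in% sharedGenes, ]
```

Run on vitMinSter, the edge list would be the input a network or link chart needs.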
Let’s now get the summaries of these genes.
for (i in vitMinSter$proteinType){
getSummaries2(i,'protein')
}
getGeneSummaries('protein')
I only want the 2nd through 6th columns.
vitMinSterSumms <- read.csv("proteinGeneSummaries_protein.csv")
vitMinSterSumms2 <- vitMinSterSumms[,2:6]
head(vitMinSterSumms2)
## gene EnsemblID
## 1 VDR ENSG00000111424
## 2 CYP27B1 ENSG00000111012
## 3 PHEX ENSG00000102174
## 4 TYR ENSG00000077498
## 5 TYRP1 ENSG00000107165
## 6 OCA2 ENSG00000104044
## EntrezSummary
## 1 This gene encodes vitamin D3 receptor, which is a member of the nuclear hormone receptor superfamily of ligand-inducible transcription factors. This receptor also functions as a receptor for the secondary bile acid, lithocholic acid. Downstream targets of vitamin D3 receptor are principally involved in mineral metabolism, though this receptor regulates a variety of other metabolic pathways, such as those involved in immune response and cancer. Mutations in this gene are associated with type II vitamin D-resistant rickets. A single nucleotide polymorphism in the initiation codon results in an alternate translation start site three codons downstream. Alternatively spliced transcript variants encoding different isoforms have been described for this gene. A recent study provided evidence for translational readthrough in this gene, and expression of an additional C-terminally extended isoform via the use of an alternative in-frame translation termination codon. [provided by RefSeq, Jun 2018]
## 2 This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins are monooxygenases which catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids. The protein encoded by this gene localizes to the inner mitochondrial membrane where it hydroxylates 25-hydroxyvitamin D3 at the 1alpha position. This reaction synthesizes 1alpha,25-dihydroxyvitamin D3, the active form of vitamin D3, which binds to the vitamin D receptor and regulates calcium metabolism. Thus this enzyme regulates the level of biologically active vitamin D and plays an important role in calcium homeostasis. Mutations in this gene can result in vitamin D-dependent rickets type I. [provided by RefSeq, Jul 2008]
## 3 The protein encoded by this gene is a transmembrane endopeptidase that belongs to the type II integral membrane zinc-dependent endopeptidase family. The protein is thought to be involved in bone and dentin mineralization and renal phosphate reabsorption. Mutations in this gene cause X-linked hypophosphatemic rickets. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Sep 2013]
## 4 The enzyme encoded by this gene catalyzes the first 2 steps, and at least 1 subsequent step, in the conversion of tyrosine to melanin. The enzyme has both tyrosine hydroxylase and dopa oxidase catalytic activities, and requires copper for function. Mutations in this gene result in oculocutaneous albinism, and nonpathologic polymorphisms result in skin pigmentation variation. The human genome contains a pseudogene similar to the 3' half of this gene. [provided by RefSeq, Oct 2008]
## 5 This gene encodes a melanosomal enzyme that belongs to the tyrosinase family and plays an important role in the melanin biosynthetic pathway. Defects in this gene are the cause of rufous oculocutaneous albinism and oculocutaneous albinism type III. [provided by RefSeq, Mar 2009]
## 6 This gene encodes the human homolog of the mouse p (pink-eyed dilution) gene. The encoded protein is believed to be an integral membrane protein involved in small molecule transport, specifically tyrosine, which is a precursor to melanin synthesis. It is involved in mammalian pigmentation, where it may control skin color variation and act as a determinant of brown or blue eye color. Mutations in this gene result in type 2 oculocutaneous albinism. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jul 2014]
## GeneCardsSummary
## 1 VDR (Vitamin D Receptor) is a Protein Coding gene. Diseases associated with VDR include Vitamin D-Dependent Rickets, Type 2A and Rickets. Among its related pathways are Development_Hedgehog and PTH signaling pathways in bone and cartilage development and Tuberculosis. Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and steroid hormone receptor activity. An important paralog of this gene is NR1I2.
## 2 CYP27B1 (Cytochrome P450 Family 27 Subfamily B Member 1) is a Protein Coding gene. Diseases associated with CYP27B1 include Vitamin D Hydroxylation-Deficient Rickets, Type 1A and Hypocalcemic Vitamin D-Dependent Rickets. Among its related pathways are Cytochrome P450 - arranged by substrate type and Tuberculosis. Gene Ontology (GO) annotations related to this gene include iron ion binding and oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen. An important paralog of this gene is CYP27A1.
## 3 PHEX (Phosphate Regulating Endopeptidase Homolog X-Linked) is a Protein Coding gene. Diseases associated with PHEX include Hypophosphatemic Rickets, X-Linked Dominant and Hypophosphatemic Rickets, X-Linked Recessive. Gene Ontology (GO) annotations related to this gene include metalloendopeptidase activity and aminopeptidase activity. An important paralog of this gene is MMEL1.
## 4 TYR (Tyrosinase) is a Protein Coding gene. Diseases associated with TYR include Albinism, Oculocutaneous, Type Ia and Albinism, Oculocutaneous, Type Ib. Among its related pathways are (S)-reticuline biosynthesis and Tyrosine metabolism. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity. An important paralog of this gene is TYRP1.
## 5 TYRP1 (Tyrosinase Related Protein 1) is a Protein Coding gene. Diseases associated with TYRP1 include Albinism, Oculocutaneous, Type Iii and Skin/Hair/Eye Pigmentation, Variation In, 11. Among its related pathways are Aldosterone synthesis and secretion and Viral mRNA Translation. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity. An important paralog of this gene is DCT.
## 6 OCA2 (OCA2 Melanosomal Transmembrane Protein) is a Protein Coding gene. Diseases associated with OCA2 include Albinism, Oculocutaneous, Type Ii and Skin/Hair/Eye Pigmentation, Variation In, 1. Among its related pathways are Viral mRNA Translation and Metabolism. Gene Ontology (GO) annotations related to this gene include transporter activity and L-tyrosine transmembrane transporter activity. An important paralog of this gene is SLC13A2.
## UniProtKB_Summary
## 1 Nuclear receptor for calcitriol, the active form of vitamin D3 which mediates the action of this vitamin on cells (PubMed:28698609, PubMed:16913708, PubMed:15728261, PubMed:10678179). Enters the nucleus upon vitamin D3 binding where it forms heterodimers with the retinoid X receptor/RXR (PubMed:28698609). The VDR-RXR heterodimers bind to specific response elements on DNA and activate the transcription of vitamin D3-responsive target genes (PubMed:28698609). Plays a central role in calcium homeostasis (By similarity).\n VDR_HUMAN,P11473\n
## 2 A cytochrome P450 monooxygenase involved in vitamin D metabolism and in calcium and phosphorus homeostasis. Catalyzes the rate-limiting step in the activation of vitamin D in the kidney, namely the hydroxylation of 25-hydroxyvitamin D3/calcidiol at the C1alpha-position to form the hormonally active form of vitamin D3, 1alpha,25-dihydroxyvitamin D3/calcitriol that acts via the vitamin D receptor (VDR) (PubMed:10518789, PubMed:9486994, PubMed:22862690, PubMed:10566658, PubMed:12050193). Has 1alpha-hydroxylase activity on vitamin D intermediates of the CYP24A1-mediated inactivation pathway (PubMed:10518789, PubMed:22862690). Converts 24R,25-dihydroxyvitamin D3/secalciferol to 1-alpha,24,25-trihydroxyvitamin D3, an active ligand of VDR. Also active on 25-hydroxyvitamin D2 (PubMed:10518789). Mechanistically, uses molecular oxygen inserting one oxygen atom into a substrate, and reducing the second into a water molecule, with two electrons provided by NADPH via FDXR/adrenodoxin reductase and FDX1/adrenodoxin (PubMed:22862690).\n CP27B_HUMAN,O15528\n
## 3 Peptidase that cleaves SIBLING (small integrin-binding ligand, N-linked glycoprotein)-derived ASARM peptides, thus regulating their biological activity (PubMed:9593714, PubMed:15664000, PubMed:18162525, PubMed:18597632). Cleaves ASARM peptides between Ser and Glu or Asp residues (PubMed:18597632). Regulates osteogenic cell differentiation and bone mineralization through the cleavage of the MEPE-derived ASARM peptide (PubMed:18597632). Promotes dentin mineralization and renal phosphate reabsorption by cleaving DMP1- and MEPE-derived ASARM peptides (PubMed:18597632, PubMed:18162525). Inhibits the cleavage of MEPE by CTSB/cathepsin B thus preventing MEPE degradation (PubMed:12220505).\n PHEX_HUMAN,P78562\n
## 4 This is a copper-containing oxidase that functions in the formation of pigments such as melanins and other polyphenolic compounds. Catalyzes the initial and rate limiting step in the cascade of reactions leading to melanin production from tyrosine. In addition to hydroxylating tyrosine to DOPA (3,4-dihydroxyphenylalanine), also catalyzes the oxidation of DOPA to DOPA-quinone, and possibly the oxidation of DHI (5,6-dihydroxyindole) to indole-5,6 quinone.\n TYRO_HUMAN,P14679\n
## 5 Plays a role in melanin biosynthesis (PubMed:22556244, PubMed:16704458). Catalyzes the oxidation of 5,6-dihydroxyindole-2-carboxylic acid (DHICA) into indole-5,6-quinone-2-carboxylic acid in the presence of bound Cu(2+) ions, but not in the presence of Zn(2+) (PubMed:28661582). May regulate or influence the type of melanin synthesized (PubMed:22556244, PubMed:16704458). Also to a lower extent, capable of hydroxylating tyrosine and producing melanin (By similarity).\n TYRP1_HUMAN,P17643\n
## 6 Could be involved in the transport of tyrosine, the precursor to melanin synthesis, within the melanocyte. Regulates the pH of melanosome and the melanosome maturation. One of the components of the mammalian pigmentary system. Seems to regulate the post-translational processing of tyrosinase, which catalyzes the limiting reaction in melanin synthesis. May serve as a key control point at which ethnic skin color variation is determined. Major determinant of brown and/or blue eye color.\n P_HUMAN,Q04671\n
Merge the last two data frames on the gene symbol, so that each gene keeps both the vitamin that was searched and its summaries.
vitamins <- merge(vitMinSter,vitMinSterSumms2,
by.x='proteinType',
by.y='gene')
vitamins2 <- vitamins[!duplicated(vitamins),]
head(vitamins2)
## proteinType proteinSearched EnsemblID
## 1 AANAT melatonin ENSG00000129673
## 2 ANKH calcium ENSG00000154122
## 3 APOA1 fish-oil ENSG00000118137
## 4 APOB fish-oil ENSG00000084674
## 5 CACNA1B omega-3s ENSG00000148408
## 6 CALCA calcitonine ENSG00000110680
## EntrezSummary
## 1 The protein encoded by this gene belongs to the acetyltransferase superfamily. It is the penultimate enzyme in melatonin synthesis and controls the night/day rhythm in melatonin production in the vertebrate pineal gland. Melatonin is essential for the function of the circadian clock that influences activity and sleep. This enzyme is regulated by cAMP-dependent phosphorylation that promotes its interaction with 14-3-3 proteins and thus protects the enzyme against proteasomal degradation. This gene may contribute to numerous genetic diseases such as delayed sleep phase syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Oct 2009]
## 2 This gene encodes a multipass transmembrane protein that is expressed in joints and other tissues and controls pyrophosphate levels in cultured cells. Progressive ankylosis-mediated control of pyrophosphate levels has been suggested as a possible mechanism regulating tissue calcification and susceptibility to arthritis in higher animals. Mutations in this gene have been associated with autosomal dominant craniometaphyseal dysplasia. [provided by RefSeq, Jul 2008]
## 3 This gene encodes apolipoprotein A-I, which is the major protein component of high density lipoprotein (HDL) in plasma. The encoded preproprotein is proteolytically processed to generate the mature protein, which promotes cholesterol efflux from tissues to the liver for excretion, and is a cofactor for lecithin cholesterolacyltransferase (LCAT), an enzyme responsible for the formation of most plasma cholesteryl esters. This gene is closely linked with two other apolipoprotein genes on chromosome 11. Defects in this gene are associated with HDL deficiencies, including Tangier disease, and with systemic non-neuropathic amyloidosis. Alternative splicing results in multiple transcript variants, at least one of which encodes a preproprotein. [provided by RefSeq, Dec 2015]
## 4 This gene product is the main apolipoprotein of chylomicrons and low density lipoproteins (LDL), and is the ligand for the LDL receptor. It occurs in plasma as two main isoforms, apoB-48 and apoB-100: the former is synthesized exclusively in the gut and the latter in the liver. The intestinal and the hepatic forms of apoB are encoded by a single gene from a single, very long mRNA. The two isoforms share a common N-terminal sequence. The shorter apoB-48 protein is produced after RNA editing of the apoB-100 transcript at residue 2180 (CAA->UAA), resulting in the creation of a stop codon, and early translation termination. Mutations in this gene or its regulatory region cause hypobetalipoproteinemia, normotriglyceridemic hypobetalipoproteinemia, and hypercholesterolemia due to ligand-defective apoB, diseases affecting plasma cholesterol and apoB levels. [provided by RefSeq, Dec 2019]
## 5 The protein encoded by this gene is the pore-forming subunit of an N-type voltage-dependent calcium channel, which controls neurotransmitter release from neurons. The encoded protein forms a complex with alpha-2, beta, and delta subunits to form the high-voltage activated channel. This channel is sensitive to omega-conotoxin-GVIA and omega-agatoxin-IIIA but insensitive to dihydropyridines. Two transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Aug 2011]
## 6 This gene encodes the peptide hormones calcitonin, calcitonin gene-related peptide and katacalcin by tissue-specific alternative RNA splicing of the gene transcripts and cleavage of inactive precursor proteins. Calcitonin is involved in calcium regulation and acts to regulate phosphorus metabolism. Calcitonin gene-related peptide functions as a vasodilator and as an antimicrobial peptide while katacalcin is a calcium-lowering peptide. Multiple transcript variants encoding different isoforms have been found for this gene.[provided by RefSeq, Aug 2014]
## GeneCardsSummary
## 1 AANAT (Aralkylamine N-Acetyltransferase) is a Protein Coding gene. Diseases associated with AANAT include Dissociative Amnesia and Baastrup's Syndrome. Among its related pathways are superpathway of tryptophan utilization and Tryptophan metabolism. Gene Ontology (GO) annotations related to this gene include N-acetyltransferase activity and arylamine N-acetyltransferase activity.
## 2 ANKH (ANKH Inorganic Pyrophosphate Transport Regulator) is a Protein Coding gene. Diseases associated with ANKH include Craniometaphyseal Dysplasia, Autosomal Dominant and Chondrocalcinosis 2. Among its related pathways are Transport of glucose and other sugars, bile salts and organic acids, metal ions and amine compounds and Miscellaneous transport and binding events. Gene Ontology (GO) annotations related to this gene include inorganic phosphate transmembrane transporter activity and inorganic diphosphate transmembrane transporter activity.
## 3 APOA1 (Apolipoprotein A1) is a Protein Coding gene. Diseases associated with APOA1 include Hypoalphalipoproteinemia, Primary, 2 and Amyloidosis, Familial Visceral. Among its related pathways are Lipoprotein metabolism and Folate Metabolism. Gene Ontology (GO) annotations related to this gene include identical protein binding and lipid binding. An important paralog of this gene is APOA4.
## 4 APOB (Apolipoprotein B) is a Protein Coding gene. Diseases associated with APOB include Hypobetalipoproteinemia, Familial, 1 and Hypercholesterolemia, Familial, 2. Among its related pathways are Activated TLR4 signalling and Lipoprotein metabolism. Gene Ontology (GO) annotations related to this gene include binding and heparin binding.
## 5 CACNA1B (Calcium Voltage-Gated Channel Subunit Alpha1 B) is a Protein Coding gene. Diseases associated with CACNA1B include Neurodevelopmental Disorder With Seizures And Nonepileptic Hyperkinetic Movements and Undetermined Early-Onset Epileptic Encephalopathy. Among its related pathways are Nicotine addiction and ADP signalling through P2Y purinoceptor 12. Gene Ontology (GO) annotations related to this gene include calcium ion binding and ion channel activity. An important paralog of this gene is CACNA1A.
## 6 CALCA (Calcitonin Related Polypeptide Alpha) is a Protein Coding gene. Diseases associated with CALCA include Reflex Sympathetic Dystrophy and Spinal Stenosis. Among its related pathways are Neuroscience and Signaling by GPCR. Gene Ontology (GO) annotations related to this gene include identical protein binding. An important paralog of this gene is CALCB.
## UniProtKB_Summary
## 1 Controls the night/day rhythm of melatonin production in the pineal gland. Catalyzes the N-acetylation of serotonin into N-acetylserotonin, the penultimate step in the synthesis of melatonin.\n SNAT_HUMAN,Q16613\n
## 2 Regulates intra- and extracellular levels of inorganic pyrophosphate (PPi), probably functioning as PPi transporter.\n ANKH_HUMAN,Q9HCJ1\n
## 3 Participates in the reverse transport of cholesterol from tissues to the liver for excretion by promoting cholesterol efflux from tissues and by acting as a cofactor for the lecithin cholesterol acyltransferase (LCAT). As part of the SPAP complex, activates spermatozoa motility.\n APOA1_HUMAN,P02647\n
## 4 Apolipoprotein B is a major protein constituent of chylomicrons (apo B-48), LDL (apo B-100) and VLDL (apo B-100). Apo B-100 functions as a recognition signal for the cellular binding and internalization of LDL particles by the apoB/E receptor.\n APOB_HUMAN,P04114\n
## 5 Voltage-sensitive calcium channels (VSCC) mediate the entry of calcium ions into excitable cells and are also involved in a variety of calcium-dependent processes, including muscle contraction, hormone or neurotransmitter release, gene expression, cell motility, cell division and cell death. The isoform alpha-1B gives rise to N-type calcium currents. N-type calcium channels belong to the 'high-voltage activated' (HVA) group and are specifically blocked by omega-conotoxin-GVIA (AC P01522) (AC P01522) (By similarity). They are however insensitive to dihydropyridines (DHP). Calcium channels containing alpha-1B subunit may play a role in directed migration of immature neurons.\n CAC1B_HUMAN,Q00975\n
## 6 CGRP induces vasodilation. It dilates a variety of vessels including the coronary, cerebral and systemic vasculature. Its abundance in the CNS also points toward a neurotransmitter or neuromodulator role. It also elevates platelet cAMP.\n CALCA_HUMAN,P06881\n
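To show what the merge and the duplicate removal above are doing, here is a small sketch on toy rows. I am assuming the duplicates come from the same gene and search term appearing more than once before the merge; the two Ensembl IDs are the ones shown in the output above.

```r
# Toy search table with one repeated row
searched <- data.frame(
  proteinType = c('VDR', 'VDR', 'CALCA'),
  proteinSearched = c('vitamin-d', 'vitamin-d', 'calcitonine'),
  stringsAsFactors = FALSE)

# Toy summary table keyed by a differently named column
summaries <- data.frame(
  gene = c('CALCA', 'VDR'),
  EnsemblID = c('ENSG00000110680', 'ENSG00000111424'),
  stringsAsFactors = FALSE)

# by.x / by.y let the key columns have different names in each table
merged <- merge(searched, summaries, by.x = 'proteinType', by.y = 'gene')
nrow(merged)

# duplicated() flags rows that repeat an earlier row in every column
deduped <- merged[!duplicated(merged), ]
deduped
```

A gene listed under two different search terms is not a duplicate in this sense, so it stays in the table once per term.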
Let’s now merge this data frame with the dataAllFCs data frame by Ensembl ID.
vitaminsDF <- merge(vitamins2,dataAllFCs,
by.x='EnsemblID',
by.y='ENSEMBLID')
head(vitaminsDF)
## EnsemblID proteinType proteinSearched
## 1 ENSG00000004948 CALCR calcitonine
## 2 ENSG00000007171 NOS2 omega-3s
## 3 ENSG00000010704 HFE iron
## 4 ENSG00000036828 CASR calcium
## 5 ENSG00000064835 POU1F1 prolactin
## 6 ENSG00000064989 CALCRL calcitonine
## EntrezSummary
## 1 This gene encodes a high affinity receptor for the peptide hormone calcitonin and belongs to a subfamily of seven transmembrane-spanning G protein-coupled receptors. The encoded protein is involved in maintaining calcium homeostasis and in regulating osteoclast-mediated bone resorption. Polymorphisms in this gene have been associated with variations in bone mineral density and onset of osteoporosis. Alternate splicing results in multiple transcript variants. [provided by RefSeq, Sep 2009]
## 2 Nitric oxide is a reactive free radical which acts as a biologic mediator in several processes, including neurotransmission and antimicrobial and antitumoral activities. This gene encodes a nitric oxide synthase which is expressed in liver and is inducible by a combination of lipopolysaccharide and certain cytokines. Three related pseudogenes are located within the Smith-Magenis syndrome region on chromosome 17. [provided by RefSeq, Jul 2008]
## 3 The protein encoded by this gene is a membrane protein that is similar to MHC class I-type proteins and associates with beta2-microglobulin (beta2M). It is thought that this protein functions to regulate iron absorption by regulating the interaction of the transferrin receptor with transferrin. The iron storage disorder, hereditary haemochromatosis, is a recessive genetic disorder that results from defects in this gene. At least nine alternatively spliced variants have been described for this gene. Additional variants have been found but their full-length nature has not been determined. [provided by RefSeq, Jul 2008]
## 4 The protein encoded by this gene is a plasma membrane G protein-coupled receptor that senses small changes in circulating calcium concentration. The encoded protein couples this information to intracellular signaling pathways that modify parathyroid hormone secretion or renal cation handling, and thus this protein plays an essential role in maintaining mineral ion homeostasis. Mutations in this gene are a cause of familial hypocalciuric hypercalcemia, neonatal severe hyperparathyroidism, and autosomal dominant hypocalcemia. [provided by RefSeq, Aug 2017]
## 5 This gene encodes a member of the POU family of transcription factors that regulate mammalian development. The protein regulates expression of several genes involved in pituitary development and hormone expression. Mutations in this genes result in combined pituitary hormone deficiency. Multiple transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Jul 2008]
## 6 CALCRL (Calcitonin Receptor Like Receptor) is a Protein Coding gene. Diseases associated with CALCRL include Lymphatic Malformation 8 and Primary Angle-Closure Glaucoma. Among its related pathways are Signaling by GPCR and G alpha (s) signalling events. Gene Ontology (GO) annotations related to this gene include G protein-coupled receptor activity and protein transporter activity. An important paralog of this gene is CALCR.
## GeneCardsSummary
## 1 CALCR (Calcitonin Receptor) is a Protein Coding gene. Diseases associated with CALCR include Osteoporosis and Bone Mineral Density Quantitative Trait Locus 15. Among its related pathways are Signaling by GPCR and G alpha (s) signalling events. Gene Ontology (GO) annotations related to this gene include G protein-coupled receptor activity and transmembrane signaling receptor activity. An important paralog of this gene is CALCRL.
## 2 NOS2 (Nitric Oxide Synthase 2) is a Protein Coding gene. Diseases associated with NOS2 include Malaria and Meningioma, Radiation-Induced. Among its related pathways are Tuberculosis and VEGF Signaling. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and oxidoreductase activity. An important paralog of this gene is NOS1.
## 3 HFE (Homeostatic Iron Regulator) is a Protein Coding gene. Diseases associated with HFE include Hemochromatosis, Type 1 and Transferrin Serum Level Quantitative Trait Locus 2. Among its related pathways are Hfe effect on hepcidin production. Gene Ontology (GO) annotations related to this gene include signaling receptor binding and peptide antigen binding. An important paralog of this gene is HLA-A.
## 4 CASR (Calcium Sensing Receptor) is a Protein Coding gene. Diseases associated with CASR include Hypocalcemia, Autosomal Dominant 1 and Hyperparathyroidism, Neonatal Severe. Among its related pathways are RET signaling and Signaling by GPCR. Gene Ontology (GO) annotations related to this gene include G protein-coupled receptor activity and protein kinase binding. An important paralog of this gene is GPRC6A.
## 5 POU1F1 (POU Class 1 Homeobox 1) is a Protein Coding gene. Diseases associated with POU1F1 include Pituitary Hormone Deficiency, Combined, 1 and Isolated Growth Hormone Deficiency, Type Ii. Among its related pathways are Relaxin signaling pathway and Glucocorticoid receptor regulatory network. Gene Ontology (GO) annotations related to this gene include DNA-binding transcription factor activity and chromatin binding. An important paralog of this gene is POU3F3.
## 6 CALCRL (Calcitonin Receptor Like Receptor) is a Protein Coding gene. Diseases associated with CALCRL include Lymphatic Malformation 8 and Primary Angle-Closure Glaucoma. Among its related pathways are Signaling by GPCR and G alpha (s) signalling events. Gene Ontology (GO) annotations related to this gene include G protein-coupled receptor activity and protein transporter activity. An important paralog of this gene is CALCR.
## UniProtKB_Summary
## 1 This is a receptor for calcitonin. The activity of this receptor is mediated by G proteins which activate adenylyl cyclase. The calcitonin receptor is thought to couple to the heterotrimeric guanosine triphosphate-binding protein that is sensitive to cholera toxin.\n CALCR_HUMAN,P30988\n
## 2 Produces nitric oxide (NO) which is a messenger molecule with diverse functions throughout the body (PubMed:7531687, PubMed:7544004). In macrophages, NO mediates tumoricidal and bactericidal actions. Also has nitrosylase activity and mediates cysteine S-nitrosylation of cytoplasmic target proteins such PTGS2/COX2 (By similarity). As component of the iNOS-S100A8/9 transnitrosylase complex involved in the selective inflammatory stimulus-dependent S-nitrosylation of GAPDH on 'Cys-247' implicated in regulation of the GAIT complex activity and probably multiple targets including ANXA5, EZR, MSN and VIM (PubMed:25417112). Involved in inflammation, enhances the synthesis of proinflammatory mediators such as IL6 and IL8 (PubMed:19688109).\n NOS2_HUMAN,P35228\n
## 3 Binds to transferrin receptor (TFR) and reduces its affinity for iron-loaded transferrin.\n HFE_HUMAN,Q30201\n
## 4 G-protein-coupled receptor that senses changes in the extracellular concentration of calcium ions and plays a key role in maintaining calcium homeostasis (PubMed:7759551, PubMed:8702647, PubMed:8636323, PubMed:8878438, PubMed:17555508, PubMed:19789209, PubMed:21566075, PubMed:22114145, PubMed:23966241, PubMed:25292184, PubMed:25104082, PubMed:26386835, PubMed:25766501, PubMed:22789683). Senses fluctuations in the circulating calcium concentration and modulates the production of parathyroid hormone (PTH) in parathyroid glands (By similarity). The activity of this receptor is mediated by a G-protein that activates a phosphatidylinositol-calcium second messenger system (PubMed:7759551). The G-protein-coupled receptor activity is activated by a co-agonist mechanism: aromatic amino acids, such as Trp or Phe, act concertedly with divalent cations, such as calcium or magnesium, to achieve full receptor activation (PubMed:27434672, PubMed:27386547).\n CASR_HUMAN,P41180\n
## 5 Transcription factor involved in the specification of the lactotrope, somatotrope, and thyrotrope phenotypes in the developing anterior pituitary. Specifically binds to the consensus sequence 5'-TAAAT-3'. Activates growth hormone and prolactin genes (PubMed:22010633, PubMed:26612202).\n PIT1_HUMAN,P28069\n
## 6 Receptor for calcitonin-gene-related peptide (CGRP) together with RAMP1 and receptor for adrenomedullin together with RAMP3 (By similarity). Receptor for adrenomedullin together with RAMP2. The activity of this receptor is mediated by G proteins which activate adenylyl cyclase.\n CALRL_HUMAN,Q16602\n
## convalescent healthy1 healthy2 healthy3 healthy4 healthy5 healthy6 healthy7
## 1 0 2 3 0 0 1 2 4
## 2 3 0 4 0 0 1 2 2
## 3 89 98 146 115 93 149 178 97
## 4 2 7 6 4 1 0 2 5
## 5 0 0 0 0 1 0 2 0
## 6 72 27 50 41 28 78 171 61
## healthy8 healthy9 healthy10 healthy11 healthy12 healthy13 healthy14 healthy15
## 1 1 0 2 0 2 3 0 2
## 2 4 5 3 0 6 2 1 7
## 3 81 121 111 83 64 68 102 80
## 4 5 1 2 4 2 6 4 2
## 5 0 0 0 0 2 0 0 0
## 6 71 70 59 67 73 66 75 79
## healthy16 healthy17 moderate1 moderate2 moderate3 moderate4 severe1 severe2
## 1 0 0 3 1 1 0 3 0
## 2 1 2 1 1 3 5 6 2
## 3 71 43 46 65 67 33 79 78
## 4 0 2 5 3 5 5 6 2
## 5 0 0 0 0 0 1 0 0
## 6 56 61 37 10 14 32 23 37
## severe3 severe4 severe5 severe6 severe7 severe8 ICU1 ICU2 ICU3 ICU4
## 1 0 4 0 0 1 1 0 3 3 1
## 2 5 2 1 0 2 1 0 11 4 1
## 3 57 104 94 95 90 60 51 86 71 65
## 4 5 5 4 4 6 2 0 5 16 4
## 5 0 0 1 1 0 0 0 0 3 0
## 6 47 149 58 71 54 21 43 36 36 37
## healthy_mean moderate_mean severe_mean ICU_mean mod_health_foldChange
## 1 1.2941176 1.25 1.125 1.75 0.9659091
## 2 2.3529412 2.50 2.375 4.00 1.0625000
## 3 100.0000000 52.75 82.125 68.25 0.5275000
## 4 3.1176471 4.50 4.250 6.25 1.4433962
## 5 0.2941176 0.25 0.250 0.75 0.8500000
## 6 66.6470588 23.25 57.500 38.00 0.3488526
## sevr_health_foldChange ICU_health_foldChange monotonicIncrease
## 1 0.8693182 1.3522727 0
## 2 1.0093750 1.7000000 0
## 3 0.8212500 0.6825000 0
## 4 1.3632075 2.0047170 0
## 5 0.8500000 2.5500000 0
## 6 0.8627538 0.5701677 0
## monotonicDecrease
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
We are ready to make our charts from this data to look for patterns, relationships, and outlier samples, and to form hypotheses. Let's write this data frame out to CSV.
write.csv(vitaminsDF,'vitaminsDF.csv',row.names=FALSE)
Let's also attach the header demographic information and tidy the sample types into one column.
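library(tidyr)  # provides gather()
# Stack sample columns 7:40 (the 34 samples) into sampleName/sampleValue pairs
vitDF <- gather(vitaminsDF,key='sampleName',value='sampleValue',7:40)
# Attach the demographics by matching CN_new to the sample names
vitDF2 <- merge(HeaderInformation,vitDF,by.x='CN_new',
                by.y='sampleName')
write.csv(vitDF2,'vitaminsTidyDF.csv',row.names=FALSE)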
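Because merge() keeps only rows whose keys match in both tables by default, a sample whose name is spelled differently in the header table would silently disappear from the tidy data. The sketch below shows one way to check for that on toy stand-ins. The names header_toy, long_toy, unmatched, and dropped_rows are hypothetical, the ages are invented, and the four values are the HFE counts for those samples from the printed table above.

```r
# Toy stand-ins for HeaderInformation and vitDF (hypothetical names, invented ages)
header_toy <- data.frame(CN_new = c("healthy1", "moderate1", "ICU1"),
                         age = c(30, 45, 60),
                         stringsAsFactors = FALSE)
long_toy <- data.frame(sampleName = c("healthy1", "moderate1", "ICU1", "severe1"),
                       sampleValue = c(98, 46, 51, 79),
                       stringsAsFactors = FALSE)

# Sample names in the long table that have no row in the header table
unmatched <- setdiff(long_toy$sampleName, header_toy$CN_new)

# Same style of merge as above: an inner join that drops unmatched samples
merged_toy <- merge(header_toy, long_toy, by.x = "CN_new", by.y = "sampleName")
dropped_rows <- nrow(long_toy) - nrow(merged_toy)

print(unmatched)
print(dropped_rows)
```

If unmatched is not empty for the real tables, passing all.y=TRUE to merge() would keep those samples with NA demographics instead of dropping them.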
I created some visualizations of these genes and uploaded them to Tableau Public: a bar chart of fold change values for all genes, one for the genes whose fold change only increases with severity, and one for the genes whose fold change only decreases with severity, ordered from least to worst (moderate to severe to ICU). The scatter plots add age and gender as dimensions, and the summary of each gene is displayed when hovering over it.
<a href=‘https://public.tableau.com/profile/janis5126#!/vizhome/vitMinSterAllGenesBarchart/vitMinSterAllGenesBarchart?publish=yes’,target=‘blank’>All genes as a bar chart of fold change values, then a scatter plot of raw values for comparison by age, gender, and sample ID:
barchart all Vitamin Genes’ Fold Change Values
A scatter plot by age, gender, and raw sample ID values with the gene and gene summary information:
vitamin Genes’ scatter by age and gender
A bar chart of fold change values by COVID-19 severity for the genes in the vitamin body systems whose fold change only increases from the lowest to the worst severity.
bar chart monotonically increasing vitamin genes
scatter plot of the monotonically increasing vitamin genes
bar chart of those vitamin genes that are monotonically decreasing
Scatter plot of those vitamin genes monotonically decreasing
Many relationships can be seen by looking at the gene trends within the proteins, and at the genes increasing or decreasing by severity of COVID-19, when viewing the interactive plots in Tableau at the links provided before each image above.
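As a minimal sketch of how the monotonicIncrease and monotonicDecrease flags could be derived from the three fold change columns, the code below assumes that 'monotonic' means strictly increasing or strictly decreasing from moderate to severe to ICU. That strictness is an assumption, not something confirmed by the earlier code. The HFE and CASR fold changes are copied from rows 3 and 4 of the printed table, and toyGene is an invented row added only to show a flagged case.

```r
# Fold changes for HFE and CASR from the printed table, plus one invented row
fc <- data.frame(
  gene = c("HFE", "CASR", "toyGene"),
  mod_health_foldChange  = c(0.5275, 1.4433962, 1.1),
  sevr_health_foldChange = c(0.82125, 1.3632075, 1.5),
  ICU_health_foldChange  = c(0.6825, 2.0047170, 2.2)
)

# 1 when the fold change rises at every step: moderate < severe < ICU
fc$monotonicIncrease <- as.integer(
  fc$mod_health_foldChange < fc$sevr_health_foldChange &
  fc$sevr_health_foldChange < fc$ICU_health_foldChange
)

# 1 when the fold change falls at every step: moderate > severe > ICU
fc$monotonicDecrease <- as.integer(
  fc$mod_health_foldChange > fc$sevr_health_foldChange &
  fc$sevr_health_foldChange > fc$ICU_health_foldChange
)

print(fc[, c("gene", "monotonicIncrease", "monotonicDecrease")])
```

Under this rule HFE and CASR both get 0 in each flag, which agrees with the printed table, and only the invented toyGene row is flagged as increasing.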
After the iron genes were added to our data, they were charted by fold change values and also by raw values, age, gender, and severity of COVID-19 disease.
Scatter of gene values by severity, age, and gender
scatter of genes
The image above is the Tableau chart showing only the raw values of the top 3 ranked iron genes in our samples. The first gene is expressed much higher overall in healthy samples, and in all three grades of COVID-19 severity its values sit in a lower range and decrease with age in both genders, except in the ICU grade for females, where they increase with age. The 2nd iron gene is similar, except that its values increase with age in females. The last iron gene is lower in healthy samples, and lower than the other two iron genes, but in the ICU grade of COVID-19 it is increased.
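The age trends described above were read off the Tableau scatter plots. One way to check such a trend numerically is to fit a straight line of expression against age within each group and look at the sign of the slope. The sketch below does this on invented toy values only. The column names age, gender, and sampleValue are assumptions about the tidy data, and none of the numbers come from the study.

```r
# Invented toy values: one gene, four samples per gender (not study data)
toy <- data.frame(
  age = c(25, 40, 55, 70, 30, 45, 60, 75),
  gender = rep(c("F", "M"), each = 4),
  sampleValue = c(90, 80, 70, 60, 50, 55, 60, 65)
)

# Slope of expression on age within each gender; a negative slope means
# the expression decreases with age in that group
age_slope <- sapply(split(toy, toy$gender), function(d) {
  unname(coef(lm(sampleValue ~ age, data = d))["age"])
})

print(age_slope)
```

With only a handful of samples per severity grade and gender, a slope like this is descriptive at best, so it should be read alongside the scatter plot rather than as a test.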
non-monotonic gene fold changes-iron
nonMonotonic Genes-Iron
The image above shows the iron genes that aren’t monotonically increasing or decreasing. Their fold change values are shown for each gene, and all are under expressed in the disease state compared to the healthy state. We created rules for these fold change values to handle means of 0. If the disease state mean was 0 and the healthy mean was not 0, the fold change would be 1 minus the healthy mean. If both the disease state and the healthy state means were 0, the fold change would be 1. If the healthy mean was 0 but the disease mean was not 0, the value would be the disease state mean. Otherwise, when the ratio was not 0, Inf, or NaN, the value stayed as the disease state/healthy state ratio of mean values. In the Tableau chart, the iron gene was clicked on and selected to ‘keep only’ so that only the iron genes are displayed among the nonmonotonic genes. HFE and SLC11A2 are shown. In both, the ICU grade of COVID-19 severity is lower than the severe grade, but the moderate grade is lower than the ICU grade in the HFE gene and higher in the SLC11A2 gene. SLC11A2 is involved in transporter activity and copper ion binding and is associated with anemia (not enough red blood cells), while HFE is associated with hemochromatosis (iron overload), signaling receptor binding, and peptide antigen binding.
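The fold change rules described above can be sketched as a small vectorized function. This is only an illustration of the rules as stated in the text, not the code that produced the table, and the function name safe_fold_change is hypothetical. The first test pair uses the HFE moderate and healthy means from the printed table (52.75 and 100).

```r
# Hypothetical helper that applies the four fold change rules element-wise
safe_fold_change <- function(disease_mean, healthy_mean) {
  ratio <- disease_mean / healthy_mean  # may be NaN or Inf; replaced below
  ifelse(disease_mean == 0 & healthy_mean == 0, 1,   # both means 0
    ifelse(disease_mean == 0, 1 - healthy_mean,      # only disease mean 0
      ifelse(healthy_mean == 0, disease_mean,        # only healthy mean 0
        ratio)))                                     # ordinary ratio
}

# HFE moderate vs healthy, then one case for each special rule
result <- safe_fold_change(disease_mean = c(52.75, 0, 0, 1.5),
                           healthy_mean = c(100,   0, 2, 0))
print(result)
```

One consequence of the second rule is worth keeping in mind when reading the charts: when the healthy mean is above 1 and the disease mean is 0, the value is negative, which an ordinary ratio can never be.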
monotonically decreasing fold change values-iron
monotonic increasing fold changes-iron
The image above shows the monotonically decreasing genes with iron selected as the ‘keep only’ gene to be displayed. The description of this gene, SLC40A1, which is one of the top three ranked iron genes, lists it as associated with hemochromatosis Type I and Type 4 and with iron ion transmembrane transporter activity, and its related pathways include mineral absorption and insulin receptor recycling.
monotonically increasing fold change values-iron not included
monotonically increasing fold changes
We already have this chart, because it shows the genes monotonically increasing from the least to the most severe grade of COVID-19. The only difference is that iron was added to the data set, and it happens not to appear in this chart because its three genes are split between the monotonically decreasing genes and the nonmonotonic genes. Look at the genes that are higher in the least severe case and dramatically lower in the most severe case. Most are hormone genes, but there is also one fish oil gene and a couple of vitamin B12 genes that aren’t dramatically higher in the moderate grade compared to the ICU grade of COVID-19. Estrogen, progesterone, and testosterone are sex hormones. The testosterone gene in our list, GNRH1, encodes gonadotropin-releasing hormone, which is made in the hypothalamus. The hypothalamus sits just above the pituitary gland, and the pituitary in turn releases the hormones (LH and FSH) that signal the ovaries and testes to produce estrogen, progesterone, and testosterone. Wild yams contain diosgenin, a plant compound used as a starting material for manufacturing synthetic progesterone. Adrenaline is the sympathetic hormone produced by the adrenal glands and released in times of extreme stress or fight-or-flight threats to personal safety. If you explore the chart further you will see that TNF, which is listed here under both the corticosteroid and vitamin C genes, is dramatically higher in moderate cases than in ICU cases of COVID-19.