knitr::opts_chunk$set(cache = FALSE)
This markdown file for R is about myofascial pain and taken from the GSE215154 series data that found high throughput sequencing fastp data of 5 healthy samples compared to 7 samples of patients with triggerpoint myofascial pain. A search of the GSE215154 series data at NCBI gene studies will bring you to the series.
I manually downloaded each text file of the samples and the 7 text files for trigger point myofascial pain. This was the most recent study evaluating fibromyalgia patient genomic data I found a month or so ago and am now reviewing it in searching for top genes in myofascial pain or chronic pain patterns, mononucleosis, multiple sclerosis, Hodgkins disease, and Epstein-Barr infections that relate back to the common denominator at some point of life for other pathologies listed. This data science project focuses on chronic pain in those with myofascial pain patterns similar to fibromyalgia that currently does not have any diagnostic gene or associated genes or special lab testing to confirm disease. Currently only a questionnaire is available and monitoring over about 3 months while ruling out autoimmune disorders to diagnose fibromyalgia as a source of pain out of normal threshold values. This disorder was previously an ‘in your head’ disease that females got mostly and doctors would tell them to take OTC for pain and now get lifestyle in check to improve diet, sleep, stress, exercise, massage, adjustments, and meditation.
Lets read in the sample text files separately and combine them into one data frame. The sample information says that the aquisition of the data used a count of the fragments of each gene, then normalized it, and also added in the FPKM values as a final step. These are tab delimited files. When reading in the file per sample you see the gene as well as the count (of fragments then normalized) and FPKM value per gene. Internet search says the normalization is the FPKM values for the counts of fragments in RNA-seq data of transcriptional gene data using a fragment length. The gene length is multiplied by 10 billion or 10^9= 10,000,000,000, then divided by the transcripts of gene X gene length. More Information on FPKM is fragments per kilo million and TPM is transcripts per million and CPM is counts per million.
Lets load in our packages first to use later.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
healthy1 <- read.table('GSM6623971_C1.txt', header=T, sep="")
summary(healthy1)
## gene_id C1_FPKM C1_count
## Length:58302 Min. :0.000e+00 Min. : 0.0
## Class :character 1st Qu.:0.000e+00 1st Qu.: 0.0
## Mode :character Median :0.000e+00 Median : 0.0
## Mean :5.472e+00 Mean : 197.1
## 3rd Qu.:3.379e-01 3rd Qu.: 10.0
## Max. :3.058e+04 Max. :754337.0
head(healthy1)
Read in the other 4 healthy samples as tables
healthy2 <- read.table('GSM6623972_C2.txt', header=T, sep="")
healthy3 <- read.table('GSM6623973_C3.txt', header=T, sep="")
healthy4 <- read.table('GSM6623974_C4.txt', header=T, sep="")
healthy5 <- read.table('GSM6623975_C5.txt', header=T, sep="")
Lets column bind them as they all have the same number of rows and columns, but lets remove one of the variables, the counts, and keep FPKM or fragments per kilomillion of each gene. And remove the other gene columns other than the first table.
colnames(healthy1)[2] <- 'healthy1-FPKM'
colnames(healthy2)[2] <- 'healthy2-FPKM'
colnames(healthy3)[2] <- 'healthy3-FPKM'
colnames(healthy4)[2] <- 'healthy4-FPKM'
colnames(healthy5)[2] <- 'healthy5-FPKM'
healthy5samples <- cbind(healthy1[,-3], healthy2[,-c(1,3)],healthy3[,-c(1,3)], healthy4[,-c(1,3)], healthy5[,-c(1,3)])
colnames(healthy5samples) <- c('Gene','Healthy1','Healthy2','Healthy3','Healthy4','Healthy5')
head(healthy5samples)
Lets do the same with the myofacial 7 samples.
myofascial1 <- read.table('GSM6623976_M1.txt', header=T, sep="")
myofascial2 <- read.table('GSM6623977_M2.txt', header=T, sep="")
myofascial3 <- read.table('GSM6623978_M3.txt', header=T, sep="")
myofascial4 <- read.table('GSM6623979_M4.txt', header=T, sep="")
myofascial5 <- read.table('GSM6623980_M5.txt', header=T, sep="")
myofascial6 <- read.table('GSM6623981_M6.txt', header=T, sep="")
myofascial7 <- read.table('GSM6623982_M7.txt', header=T, sep="")
colnames(myofascial1)[2] <- 'myofascial1-FPKM'
colnames(myofascial2)[2] <- 'myofascial2-FPKM'
colnames(myofascial3)[2] <- 'myofascial3-FPKM'
colnames(myofascial4)[2] <- 'myofascial4-FPKM'
colnames(myofascial5)[2] <- 'myofascial5-FPKM'
colnames(myofascial6)[2] <- 'myofascial6-FPKM'
colnames(myofascial7)[2] <- 'myofascial7-FPKM'
myofascial7samples <- cbind(myofascial1[,-3], myofascial2[,-c(1,3)],myofascial3[,-c(1,3)], myofascial4[,-c(1,3)], myofascial5[,-c(1,3)], myofascial6[,-c(1,3)], myofascial7[,-c(1,3)])
colnames(myofascial7samples) <- c('Gene','myo1','myo2','myo3','myo4','myo5','myo6','myo7')
head(myofascial7samples)
Lets column bind the healthy 5 samples to the 7 myofascial pain samples with the Gene order they all have the same number of rows of.
dataHealthy5Myo7 <- cbind(healthy5samples,myofascial7samples[,-1])
head(dataHealthy5Myo7)
Summarize the data frame just made to find the mean of each sample’s normalized FPKM values within the 58,302 genes.
summary(dataHealthy5Myo7)
## Gene Healthy1 Healthy2 Healthy3
## Length:58302 Min. :0.000e+00 Min. :0.000e+00 Min. : 0.000
## Class :character 1st Qu.:0.000e+00 1st Qu.:0.000e+00 1st Qu.: 0.000
## Mode :character Median :0.000e+00 Median :0.000e+00 Median : 0.000
## Mean :5.472e+00 Mean :5.188e+00 Mean : 5.447
## 3rd Qu.:3.379e-01 3rd Qu.:6.688e-01 3rd Qu.: 0.464
## Max. :3.058e+04 Max. :2.376e+04 Max. :50031.764
## Healthy4 Healthy5 myo1
## Min. : 0.000 Min. : 0.000 Min. :0.000e+00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.:0.000e+00
## Median : 0.000 Median : 0.000 Median :0.000e+00
## Mean : 5.764 Mean : 5.556 Mean :4.861e+00
## 3rd Qu.: 0.426 3rd Qu.: 0.467 3rd Qu.:7.898e-01
## Max. :37775.449 Max. :39829.283 Max. :1.542e+04
## myo2 myo3 myo4
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 0.000 Median : 0.000
## Mean : 5.766 Mean : 5.738 Mean : 5.113
## 3rd Qu.: 0.443 3rd Qu.: 0.309 3rd Qu.: 0.376
## Max. :43180.511 Max. :40626.886 Max. :37108.454
## myo5 myo6 myo7
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 0.000 Median : 0.000
## Mean : 5.415 Mean : 5.175 Mean : 5.864
## 3rd Qu.: 0.341 3rd Qu.: 0.371 3rd Qu.: 0.377
## Max. :41750.951 Max. :37369.240 Max. :41902.286
dim(dataHealthy5Myo7)
## [1] 58302 13
write.csv(dataHealthy5Myo7, 'fragmentsPerKMill_myofascial_5healthy_7myo_58kgenes.csv',row.names=F)
Get this file we made here file
Lets get the sample mean per gene in healthy and myofascial pain samples to compare their fold change values.
healthyMeans <- dataHealthy5Myo7 %>% group_by(Gene) %>%
mutate(
healthy_Mean = mean(Healthy1:Healthy2,na.rm=T),
myo_Mean = mean(myo1:myo7)
)
colnames(healthyMeans)
## [1] "Gene" "Healthy1" "Healthy2" "Healthy3" "Healthy4"
## [6] "Healthy5" "myo1" "myo2" "myo3" "myo4"
## [11] "myo5" "myo6" "myo7" "healthy_Mean" "myo_Mean"
Now get the Median of each sample group.
healthyMeansMedians <- healthyMeans %>% group_by(Gene) %>% mutate(healthyMedian=median(Healthy1:Healthy5, na.rm=T), myoMedian = median(myo1:myo7))
colnames(healthyMeansMedians)
## [1] "Gene" "Healthy1" "Healthy2" "Healthy3"
## [5] "Healthy4" "Healthy5" "myo1" "myo2"
## [9] "myo3" "myo4" "myo5" "myo6"
## [13] "myo7" "healthy_Mean" "myo_Mean" "healthyMedian"
## [17] "myoMedian"
head(healthyMeansMedians)
Looks like the median is the same for mean in the two samples from the first 5 entries in the data. Lets get the fold change of myofascial pain to healthy gene expression FPKM normalized values.
foldchangeHealthyMyo <- healthyMeansMedians %>% mutate(foldchangeHealthyVsMyo=(myo_Mean/(healthy_Mean+0.0000000000001)))
head(foldchangeHealthyMyo)
To avoid the NaN values of healthy 0 in denominator, I added a very very small value of one trillionth of 1. Now lets order the data by fold change from most to least and take the top 20 and bottom 20 genes and make a data frame to do some machine learning on with our 2 classes of healthy and myofascial pain.
FC_ordered <- foldchangeHealthyMyo[order(foldchangeHealthyMyo$foldchangeHealthyVsMyo, decreasing=T),]
head(FC_ordered)
tail(FC_ordered)
There may be a very vast number of genes with a zero value, so the bottom 20 won’t be revealing anything for the gene expression down regulation in myofascial pain.
sum(FC_ordered$foldchangeHealthyVsMyo==0)
## [1] 32916
There are 32,916 genes equal to zero in fold change. Since these values are normalized fragments per kilomillion, we will have to look at the bottom values above 0 that are lowest.
sum(FC_ordered$foldchangeHealthyVsMyo < 0)
## [1] 0
There are not any negative values. but more than 32,000 genes with 0 fold change, won’t give much information if ordering in decreasing order and last 32,000 genes have no information for fold change.
minFC <- subset(FC_ordered, FC_ordered$foldchangeHealthyVsMyo > 0)
tail(minFC)
We removed those 32,000 genes that had no information to add in fold change values for this RNA-seq data, and still have 25,386 genes left. We can now take the bottom 20 and top 20 to make a data frame to use in training a model to predict the best in our 2 classes of healthy or myofascial pain.
minMax20_FC <- minFC[c(1:20,25367:25386),]
dim(minMax20_FC)
## [1] 40 18
That gives us 40 genes to work with. Lets see those genes.These are Ensemble genes. They may or may not have alternate Genecard.org gene names but should have a summary on Genecards.org that we can read all about once we get our top genes.
minMax20_FC$Gene
## [1] "ENSG00000267727" "ENSG00000178372" "ENSG00000172867" "ENSG00000188508"
## [5] "ENSG00000178934" "ENSG00000167768" "ENSG00000114771" "ENSG00000095970"
## [9] "ENSG00000267368" "ENSG00000203782" "ENSG00000189001" "ENSG00000160883"
## [13] "ENSG00000271367" "ENSG00000284627" "ENSG00000204421" "ENSG00000248871"
## [17] "ENSG00000203618" "ENSG00000243466" "ENSG00000105427" "ENSG00000186458"
## [21] "ENSG00000281899" "ENSG00000235590" "ENSG00000125740" "ENSG00000180818"
## [25] "ENSG00000244193" "ENSG00000110848" "ENSG00000081041" "ENSG00000143816"
## [29] "ENSG00000136457" "ENSG00000134184" "ENSG00000163734" "ENSG00000242288"
## [33] "ENSG00000108342" "ENSG00000138135" "ENSG00000107518" "ENSG00000148677"
## [37] "ENSG00000205362" "ENSG00000007908" "ENSG00000255730" "ENSG00000136244"
write.csv(minMax20_FC, 'min20max20geneFoldchangeMyofascialPain.csv', row.names=F)
Get the file we just wrote here
colnames(minMax20_FC)
## [1] "Gene" "Healthy1" "Healthy2"
## [4] "Healthy3" "Healthy4" "Healthy5"
## [7] "myo1" "myo2" "myo3"
## [10] "myo4" "myo5" "myo6"
## [13] "myo7" "healthy_Mean" "myo_Mean"
## [16] "healthyMedian" "myoMedian" "foldchangeHealthyVsMyo"
Make the data frame but keep gene names in a character vector to make header in new transposed data frame to get samples as rows and the 40 genes as the features and add in a class feature for the sample class where we only have 12 samples of 5 healthy and 7 myofascial.
geneNames <- minMax20_FC$Gene
genes40_DF <- data.frame(t(minMax20_FC[,c(2:13)]))
colnames(genes40_DF) <- geneNames
genes40_DF$class <- c('healthy','healthy','healthy','healthy','healthy','myoPain','myoPain','myoPain','myoPain','myoPain','myoPain','myoPain')
head(genes40_DF)
Change the class from character to factor to allow confusionMatrix function to work later in this program.
genes40_DF$class <- as.factor(genes40_DF$class)
summary(genes40_DF)
## ENSG00000267727 ENSG00000178372 ENSG00000172867 ENSG00000188508
## Min. : 0.00000 Min. : 0.0000 Min. : 0.00000 Min. : 0.000
## 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.000
## Median : 0.00000 Median : 0.0000 Median : 0.00000 Median : 0.000
## Mean : 16.50957 Mean : 7.4553 Mean : 3.90308 Mean : 3.192
## 3rd Qu.: 0.09357 3rd Qu.: 0.1931 3rd Qu.: 0.01573 3rd Qu.: 0.000
## Max. :196.66379 Max. :79.0971 Max. :41.40356 Max. :30.425
## ENSG00000178934 ENSG00000167768 ENSG00000114771 ENSG00000095970
## Min. : 0.00000 Min. : 0.0000 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 0.00000 Median : 0.0000 Median : 0.000 Median : 0.0000
## Mean : 2.97886 Mean : 2.6479 Mean : 1.733 Mean : 1.5420
## 3rd Qu.: 0.04052 3rd Qu.: 0.3158 3rd Qu.: 0.000 3rd Qu.: 0.0117
## Max. :26.20979 Max. :24.8739 Max. :19.873 Max. :18.0198
## ENSG00000267368 ENSG00000203782 ENSG00000189001 ENSG00000160883
## Min. : 0.0000 Min. : 0.00 Min. :0.000 Min. :0.00000
## 1st Qu.: 0.2323 1st Qu.: 0.00 1st Qu.:0.000 1st Qu.:0.07218
## Median : 3.9112 Median : 0.00 Median :0.000 Median :0.15182
## Mean : 3.8415 Mean : 1.25 Mean :1.088 Mean :1.04723
## 3rd Qu.: 4.8437 3rd Qu.: 0.00 3rd Qu.:0.000 3rd Qu.:0.35011
## Max. :12.8966 Max. :12.68 Max. :9.905 Max. :9.61607
## ENSG00000271367 ENSG00000284627 ENSG00000204421 ENSG00000248871
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.3512 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.3838 Median :0.0000 Median :0.0000
## Mean :0.8291 Mean :0.9975 Mean :0.6937 Mean :0.9914
## 3rd Qu.:0.1984 3rd Qu.:0.6804 3rd Qu.:0.0000 3rd Qu.:0.4922
## Max. :8.9279 Max. :6.4229 Max. :7.2011 Max. :7.5098
## ENSG00000203618 ENSG00000243466 ENSG00000105427 ENSG00000186458
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.09635 Median :0.00000
## Mean :0.6235 Mean :0.6009 Mean :0.62817 Mean :0.55108
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.34734 3rd Qu.:0.01939
## Max. :7.4821 Max. :6.4375 Max. :5.37288 Max. :6.37223
## ENSG00000281899 ENSG00000235590 ENSG00000125740 ENSG00000180818
## Min. : 0.2017 Min. :0.01404 Min. :0.00000 Min. :0.1822
## 1st Qu.: 1.2082 1st Qu.:0.34122 1st Qu.:0.04294 1st Qu.:0.5636
## Median : 2.1032 Median :0.46236 Median :0.06920 Median :2.0618
## Mean : 3.2158 Mean :0.58643 Mean :0.49287 Mean :2.7698
## 3rd Qu.: 4.0849 3rd Qu.:0.83383 3rd Qu.:0.38365 3rd Qu.:4.1641
## Max. :10.3186 Max. :1.39007 Max. :4.14229 Max. :9.1767
## ENSG00000244193 ENSG00000110848 ENSG00000081041 ENSG00000143816
## Min. :0.1126 Min. :0.00000 Min. : 0.4343 Min. :0.03081
## 1st Qu.:0.1698 1st Qu.:0.00000 1st Qu.: 1.5015 1st Qu.:0.31764
## Median :0.2284 Median :0.04472 Median : 3.0891 Median :0.41963
## Mean :1.8863 Mean :0.30633 Mean : 16.2635 Mean :0.50944
## 3rd Qu.:4.3612 3rd Qu.:0.18548 3rd Qu.: 10.9835 3rd Qu.:0.56658
## Max. :6.6017 Max. :2.60310 Max. :140.2944 Max. :1.20174
## ENSG00000136457 ENSG00000134184 ENSG00000163734 ENSG00000242288
## Min. : 0.1767 Min. :0.2122 Min. :0.05608 Min. :0.01604
## 1st Qu.: 0.4568 1st Qu.:0.2489 1st Qu.:0.07749 1st Qu.:0.20070
## Median : 0.7175 Median :0.3456 Median :0.13468 Median :0.47441
## Mean : 3.8902 Mean :3.1536 Mean :0.66453 Mean :0.66045
## 3rd Qu.: 1.3671 3rd Qu.:7.1168 3rd Qu.:0.22441 3rd Qu.:0.94741
## Max. :23.7187 Max. :8.9442 Max. :6.20963 Max. :2.08076
## ENSG00000108342 ENSG00000138135 ENSG00000107518 ENSG00000148677
## Min. : 0.0000 Min. : 0.00000 Min. :0.00000 Min. : 0.8035
## 1st Qu.: 0.1687 1st Qu.: 0.06229 1st Qu.:0.09542 1st Qu.: 3.2329
## Median : 0.3364 Median : 0.21163 Median :0.16875 Median : 4.4409
## Mean : 2.8969 Mean : 2.60864 Mean :0.27688 Mean : 28.7645
## 3rd Qu.: 0.5461 3rd Qu.: 0.33091 3rd Qu.:0.44621 3rd Qu.: 11.0251
## Max. :31.1028 Max. :29.18270 Max. :0.93217 Max. :216.5660
## ENSG00000205362 ENSG00000007908 ENSG00000255730 ENSG00000136244
## Min. : 0.0000 Min. : 0.00000 Min. : 0.00000 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.01814 1st Qu.: 0.0838
## Median : 0.4034 Median : 0.04708 Median : 0.03505 Median : 0.1816
## Mean : 13.3179 Mean : 2.81096 Mean : 1.58185 Mean : 3.7752
## 3rd Qu.: 0.6260 3rd Qu.: 0.40651 3rd Qu.: 0.05590 3rd Qu.: 0.2399
## Max. :155.6212 Max. :31.90556 Max. :18.62400 Max. :43.7147
## class
## healthy:5
## myoPain:7
##
##
##
##
Lets just make sure we have the same number in each class sample and only 2 classes say if a typo or extra space was added in making class factor vector.
genes40_DF %>% group_by(class) %>% count(class)
There are only 2 classes and same number in each class that can be inspected visually to see all healthy at top and all myoPain at bottom. Now we can start using some modeling algorithms in the caret package we installed earlier. Lets use boot strapping in this model instead of cross validation as we have to simulate randomness and more samples from independently and identically distributing with replacement. We will use a classifier for ensemble as well as linear modeling.
control_random <- trainControl(method = "boot", number = 50)
metric = "Accuracy"
intrain <- sample(1:12, .8*12)
training <- genes40_DF[intrain,]
testing <- genes40_DF[-intrain,]
training %>% group_by(class) %>% count(class)
testing %>% group_by(class) %>% count(class)
set.seed(567)
fitRF <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF <- predict(fitRF, testing)
RF_df <- data.frame(predictRF, type=testing$class)
RF_df
We can see above the model predicted all the myoPain class and only half the healthy class correctly with 2/3 of data correctly predicted.
importance <- varImp(fitRF)
importance
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ENSG00000180818 100.00
## ENSG00000235590 95.64
## ENSG00000148677 82.96
## ENSG00000163734 76.64
## ENSG00000081041 76.63
## ENSG00000125740 71.99
## ENSG00000138135 66.93
## ENSG00000255730 61.96
## ENSG00000284627 61.13
## ENSG00000108342 60.45
## ENSG00000136457 56.07
## ENSG00000107518 54.98
## ENSG00000007908 53.85
## ENSG00000205362 52.09
## ENSG00000281899 50.94
## ENSG00000134184 49.74
## ENSG00000143816 49.48
## ENSG00000267368 49.19
## ENSG00000136244 48.58
## ENSG00000178372 48.33
plot(importance)
lets compare the values in our top gene above that is ENSG00000081041.
We can see with our small sample data of 12 samples that there is one
sample that is an outlier and much higher than the rest of the
samples.
topGene <- genes40_DF[order(genes40_DF$ENSG00000081041,decreasing =T),]
topGene$ENSG00000081041
## [1] 140.2943774 15.3743775 13.3465208 10.1957661 4.1482004 3.6057617
## [7] 2.5723394 1.7667014 1.6710862 0.9928994 0.7591185 0.4343249
Most all samples are under 20 except one and its in the healthy sample set. Maybe we should use the median values as those are realistic to the true population rather than outliers that can skew the data.That could be another R markdown project to do later.
confusionMatrix(predictRF, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction healthy myoPain
## healthy 1 0
## myoPain 1 1
##
## Accuracy : 0.6667
## 95% CI : (0.0943, 0.9916)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.7407
##
## Kappa : 0.4
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.5000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.5000
## Prevalence : 0.6667
## Detection Rate : 0.3333
## Detection Prevalence : 0.3333
## Balanced Accuracy : 0.7500
##
## 'Positive' Class : healthy
##
Our model using bootstrap in random forest an ensemble classifier and regression algorithm scored pretty well at 67% accuracy. There were limited samples here but bootstrapping helps to get a natural set of samples using the Central limit Theorem and other principles from econometrics.
Lets quickly test how increasing to 100 bootstrap samples will change the metric results.
control_random <- trainControl(method = "boot", number = 100)
metric = "Accuracy"
set.seed(567)
fitRF2 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF2 <- predict(fitRF2, testing)
RF_df2 <- data.frame(predictRF2, type=testing$class)
RF_df2
importance2 <- varImp(fitRF2)
importance2
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ENSG00000235590 100.00
## ENSG00000148677 82.78
## ENSG00000107518 80.10
## ENSG00000081041 79.99
## ENSG00000136457 73.31
## ENSG00000108342 72.65
## ENSG00000163734 67.38
## ENSG00000180818 63.31
## ENSG00000284627 60.97
## ENSG00000138135 55.32
## ENSG00000205362 55.00
## ENSG00000125740 53.90
## ENSG00000242288 52.03
## ENSG00000007908 48.18
## ENSG00000267368 47.47
## ENSG00000143816 47.19
## ENSG00000178372 47.18
## ENSG00000134184 44.52
## ENSG00000136244 42.33
## ENSG00000244193 38.32
plot(importance2)
confusionMatrix(predictRF2, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction healthy myoPain
## healthy 1 0
## myoPain 1 1
##
## Accuracy : 0.6667
## 95% CI : (0.0943, 0.9916)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.7407
##
## Kappa : 0.4
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.5000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.5000
## Prevalence : 0.6667
## Detection Rate : 0.3333
## Detection Prevalence : 0.3333
## Balanced Accuracy : 0.7500
##
## 'Positive' Class : healthy
##
The results were the same using 100 bootstrapped samples vs 50. Lets try 1000 bootstrap samples and see if better.
control_random <- trainControl(method = "boot", number = 1000)
metric = "Accuracy"
set.seed(567)
fitRF3 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF3 <- predict(fitRF3, testing)
RF_df3 <- data.frame(predictRF3, type=testing$class)
RF_df3
importance3 <- varImp(fitRF3)
importance3
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ENSG00000081041 100.00
## ENSG00000180818 71.34
## ENSG00000107518 65.34
## ENSG00000163734 62.75
## ENSG00000138135 60.15
## ENSG00000007908 51.61
## ENSG00000235590 48.77
## ENSG00000136244 48.68
## ENSG00000148677 47.23
## ENSG00000125740 44.05
## ENSG00000284627 38.21
## ENSG00000108342 37.07
## ENSG00000205362 36.52
## ENSG00000281899 36.51
## ENSG00000143816 35.97
## ENSG00000178372 35.20
## ENSG00000134184 34.36
## ENSG00000136457 32.82
## ENSG00000242288 29.43
## ENSG00000255730 29.32
plot(importance3)
confusionMatrix(predictRF3, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction healthy myoPain
## healthy 2 0
## myoPain 0 1
##
## Accuracy : 1
## 95% CI : (0.2924, 1)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.2963
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6667
## Detection Rate : 0.6667
## Detection Prevalence : 0.6667
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : healthy
##
The same results in accuracy, sensitivity, and specificity with 1000 bootstrapped samples but is the sample being correctly identified changing with each run from 50, 100, and 1000 resamples? Lets see the data frames we have of predicted versus the healthy class that stays the same.
RF_df
RF_df2
RF_df3
Looks like the predicted class correctly identified the same sample in each run and misclassified the same sample in each run. The random forest runs had some warnings produced and more with the 1000 bootstrapped samples.
Lets get those overall genes from the importance of variables output and write it to csv to add the genomic data to the top genes.
DF <- data.frame(importance[1])
DF
Order in descending order.
DF$Gene <- row.names(DF)
DataTop5 <- DF[,c(2,1)]
row.names(DataTop5) <- NULL
DataTop5genes <- DataTop5[order(DataTop5$Overall, decreasing=T),]
DataTop5genes
GenesTop5 <- DataTop5genes[1:5,]
write.csv(GenesTop5, "top5GenesMyalgia.csv", row.names=F)
GenesTop5
Get the file we just wrote here and we will add the genecards.org summaries to those genes with alternate names if available to understand these top genes that change the most in myalgia chronic pain.
Thank you for viewing this file as we understand more about data science and genes’ RNA-seq data that identifies chronic pain in myalgia sufferers.
We have more to do, like get summary information to at least the top importance genes in this RF bootstrapped model, try the fold change based on median instead to weed out the outliers skewing data in our top importance gene, and try out some other models and algorithms and parameter tuning. We can use a bunch of features in the caret package and get better at analyzing and exploring our data. Plots could be useful with the package ggplot2 that is a good package to use to visualize the data as we work with it.
Note to all, that the results in the general user interface (gui) were much worse than how html knitr output the code chunks with 100% accuracy and perfect summary statistics. But in the rstudio gui, the percentages with the seed set to 567 all produced same results on 50,100, and 1000 bootstrap samples with 67% accuracy, 50% sensitivity, and 100% specificity, but when running the code in html knitr, the results showed a completely different output of 100%. There are some bugs to work out and that may be only in html and not in pdf, but will try now for pdf. The knitr wouldn’t output the file in pdf form due to the confusion matrix factors where ’the data cannot have more levels than the reference.’So, that was changed. Internet search showed that the reason the accuracy was 100% on html and pdf knitr and not in the Gui is because the r chunks have to be changed from cache=TRUE to cache=FALSE. The default is cache=TRUE. I am not sure why it showed 100% because none of my cached confusionMatrix results showed 100%. That actually didn’t work either for html but did work for pdf. The genes are also different on the html version, where here it is one ending in 81041 as most important, in html knitr version it is a different gene entirely and not one in this version’s top 5, it ends in 180818. This is not good. Not sure why that is yet. But the results cannot be reproducible if a bug in the programming interface is preventing reproducibility.
The study mentioned that TNF-a, when injected in rat mammal skeletal muscle, that the pain threshold was lower or less tolerant to mechanical or skeletal muscle pain and also that the inflammatory biomarkers for pain, TNF-a, p-P65, MyoC, and MLCK were elevated. When MEF2C was injected that the inflammatory pain biomarkers other than TNF-A were elevated but the mechanical or skeletal muscle pain decreased by increasing pain threshold or making person more tolerable to pain caused by muscle movement. When injecting Dexmedetomidine or DEX into the skeletal muscle in vivo same as other injections that all biomarkers for pain decreased but didn’t say if pain threshold was decreased or increased. They also tried a natural and obtainable alternative to DEX called maslinic acid which if you look it up is the olive skin wax byproduct of making olive oil. They saw that maslinic acid injected into the rats lowered biomarker inflammatory agents of pain except for TNF-a and that the pain threshold was raised so pain was more tolerable. Assumes the pain was gone if all causes of pain lowered. Since MEF2C affected less pain in myoskeleta muscle and lowered all the biomarkers for pain other than TNF-a related to tumor pain or other pain, they concluded that MEF2C is a biomarker for cases of trigger point myofascial pain or fibromyalgia and that DEX can be used to alleviate pain.
Tempting to find out how to get and get the MEF2C as it would be a great selling marketing tool like ‘pain patch’ transcutaneously delivered pain then relief and sell patches of it or just to put it in something only my brother in law uses that he wouldn’t share with other family members as he would be in so much skeletal muscle pain.
Upon looking up how to get it, the biomart that sells it for $300-500 USD per nanogram, says that they only sell for research purposes and not for diagnostic purposes probably for those wanting it to rule out other pain patterns and narrow it down to fibromyalgic pain. They also said that if they suspect it will not be used in research that they would cancel the order. It would be a great other project to look at MEF2C and to look at CBD oil and see if any similarities as they market CBD for chronic pain, and it could just be smart to put in MEF2C and then the CBD as a two part solutio that experiences the pain you know is there and then use the CBD laced with DEX as well to alleviate it on the spot and then keep coming back for more. It is used in animal testing, in vivo, needs testing on live subjects for comparable results and research facilities making these purchases would likely have to show licenses for research as a university with that license as an IRB or investigative reporting board for research approval and then test these pain substances on monkeys, rats, mice, dogs, cats, etc. Also, maslinic acid, seems like a good therapeutic to alleviate muscle pain and could be diagnostic in that applying it to the triggerpoint area of pain should alleviate pain or lower the tenderness to palpation or ttp score from the highest a 4 that is pain with withdrawal from pain source with light touch to at least a tolerates pain of deep touch as a 1 the lowest score.
When looking at how obtainable DEX is, it is a sedative and probably on a controlled substance list of unobtainable substances available for purchase, possibly narcotics. Unless you are a schizophrenic or bipolar patient with a prescription for it as a salivatory strip for buccal gland in mouth, or as an intravenous injection for sedatory effects involved in patients undergoing surgery. It seems that maslinic acid would be more obtainable. And the side effects of using DEX are lower blood pressure and possibly adverse affects with others using beta blockers as it works on the alpha 2 receptors such as in the kidneys and lungs, because mnemonics for numeric 2 say its those organs in 2, like kidneys and lungs. But you can help yourself to that information. Beta blockers are to enter parasympathetic state and dilate blood vessels in those with high blood pressure. They are in cardiovascular arteries and bronchioles of lungs to constrict in peaceful relaxing breathing controlled by diaphragm.
Myocytes are muslce cells found in skeletal muscles, heart muscles, and smooth muscles of organs like arrecter pilli muscles of hair follicles or gastrointestinal tract muscles involved in gut motility. Alpha 2 is probably the adrenergic affect for opposite restful state and to enter the stressful state of survival like fight, flight, or freeze and appease. I believe it did say that is what the MEF2C does. The pain goes down, it benefits heart myocytes, and although would be nice to give to brother in law for pain, it actually doesn’t cause pain as it increases level of pain and is beneficial for those having suffered heart disease or myocardial infection.
Just curious. But lets look at MEF2C in the original data and see if it shows anything of interest in fold change values.
str(foldchangeHealthyMyo)
## gropd_df [58,302 × 18] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ Gene : chr [1:58302] "ENSG00000282222" "ENSG00000282221" "ENSG00000212040" "ENSG00000110514" ...
## $ Healthy1 : num [1:58302] 0 0 0 1.47 16.22 ...
## $ Healthy2 : num [1:58302] 0 0 0 1.33 7.6 ...
## $ Healthy3 : num [1:58302] 0 0 0 1.36 16.01 ...
## $ Healthy4 : num [1:58302] 0 0 0 1.35 12.6 ...
## $ Healthy5 : num [1:58302] 0 0 0 1.71 8.86 ...
## $ myo1 : num [1:58302] 0 0 0 2.83 7.82 ...
## $ myo2 : num [1:58302] 0 0 0 1.75 10.86 ...
## $ myo3 : num [1:58302] 0 0 0 2.15 13.45 ...
## $ myo4 : num [1:58302] 0 0 0 1.98 11.74 ...
## $ myo5 : num [1:58302] 0 0 0 2.02 12.39 ...
## $ myo6 : num [1:58302] 0 0 0 1.79 11.22 ...
## $ myo7 : num [1:58302] 0.114 0 0 1.537 13.544 ...
## $ healthy_Mean : num [1:58302] 0 0 0 1.47 12.22 ...
## $ myo_Mean : num [1:58302] 0 0 0 2.33 10.32 ...
## $ healthyMedian : num [1:58302] 0 0 0 1.47 12.72 ...
## $ myoMedian : num [1:58302] 0 0 0 2.33 10.32 ...
## $ foldchangeHealthyVsMyo: num [1:58302] 0 0 0 1.581 0.844 ...
## - attr(*, "groups")= tibble [58,302 × 2] (S3: tbl_df/tbl/data.frame)
## ..$ Gene : chr [1:58302] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
## ..$ .rows: list<int> [1:58302]
## .. ..$ : int 38869
## .. ..$ : int 38873
## .. ..$ : int 49706
## .. ..$ : int 18694
## .. ..$ : int 40487
## .. ..$ : int 24716
## .. ..$ : int 53040
## .. ..$ : int 15195
## .. ..$ : int 54061
## .. ..$ : int 26056
## .. ..$ : int 13847
## .. ..$ : int 13846
## .. ..$ : int 37653
## .. ..$ : int 1644
## .. ..$ : int 7911
## .. ..$ : int 44718
## .. ..$ : int 44721
## .. ..$ : int 22888
## .. ..$ : int 22889
## .. ..$ : int 47075
## .. ..$ : int 31939
## .. ..$ : int 23289
## .. ..$ : int 30334
## .. ..$ : int 306
## .. ..$ : int 305
## .. ..$ : int 39853
## .. ..$ : int 54981
## .. ..$ : int 54980
## .. ..$ : int 21768
## .. ..$ : int 43599
## .. ..$ : int 25384
## .. ..$ : int 10355
## .. ..$ : int 37504
## .. ..$ : int 18845
## .. ..$ : int 27843
## .. ..$ : int 26169
## .. ..$ : int 46396
## .. ..$ : int 53603
## .. ..$ : int 33348
## .. ..$ : int 33346
## .. ..$ : int 26783
## .. ..$ : int 51440
## .. ..$ : int 14013
## .. ..$ : int 44132
## .. ..$ : int 44126
## .. ..$ : int 53461
## .. ..$ : int 38371
## .. ..$ : int 13815
## .. ..$ : int 54201
## .. ..$ : int 56622
## .. ..$ : int 19688
## .. ..$ : int 41503
## .. ..$ : int 8817
## .. ..$ : int 40744
## .. ..$ : int 27517
## .. ..$ : int 24683
## .. ..$ : int 39933
## .. ..$ : int 18156
## .. ..$ : int 18157
## .. ..$ : int 18149
## .. ..$ : int 28784
## .. ..$ : int 14644
## .. ..$ : int 47779
## .. ..$ : int 13035
## .. ..$ : int 13029
## .. ..$ : int 56418
## .. ..$ : int 56420
## .. ..$ : int 6590
## .. ..$ : int 16740
## .. ..$ : int 53530
## .. ..$ : int 10193
## .. ..$ : int 47006
## .. ..$ : int 14670
## .. ..$ : int 14673
## .. ..$ : int 56514
## .. ..$ : int 58051
## .. ..$ : int 23139
## .. ..$ : int 8139
## .. ..$ : int 8140
## .. ..$ : int 32639
## .. ..$ : int 32640
## .. ..$ : int 32642
## .. ..$ : int 23978
## .. ..$ : int 39228
## .. ..$ : int 34797
## .. ..$ : int 34790
## .. ..$ : int 56429
## .. ..$ : int 45456
## .. ..$ : int 38881
## .. ..$ : int 15168
## .. ..$ : int 15172
## .. ..$ : int 5029
## .. ..$ : int 11482
## .. ..$ : int 39708
## .. ..$ : int 18010
## .. ..$ : int 52838
## .. ..$ : int 31078
## .. ..$ : int 37662
## .. ..$ : int 22460
## .. .. [list output truncated]
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
The names or IDs for the genes are in Ensebml format. We can look up the Ensembl gene ID for MEF2C in genecards.org. The Ensemble ID is ENSG00000081189. Lets see the fold change for this gene.
MEF2C <- foldchangeHealthyMyo[which(foldchangeHealthyMyo$Gene=="ENSG00000081189"),]
MEF2C
MEF2C seems higher in the healthy samples and about the same but with some lower in expression value in the myofascial pain samples. The foldchange of pathology over healthy for fibromyalgia compared to healthy is 47.7 % MEF2C in myofascial or fibromyalgia pain samples compared to healthy samples. The study injected the MEF2C into healthy patients and compared to the fibromyalgia patients. They found that the pain biomarkers including TNF-a were all elevated and a higher tolerance to pain or ability to handle more pain was noticed even though it increased biomarkers for pain other than TNF-a.
Lets see if we can find the Ensembl ID for each of the given pain inflammatory biomarkers of TNF-a, p-P65, MyoC, and MLCK. TNF-a is ENSG00000232810, p65 is ENSG00000173039, MyoC is ENSG00000034971, and MLCK as MYLK new name ENSG00000065534. And lets use our MEF2C of ENSG00000081189.
geneID <- c("MEF2C","TNF-a","p65","MyoC","MLCK")
EnsembleID <- c("ENSG00000081189","ENSG00000232810","ENSG00000173039","ENSG00000034971","ENSG00000065534")
painGenes <- foldchangeHealthyMyo[foldchangeHealthyMyo$Gene %in% EnsembleID,]
painGenes$geneCardID <- geneID
painMarkers <- painGenes[,c(1,19,2:18)]
painMarkers
Indeed everything is lowered for pain inflammatory biomarkers except for TNF-a. So these samples we have analyzed must be after DEX is applied to the MEF2C applied samples of healthy and fibromyalgia patient samples in skeletal muscle.
Thanks again!