knitr::opts_chunk$set(cache = FALSE)

This markdown file for R is about myofascial pain and taken from the GSE215154 series data that found high throughput sequencing fastp data of 5 healthy samples compared to 7 samples of patients with triggerpoint myofascial pain. A search of the GSE215154 series data at NCBI gene studies will bring you to the series.

I manually downloaded each text file of the samples and the 7 text files for trigger point myofascial pain. This was the most recent study evaluating fibromyalgia patient genomic data I found a month or so ago and am now reviewing it in searching for top genes in myofascial pain or chronic pain patterns, mononucleosis, multiple sclerosis, Hodgkins disease, and Epstein-Barr infections that relate back to the common denominator at some point of life for other pathologies listed. This data science project focuses on chronic pain in those with myofascial pain patterns similar to fibromyalgia that currently does not have any diagnostic gene or associated genes or special lab testing to confirm disease. Currently only a questionnaire is available and monitoring over about 3 months while ruling out autoimmune disorders to diagnose fibromyalgia as a source of pain out of normal threshold values. This disorder was previously an ‘in your head’ disease that females got mostly and doctors would tell them to take OTC for pain and now get lifestyle in check to improve diet, sleep, stress, exercise, massage, adjustments, and meditation.

Lets read in the sample text files separately and combine them into one data frame. The sample information says that the aquisition of the data used a count of the fragments of each gene, then normalized it, and also added in the FPKM values as a final step. These are tab delimited files. When reading in the file per sample you see the gene as well as the count (of fragments then normalized) and FPKM value per gene. Internet search says the normalization is the FPKM values for the counts of fragments in RNA-seq data of transcriptional gene data using a fragment length. The gene length is multiplied by 10 billion or 10^9= 10,000,000,000, then divided by the transcripts of gene X gene length. More Information on FPKM is fragments per kilo million and TPM is transcripts per million and CPM is counts per million.

Lets load in our packages first to use later.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
healthy1 <- read.table('GSM6623971_C1.txt', header=T, sep="")
summary(healthy1)
##    gene_id             C1_FPKM             C1_count       
##  Length:58302       Min.   :0.000e+00   Min.   :     0.0  
##  Class :character   1st Qu.:0.000e+00   1st Qu.:     0.0  
##  Mode  :character   Median :0.000e+00   Median :     0.0  
##                     Mean   :5.472e+00   Mean   :   197.1  
##                     3rd Qu.:3.379e-01   3rd Qu.:    10.0  
##                     Max.   :3.058e+04   Max.   :754337.0
head(healthy1)

Read in the other 4 healthy samples as tables

healthy2 <- read.table('GSM6623972_C2.txt', header=T, sep="")
healthy3 <- read.table('GSM6623973_C3.txt', header=T, sep="")
healthy4 <- read.table('GSM6623974_C4.txt', header=T, sep="")
healthy5 <- read.table('GSM6623975_C5.txt', header=T, sep="")

Lets column bind them as they all have the same number of rows and columns, but lets remove one of the variables, the counts, and keep FPKM or fragments per kilomillion of each gene. And remove the other gene columns other than the first table.

colnames(healthy1)[2] <- 'healthy1-FPKM'
colnames(healthy2)[2] <- 'healthy2-FPKM'
colnames(healthy3)[2] <- 'healthy3-FPKM'
colnames(healthy4)[2] <- 'healthy4-FPKM'
colnames(healthy5)[2] <- 'healthy5-FPKM'
healthy5samples <- cbind(healthy1[,-3], healthy2[,-c(1,3)],healthy3[,-c(1,3)], healthy4[,-c(1,3)], healthy5[,-c(1,3)])

colnames(healthy5samples) <- c('Gene','Healthy1','Healthy2','Healthy3','Healthy4','Healthy5')
head(healthy5samples)

Lets do the same with the myofacial 7 samples.

myofascial1 <- read.table('GSM6623976_M1.txt', header=T, sep="")
myofascial2 <- read.table('GSM6623977_M2.txt', header=T, sep="")
myofascial3 <- read.table('GSM6623978_M3.txt', header=T, sep="")
myofascial4 <- read.table('GSM6623979_M4.txt', header=T, sep="")
myofascial5 <- read.table('GSM6623980_M5.txt', header=T, sep="")
myofascial6 <- read.table('GSM6623981_M6.txt', header=T, sep="")
myofascial7 <- read.table('GSM6623982_M7.txt', header=T, sep="")
colnames(myofascial1)[2] <- 'myofascial1-FPKM'
colnames(myofascial2)[2] <- 'myofascial2-FPKM'
colnames(myofascial3)[2] <- 'myofascial3-FPKM'
colnames(myofascial4)[2] <- 'myofascial4-FPKM'
colnames(myofascial5)[2] <- 'myofascial5-FPKM'
colnames(myofascial6)[2] <- 'myofascial6-FPKM'
colnames(myofascial7)[2] <- 'myofascial7-FPKM'
myofascial7samples <- cbind(myofascial1[,-3], myofascial2[,-c(1,3)],myofascial3[,-c(1,3)], myofascial4[,-c(1,3)], myofascial5[,-c(1,3)], myofascial6[,-c(1,3)], myofascial7[,-c(1,3)])

colnames(myofascial7samples) <- c('Gene','myo1','myo2','myo3','myo4','myo5','myo6','myo7')
head(myofascial7samples)

Lets column bind the healthy 5 samples to the 7 myofascial pain samples with the Gene order they all have the same number of rows of.

dataHealthy5Myo7 <- cbind(healthy5samples,myofascial7samples[,-1])
head(dataHealthy5Myo7)

Summarize the data frame just made to find the mean of each sample’s normalized FPKM values within the 58,302 genes.

summary(dataHealthy5Myo7)
##      Gene              Healthy1            Healthy2            Healthy3        
##  Length:58302       Min.   :0.000e+00   Min.   :0.000e+00   Min.   :    0.000  
##  Class :character   1st Qu.:0.000e+00   1st Qu.:0.000e+00   1st Qu.:    0.000  
##  Mode  :character   Median :0.000e+00   Median :0.000e+00   Median :    0.000  
##                     Mean   :5.472e+00   Mean   :5.188e+00   Mean   :    5.447  
##                     3rd Qu.:3.379e-01   3rd Qu.:6.688e-01   3rd Qu.:    0.464  
##                     Max.   :3.058e+04   Max.   :2.376e+04   Max.   :50031.764  
##     Healthy4            Healthy5              myo1          
##  Min.   :    0.000   Min.   :    0.000   Min.   :0.000e+00  
##  1st Qu.:    0.000   1st Qu.:    0.000   1st Qu.:0.000e+00  
##  Median :    0.000   Median :    0.000   Median :0.000e+00  
##  Mean   :    5.764   Mean   :    5.556   Mean   :4.861e+00  
##  3rd Qu.:    0.426   3rd Qu.:    0.467   3rd Qu.:7.898e-01  
##  Max.   :37775.449   Max.   :39829.283   Max.   :1.542e+04  
##       myo2                myo3                myo4          
##  Min.   :    0.000   Min.   :    0.000   Min.   :    0.000  
##  1st Qu.:    0.000   1st Qu.:    0.000   1st Qu.:    0.000  
##  Median :    0.000   Median :    0.000   Median :    0.000  
##  Mean   :    5.766   Mean   :    5.738   Mean   :    5.113  
##  3rd Qu.:    0.443   3rd Qu.:    0.309   3rd Qu.:    0.376  
##  Max.   :43180.511   Max.   :40626.886   Max.   :37108.454  
##       myo5                myo6                myo7          
##  Min.   :    0.000   Min.   :    0.000   Min.   :    0.000  
##  1st Qu.:    0.000   1st Qu.:    0.000   1st Qu.:    0.000  
##  Median :    0.000   Median :    0.000   Median :    0.000  
##  Mean   :    5.415   Mean   :    5.175   Mean   :    5.864  
##  3rd Qu.:    0.341   3rd Qu.:    0.371   3rd Qu.:    0.377  
##  Max.   :41750.951   Max.   :37369.240   Max.   :41902.286
dim(dataHealthy5Myo7)
## [1] 58302    13
write.csv(dataHealthy5Myo7, 'fragmentsPerKMill_myofascial_5healthy_7myo_58kgenes.csv',row.names=F)

Get this file we made here file

Lets get the sample mean per gene in healthy and myofascial pain samples to compare their fold change values.

healthyMeans <- dataHealthy5Myo7 %>% group_by(Gene) %>%
mutate(
healthy_Mean = mean(Healthy1:Healthy2,na.rm=T),
myo_Mean = mean(myo1:myo7)
)

colnames(healthyMeans)
##  [1] "Gene"         "Healthy1"     "Healthy2"     "Healthy3"     "Healthy4"    
##  [6] "Healthy5"     "myo1"         "myo2"         "myo3"         "myo4"        
## [11] "myo5"         "myo6"         "myo7"         "healthy_Mean" "myo_Mean"

Now get the Median of each sample group.

healthyMeansMedians <- healthyMeans %>% group_by(Gene) %>% mutate(healthyMedian=median(Healthy1:Healthy5, na.rm=T), myoMedian = median(myo1:myo7))

colnames(healthyMeansMedians)
##  [1] "Gene"          "Healthy1"      "Healthy2"      "Healthy3"     
##  [5] "Healthy4"      "Healthy5"      "myo1"          "myo2"         
##  [9] "myo3"          "myo4"          "myo5"          "myo6"         
## [13] "myo7"          "healthy_Mean"  "myo_Mean"      "healthyMedian"
## [17] "myoMedian"
head(healthyMeansMedians)

Looks like the median is the same for mean in the two samples from the first 5 entries in the data. Lets get the fold change of myofascial pain to healthy gene expression FPKM normalized values.

foldchangeHealthyMyo <- healthyMeansMedians %>% mutate(foldchangeHealthyVsMyo=(myo_Mean/(healthy_Mean+0.0000000000001)))

head(foldchangeHealthyMyo)

To avoid the NaN values of healthy 0 in denominator, I added a very very small value of one trillionth of 1. Now lets order the data by fold change from most to least and take the top 20 and bottom 20 genes and make a data frame to do some machine learning on with our 2 classes of healthy and myofascial pain.

FC_ordered <- foldchangeHealthyMyo[order(foldchangeHealthyMyo$foldchangeHealthyVsMyo, decreasing=T),]

head(FC_ordered)
tail(FC_ordered)

There may be a very vast number of genes with a zero value, so the bottom 20 won’t be revealing anything for the gene expression down regulation in myofascial pain.

sum(FC_ordered$foldchangeHealthyVsMyo==0)
## [1] 32916

There are 32,916 genes equal to zero in fold change. Since these values are normalized fragments per kilomillion, we will have to look at the bottom values above 0 that are lowest.

sum(FC_ordered$foldchangeHealthyVsMyo < 0)
## [1] 0

There are not any negative values. but more than 32,000 genes with 0 fold change, won’t give much information if ordering in decreasing order and last 32,000 genes have no information for fold change.

minFC <- subset(FC_ordered, FC_ordered$foldchangeHealthyVsMyo > 0)

tail(minFC)

We removed those 32,000 genes that had no information to add in fold change values for this RNA-seq data, and still have 25,386 genes left. We can now take the bottom 20 and top 20 to make a data frame to use in training a model to predict the best in our 2 classes of healthy or myofascial pain.

minMax20_FC <- minFC[c(1:20,25367:25386),]
dim(minMax20_FC)
## [1] 40 18

That gives us 40 genes to work with. Lets see those genes.These are Ensemble genes. They may or may not have alternate Genecard.org gene names but should have a summary on Genecards.org that we can read all about once we get our top genes.

minMax20_FC$Gene
##  [1] "ENSG00000267727" "ENSG00000178372" "ENSG00000172867" "ENSG00000188508"
##  [5] "ENSG00000178934" "ENSG00000167768" "ENSG00000114771" "ENSG00000095970"
##  [9] "ENSG00000267368" "ENSG00000203782" "ENSG00000189001" "ENSG00000160883"
## [13] "ENSG00000271367" "ENSG00000284627" "ENSG00000204421" "ENSG00000248871"
## [17] "ENSG00000203618" "ENSG00000243466" "ENSG00000105427" "ENSG00000186458"
## [21] "ENSG00000281899" "ENSG00000235590" "ENSG00000125740" "ENSG00000180818"
## [25] "ENSG00000244193" "ENSG00000110848" "ENSG00000081041" "ENSG00000143816"
## [29] "ENSG00000136457" "ENSG00000134184" "ENSG00000163734" "ENSG00000242288"
## [33] "ENSG00000108342" "ENSG00000138135" "ENSG00000107518" "ENSG00000148677"
## [37] "ENSG00000205362" "ENSG00000007908" "ENSG00000255730" "ENSG00000136244"
write.csv(minMax20_FC, 'min20max20geneFoldchangeMyofascialPain.csv', row.names=F)

Get the file we just wrote here

colnames(minMax20_FC)
##  [1] "Gene"                   "Healthy1"               "Healthy2"              
##  [4] "Healthy3"               "Healthy4"               "Healthy5"              
##  [7] "myo1"                   "myo2"                   "myo3"                  
## [10] "myo4"                   "myo5"                   "myo6"                  
## [13] "myo7"                   "healthy_Mean"           "myo_Mean"              
## [16] "healthyMedian"          "myoMedian"              "foldchangeHealthyVsMyo"

Make the data frame but keep gene names in a character vector to make header in new transposed data frame to get samples as rows and the 40 genes as the features and add in a class feature for the sample class where we only have 12 samples of 5 healthy and 7 myofascial.

geneNames <- minMax20_FC$Gene

genes40_DF <- data.frame(t(minMax20_FC[,c(2:13)]))


colnames(genes40_DF) <- geneNames

genes40_DF$class <- c('healthy','healthy','healthy','healthy','healthy','myoPain','myoPain','myoPain','myoPain','myoPain','myoPain','myoPain')

head(genes40_DF)

Change the class from character to factor to allow confusionMatrix function to work later in this program.

genes40_DF$class <- as.factor(genes40_DF$class)
summary(genes40_DF)
##  ENSG00000267727     ENSG00000178372   ENSG00000172867    ENSG00000188508 
##  Min.   :  0.00000   Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.000  
##  1st Qu.:  0.00000   1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.000  
##  Median :  0.00000   Median : 0.0000   Median : 0.00000   Median : 0.000  
##  Mean   : 16.50957   Mean   : 7.4553   Mean   : 3.90308   Mean   : 3.192  
##  3rd Qu.:  0.09357   3rd Qu.: 0.1931   3rd Qu.: 0.01573   3rd Qu.: 0.000  
##  Max.   :196.66379   Max.   :79.0971   Max.   :41.40356   Max.   :30.425  
##  ENSG00000178934    ENSG00000167768   ENSG00000114771  ENSG00000095970  
##  Min.   : 0.00000   Min.   : 0.0000   Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.: 0.00000   1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.: 0.0000  
##  Median : 0.00000   Median : 0.0000   Median : 0.000   Median : 0.0000  
##  Mean   : 2.97886   Mean   : 2.6479   Mean   : 1.733   Mean   : 1.5420  
##  3rd Qu.: 0.04052   3rd Qu.: 0.3158   3rd Qu.: 0.000   3rd Qu.: 0.0117  
##  Max.   :26.20979   Max.   :24.8739   Max.   :19.873   Max.   :18.0198  
##  ENSG00000267368   ENSG00000203782 ENSG00000189001 ENSG00000160883  
##  Min.   : 0.0000   Min.   : 0.00   Min.   :0.000   Min.   :0.00000  
##  1st Qu.: 0.2323   1st Qu.: 0.00   1st Qu.:0.000   1st Qu.:0.07218  
##  Median : 3.9112   Median : 0.00   Median :0.000   Median :0.15182  
##  Mean   : 3.8415   Mean   : 1.25   Mean   :1.088   Mean   :1.04723  
##  3rd Qu.: 4.8437   3rd Qu.: 0.00   3rd Qu.:0.000   3rd Qu.:0.35011  
##  Max.   :12.8966   Max.   :12.68   Max.   :9.905   Max.   :9.61607  
##  ENSG00000271367  ENSG00000284627  ENSG00000204421  ENSG00000248871 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.3512   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.3838   Median :0.0000   Median :0.0000  
##  Mean   :0.8291   Mean   :0.9975   Mean   :0.6937   Mean   :0.9914  
##  3rd Qu.:0.1984   3rd Qu.:0.6804   3rd Qu.:0.0000   3rd Qu.:0.4922  
##  Max.   :8.9279   Max.   :6.4229   Max.   :7.2011   Max.   :7.5098  
##  ENSG00000203618  ENSG00000243466  ENSG00000105427   ENSG00000186458  
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.09635   Median :0.00000  
##  Mean   :0.6235   Mean   :0.6009   Mean   :0.62817   Mean   :0.55108  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.34734   3rd Qu.:0.01939  
##  Max.   :7.4821   Max.   :6.4375   Max.   :5.37288   Max.   :6.37223  
##  ENSG00000281899   ENSG00000235590   ENSG00000125740   ENSG00000180818 
##  Min.   : 0.2017   Min.   :0.01404   Min.   :0.00000   Min.   :0.1822  
##  1st Qu.: 1.2082   1st Qu.:0.34122   1st Qu.:0.04294   1st Qu.:0.5636  
##  Median : 2.1032   Median :0.46236   Median :0.06920   Median :2.0618  
##  Mean   : 3.2158   Mean   :0.58643   Mean   :0.49287   Mean   :2.7698  
##  3rd Qu.: 4.0849   3rd Qu.:0.83383   3rd Qu.:0.38365   3rd Qu.:4.1641  
##  Max.   :10.3186   Max.   :1.39007   Max.   :4.14229   Max.   :9.1767  
##  ENSG00000244193  ENSG00000110848   ENSG00000081041    ENSG00000143816  
##  Min.   :0.1126   Min.   :0.00000   Min.   :  0.4343   Min.   :0.03081  
##  1st Qu.:0.1698   1st Qu.:0.00000   1st Qu.:  1.5015   1st Qu.:0.31764  
##  Median :0.2284   Median :0.04472   Median :  3.0891   Median :0.41963  
##  Mean   :1.8863   Mean   :0.30633   Mean   : 16.2635   Mean   :0.50944  
##  3rd Qu.:4.3612   3rd Qu.:0.18548   3rd Qu.: 10.9835   3rd Qu.:0.56658  
##  Max.   :6.6017   Max.   :2.60310   Max.   :140.2944   Max.   :1.20174  
##  ENSG00000136457   ENSG00000134184  ENSG00000163734   ENSG00000242288  
##  Min.   : 0.1767   Min.   :0.2122   Min.   :0.05608   Min.   :0.01604  
##  1st Qu.: 0.4568   1st Qu.:0.2489   1st Qu.:0.07749   1st Qu.:0.20070  
##  Median : 0.7175   Median :0.3456   Median :0.13468   Median :0.47441  
##  Mean   : 3.8902   Mean   :3.1536   Mean   :0.66453   Mean   :0.66045  
##  3rd Qu.: 1.3671   3rd Qu.:7.1168   3rd Qu.:0.22441   3rd Qu.:0.94741  
##  Max.   :23.7187   Max.   :8.9442   Max.   :6.20963   Max.   :2.08076  
##  ENSG00000108342   ENSG00000138135    ENSG00000107518   ENSG00000148677   
##  Min.   : 0.0000   Min.   : 0.00000   Min.   :0.00000   Min.   :  0.8035  
##  1st Qu.: 0.1687   1st Qu.: 0.06229   1st Qu.:0.09542   1st Qu.:  3.2329  
##  Median : 0.3364   Median : 0.21163   Median :0.16875   Median :  4.4409  
##  Mean   : 2.8969   Mean   : 2.60864   Mean   :0.27688   Mean   : 28.7645  
##  3rd Qu.: 0.5461   3rd Qu.: 0.33091   3rd Qu.:0.44621   3rd Qu.: 11.0251  
##  Max.   :31.1028   Max.   :29.18270   Max.   :0.93217   Max.   :216.5660  
##  ENSG00000205362    ENSG00000007908    ENSG00000255730    ENSG00000136244  
##  Min.   :  0.0000   Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0000  
##  1st Qu.:  0.0000   1st Qu.: 0.00000   1st Qu.: 0.01814   1st Qu.: 0.0838  
##  Median :  0.4034   Median : 0.04708   Median : 0.03505   Median : 0.1816  
##  Mean   : 13.3179   Mean   : 2.81096   Mean   : 1.58185   Mean   : 3.7752  
##  3rd Qu.:  0.6260   3rd Qu.: 0.40651   3rd Qu.: 0.05590   3rd Qu.: 0.2399  
##  Max.   :155.6212   Max.   :31.90556   Max.   :18.62400   Max.   :43.7147  
##      class  
##  healthy:5  
##  myoPain:7  
##             
##             
##             
## 

Lets just make sure we have the same number in each class sample and only 2 classes say if a typo or extra space was added in making class factor vector.

genes40_DF %>% group_by(class) %>% count(class)

There are only 2 classes and same number in each class that can be inspected visually to see all healthy at top and all myoPain at bottom. Now we can start using some modeling algorithms in the caret package we installed earlier. Lets use boot strapping in this model instead of cross validation as we have to simulate randomness and more samples from independently and identically distributing with replacement. We will use a classifier for ensemble as well as linear modeling.

control_random <- trainControl(method = "boot", number = 50)  
metric = "Accuracy"
intrain <- sample(1:12, .8*12)
training <- genes40_DF[intrain,]
testing <- genes40_DF[-intrain,]
training %>% group_by(class) %>% count(class)
testing %>% group_by(class) %>% count(class)
set.seed(567)
fitRF <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF <- predict(fitRF, testing)
RF_df <- data.frame(predictRF, type=testing$class)
RF_df

We can see above the model predicted all the myoPain class and only half the healthy class correctly with 2/3 of data correctly predicted.

importance <- varImp(fitRF)
importance
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                 Overall
## ENSG00000180818  100.00
## ENSG00000235590   95.64
## ENSG00000148677   82.96
## ENSG00000163734   76.64
## ENSG00000081041   76.63
## ENSG00000125740   71.99
## ENSG00000138135   66.93
## ENSG00000255730   61.96
## ENSG00000284627   61.13
## ENSG00000108342   60.45
## ENSG00000136457   56.07
## ENSG00000107518   54.98
## ENSG00000007908   53.85
## ENSG00000205362   52.09
## ENSG00000281899   50.94
## ENSG00000134184   49.74
## ENSG00000143816   49.48
## ENSG00000267368   49.19
## ENSG00000136244   48.58
## ENSG00000178372   48.33
plot(importance)

lets compare the values in our top gene above that is ENSG00000081041. We can see with our small sample data of 12 samples that there is one sample that is an outlier and much higher than the rest of the samples.

topGene <- genes40_DF[order(genes40_DF$ENSG00000081041,decreasing =T),]
topGene$ENSG00000081041
##  [1] 140.2943774  15.3743775  13.3465208  10.1957661   4.1482004   3.6057617
##  [7]   2.5723394   1.7667014   1.6710862   0.9928994   0.7591185   0.4343249

Most all samples are under 20 except one and its in the healthy sample set. Maybe we should use the median values as those are realistic to the true population rather than outliers that can skew the data.That could be another R markdown project to do later.

confusionMatrix(predictRF, testing$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction healthy myoPain
##    healthy       1       0
##    myoPain       1       1
##                                           
##                Accuracy : 0.6667          
##                  95% CI : (0.0943, 0.9916)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : 0.7407          
##                                           
##                   Kappa : 0.4             
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5000          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.6667          
##          Detection Rate : 0.3333          
##    Detection Prevalence : 0.3333          
##       Balanced Accuracy : 0.7500          
##                                           
##        'Positive' Class : healthy         
## 

Our model using bootstrap in random forest an ensemble classifier and regression algorithm scored pretty well at 67% accuracy. There were limited samples here but bootstrapping helps to get a natural set of samples using the Central limit Theorem and other principles from econometrics.

Lets quickly test how increasing to 100 bootstrap samples will change the metric results.

control_random <- trainControl(method = "boot", number = 100)  
metric = "Accuracy"

set.seed(567)

fitRF2 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF2 <- predict(fitRF2, testing)
RF_df2 <- data.frame(predictRF2, type=testing$class)
RF_df2
importance2 <- varImp(fitRF2)
importance2
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                 Overall
## ENSG00000235590  100.00
## ENSG00000148677   82.78
## ENSG00000107518   80.10
## ENSG00000081041   79.99
## ENSG00000136457   73.31
## ENSG00000108342   72.65
## ENSG00000163734   67.38
## ENSG00000180818   63.31
## ENSG00000284627   60.97
## ENSG00000138135   55.32
## ENSG00000205362   55.00
## ENSG00000125740   53.90
## ENSG00000242288   52.03
## ENSG00000007908   48.18
## ENSG00000267368   47.47
## ENSG00000143816   47.19
## ENSG00000178372   47.18
## ENSG00000134184   44.52
## ENSG00000136244   42.33
## ENSG00000244193   38.32
plot(importance2)

confusionMatrix(predictRF2, testing$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction healthy myoPain
##    healthy       1       0
##    myoPain       1       1
##                                           
##                Accuracy : 0.6667          
##                  95% CI : (0.0943, 0.9916)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : 0.7407          
##                                           
##                   Kappa : 0.4             
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5000          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.6667          
##          Detection Rate : 0.3333          
##    Detection Prevalence : 0.3333          
##       Balanced Accuracy : 0.7500          
##                                           
##        'Positive' Class : healthy         
## 

The results were the same using 100 bootstrapped samples vs 50. Lets try 1000 bootstrap samples and see if better.

control_random <- trainControl(method = "boot", number = 1000)  
metric = "Accuracy"

set.seed(567)

fitRF3 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF3 <- predict(fitRF3, testing)
RF_df3 <- data.frame(predictRF3, type=testing$class)
RF_df3
importance3 <- varImp(fitRF3)
importance3
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                 Overall
## ENSG00000081041  100.00
## ENSG00000180818   71.34
## ENSG00000107518   65.34
## ENSG00000163734   62.75
## ENSG00000138135   60.15
## ENSG00000007908   51.61
## ENSG00000235590   48.77
## ENSG00000136244   48.68
## ENSG00000148677   47.23
## ENSG00000125740   44.05
## ENSG00000284627   38.21
## ENSG00000108342   37.07
## ENSG00000205362   36.52
## ENSG00000281899   36.51
## ENSG00000143816   35.97
## ENSG00000178372   35.20
## ENSG00000134184   34.36
## ENSG00000136457   32.82
## ENSG00000242288   29.43
## ENSG00000255730   29.32
plot(importance3)

confusionMatrix(predictRF3, testing$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction healthy myoPain
##    healthy       2       0
##    myoPain       0       1
##                                      
##                Accuracy : 1          
##                  95% CI : (0.2924, 1)
##     No Information Rate : 0.6667     
##     P-Value [Acc > NIR] : 0.2963     
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6667     
##          Detection Rate : 0.6667     
##    Detection Prevalence : 0.6667     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : healthy    
## 

The same results in accuracy, sensitivity, and specificity with 1000 bootstrapped samples but is the sample being correctly identified changing with each run from 50, 100, and 1000 resamples? Lets see the data frames we have of predicted versus the healthy class that stays the same.

RF_df
RF_df2
RF_df3

Looks like the predicted class correctly identified the same sample in each run and misclassified the same sample in each run. The random forest runs had some warnings produced and more with the 1000 bootstrapped samples.

Lets get those overall genes from the importance of variables output and write it to csv to add the genomic data to the top genes.

DF <- data.frame(importance[1])

DF

Order in descending order.

DF$Gene <- row.names(DF)
DataTop5 <- DF[,c(2,1)]
row.names(DataTop5) <- NULL
DataTop5genes <- DataTop5[order(DataTop5$Overall, decreasing=T),]
DataTop5genes
GenesTop5 <- DataTop5genes[1:5,]

write.csv(GenesTop5, "top5GenesMyalgia.csv", row.names=F)

GenesTop5

Get the file we just wrote here and we will add the genecards.org summaries to those genes with alternate names if available to understand these top genes that change the most in myalgia chronic pain.

Thank you for viewing this file as we understand more about data science and genes’ RNA-seq data that identifies chronic pain in myalgia sufferers.

We have more to do, like get summary information to at least the top importance genes in this RF bootstrapped model, try the fold change based on median instead to weed out the outliers skewing data in our top importance gene, and try out some other models and algorithms and parameter tuning. We can use a bunch of features in the caret package and get better at analyzing and exploring our data. Plots could be useful with the package ggplot2 that is a good package to use to visualize the data as we work with it.

Note to all, that the results in the general user interface (gui) were much worse than how html knitr output the code chunks with 100% accuracy and perfect summary statistics. But in the rstudio gui, the percentages with the seed set to 567 all produced same results on 50,100, and 1000 bootstrap samples with 67% accuracy, 50% sensitivity, and 100% specificity, but when running the code in html knitr, the results showed a completely different output of 100%. There are some bugs to work out and that may be only in html and not in pdf, but will try now for pdf. The knitr wouldn’t output the file in pdf form due to the confusion matrix factors where ’the data cannot have more levels than the reference.’So, that was changed. Internet search showed that the reason the accuracy was 100% on html and pdf knitr and not in the Gui is because the r chunks have to be changed from cache=TRUE to cache=FALSE. The default is cache=TRUE. I am not sure why it showed 100% because none of my cached confusionMatrix results showed 100%. That actually didn’t work either for html but did work for pdf. The genes are also different on the html version, where here it is one ending in 81041 as most important, in html knitr version it is a different gene entirely and not one in this version’s top 5, it ends in 180818. This is not good. Not sure why that is yet. But the results cannot be reproducible if a bug in the programming interface is preventing reproducibility.

The study mentioned that TNF-a, when injected in rat mammal skeletal muscle, that the pain threshold was lower or less tolerant to mechanical or skeletal muscle pain and also that the inflammatory biomarkers for pain, TNF-a, p-P65, MyoC, and MLCK were elevated. When MEF2C was injected that the inflammatory pain biomarkers other than TNF-A were elevated but the mechanical or skeletal muscle pain decreased by increasing pain threshold or making person more tolerable to pain caused by muscle movement. When injecting Dexmedetomidine or DEX into the skeletal muscle in vivo same as other injections that all biomarkers for pain decreased but didn’t say if pain threshold was decreased or increased. They also tried a natural and obtainable alternative to DEX called maslinic acid which if you look it up is the olive skin wax byproduct of making olive oil. They saw that maslinic acid injected into the rats lowered biomarker inflammatory agents of pain except for TNF-a and that the pain threshold was raised so pain was more tolerable. Assumes the pain was gone if all causes of pain lowered. Since MEF2C affected less pain in myoskeleta muscle and lowered all the biomarkers for pain other than TNF-a related to tumor pain or other pain, they concluded that MEF2C is a biomarker for cases of trigger point myofascial pain or fibromyalgia and that DEX can be used to alleviate pain.

Tempting to find out how to get and get the MEF2C as it would be a great selling marketing tool like ‘pain patch’ transcutaneously delivered pain then relief and sell patches of it or just to put it in something only my brother in law uses that he wouldn’t share with other family members as he would be in so much skeletal muscle pain.

Upon looking up how to get it, the biomart that sells it for $300-500 USD per nanogram, says that they only sell for research purposes and not for diagnostic purposes probably for those wanting it to rule out other pain patterns and narrow it down to fibromyalgic pain. They also said that if they suspect it will not be used in research that they would cancel the order. It would be a great other project to look at MEF2C and to look at CBD oil and see if any similarities as they market CBD for chronic pain, and it could just be smart to put in MEF2C and then the CBD as a two part solutio that experiences the pain you know is there and then use the CBD laced with DEX as well to alleviate it on the spot and then keep coming back for more. It is used in animal testing, in vivo, needs testing on live subjects for comparable results and research facilities making these purchases would likely have to show licenses for research as a university with that license as an IRB or investigative reporting board for research approval and then test these pain substances on monkeys, rats, mice, dogs, cats, etc. Also, maslinic acid, seems like a good therapeutic to alleviate muscle pain and could be diagnostic in that applying it to the triggerpoint area of pain should alleviate pain or lower the tenderness to palpation or ttp score from the highest a 4 that is pain with withdrawal from pain source with light touch to at least a tolerates pain of deep touch as a 1 the lowest score.

When looking at how obtainable DEX is, it is a sedative and probably on a controlled substance list of unobtainable substances available for purchase, possibly narcotics. Unless you are a schizophrenic or bipolar patient with a prescription for it as a salivatory strip for buccal gland in mouth, or as an intravenous injection for sedatory effects involved in patients undergoing surgery. It seems that maslinic acid would be more obtainable. And the side effects of using DEX are lower blood pressure and possibly adverse affects with others using beta blockers as it works on the alpha 2 receptors such as in the kidneys and lungs, because mnemonics for numeric 2 say its those organs in 2, like kidneys and lungs. But you can help yourself to that information. Beta blockers are to enter parasympathetic state and dilate blood vessels in those with high blood pressure. They are in cardiovascular arteries and bronchioles of lungs to constrict in peaceful relaxing breathing controlled by diaphragm.

Myocytes are muslce cells found in skeletal muscles, heart muscles, and smooth muscles of organs like arrecter pilli muscles of hair follicles or gastrointestinal tract muscles involved in gut motility. Alpha 2 is probably the adrenergic affect for opposite restful state and to enter the stressful state of survival like fight, flight, or freeze and appease. I believe it did say that is what the MEF2C does. The pain goes down, it benefits heart myocytes, and although would be nice to give to brother in law for pain, it actually doesn’t cause pain as it increases level of pain and is beneficial for those having suffered heart disease or myocardial infection.

Just curious. But lets look at MEF2C in the original data and see if it shows anything of interest in fold change values.

str(foldchangeHealthyMyo)
## gropd_df [58,302 × 18] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ Gene                  : chr [1:58302] "ENSG00000282222" "ENSG00000282221" "ENSG00000212040" "ENSG00000110514" ...
##  $ Healthy1              : num [1:58302] 0 0 0 1.47 16.22 ...
##  $ Healthy2              : num [1:58302] 0 0 0 1.33 7.6 ...
##  $ Healthy3              : num [1:58302] 0 0 0 1.36 16.01 ...
##  $ Healthy4              : num [1:58302] 0 0 0 1.35 12.6 ...
##  $ Healthy5              : num [1:58302] 0 0 0 1.71 8.86 ...
##  $ myo1                  : num [1:58302] 0 0 0 2.83 7.82 ...
##  $ myo2                  : num [1:58302] 0 0 0 1.75 10.86 ...
##  $ myo3                  : num [1:58302] 0 0 0 2.15 13.45 ...
##  $ myo4                  : num [1:58302] 0 0 0 1.98 11.74 ...
##  $ myo5                  : num [1:58302] 0 0 0 2.02 12.39 ...
##  $ myo6                  : num [1:58302] 0 0 0 1.79 11.22 ...
##  $ myo7                  : num [1:58302] 0.114 0 0 1.537 13.544 ...
##  $ healthy_Mean          : num [1:58302] 0 0 0 1.47 12.22 ...
##  $ myo_Mean              : num [1:58302] 0 0 0 2.33 10.32 ...
##  $ healthyMedian         : num [1:58302] 0 0 0 1.47 12.72 ...
##  $ myoMedian             : num [1:58302] 0 0 0 2.33 10.32 ...
##  $ foldchangeHealthyVsMyo: num [1:58302] 0 0 0 1.581 0.844 ...
##  - attr(*, "groups")= tibble [58,302 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ Gene : chr [1:58302] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ...
##   ..$ .rows: list<int> [1:58302] 
##   .. ..$ : int 38869
##   .. ..$ : int 38873
##   .. ..$ : int 49706
##   .. ..$ : int 18694
##   .. ..$ : int 40487
##   .. ..$ : int 24716
##   .. ..$ : int 53040
##   .. ..$ : int 15195
##   .. ..$ : int 54061
##   .. ..$ : int 26056
##   .. ..$ : int 13847
##   .. ..$ : int 13846
##   .. ..$ : int 37653
##   .. ..$ : int 1644
##   .. ..$ : int 7911
##   .. ..$ : int 44718
##   .. ..$ : int 44721
##   .. ..$ : int 22888
##   .. ..$ : int 22889
##   .. ..$ : int 47075
##   .. ..$ : int 31939
##   .. ..$ : int 23289
##   .. ..$ : int 30334
##   .. ..$ : int 306
##   .. ..$ : int 305
##   .. ..$ : int 39853
##   .. ..$ : int 54981
##   .. ..$ : int 54980
##   .. ..$ : int 21768
##   .. ..$ : int 43599
##   .. ..$ : int 25384
##   .. ..$ : int 10355
##   .. ..$ : int 37504
##   .. ..$ : int 18845
##   .. ..$ : int 27843
##   .. ..$ : int 26169
##   .. ..$ : int 46396
##   .. ..$ : int 53603
##   .. ..$ : int 33348
##   .. ..$ : int 33346
##   .. ..$ : int 26783
##   .. ..$ : int 51440
##   .. ..$ : int 14013
##   .. ..$ : int 44132
##   .. ..$ : int 44126
##   .. ..$ : int 53461
##   .. ..$ : int 38371
##   .. ..$ : int 13815
##   .. ..$ : int 54201
##   .. ..$ : int 56622
##   .. ..$ : int 19688
##   .. ..$ : int 41503
##   .. ..$ : int 8817
##   .. ..$ : int 40744
##   .. ..$ : int 27517
##   .. ..$ : int 24683
##   .. ..$ : int 39933
##   .. ..$ : int 18156
##   .. ..$ : int 18157
##   .. ..$ : int 18149
##   .. ..$ : int 28784
##   .. ..$ : int 14644
##   .. ..$ : int 47779
##   .. ..$ : int 13035
##   .. ..$ : int 13029
##   .. ..$ : int 56418
##   .. ..$ : int 56420
##   .. ..$ : int 6590
##   .. ..$ : int 16740
##   .. ..$ : int 53530
##   .. ..$ : int 10193
##   .. ..$ : int 47006
##   .. ..$ : int 14670
##   .. ..$ : int 14673
##   .. ..$ : int 56514
##   .. ..$ : int 58051
##   .. ..$ : int 23139
##   .. ..$ : int 8139
##   .. ..$ : int 8140
##   .. ..$ : int 32639
##   .. ..$ : int 32640
##   .. ..$ : int 32642
##   .. ..$ : int 23978
##   .. ..$ : int 39228
##   .. ..$ : int 34797
##   .. ..$ : int 34790
##   .. ..$ : int 56429
##   .. ..$ : int 45456
##   .. ..$ : int 38881
##   .. ..$ : int 15168
##   .. ..$ : int 15172
##   .. ..$ : int 5029
##   .. ..$ : int 11482
##   .. ..$ : int 39708
##   .. ..$ : int 18010
##   .. ..$ : int 52838
##   .. ..$ : int 31078
##   .. ..$ : int 37662
##   .. ..$ : int 22460
##   .. .. [list output truncated]
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE

The names or IDs for the genes are in Ensebml format. We can look up the Ensembl gene ID for MEF2C in genecards.org. The Ensemble ID is ENSG00000081189. Lets see the fold change for this gene.

MEF2C <- foldchangeHealthyMyo[which(foldchangeHealthyMyo$Gene=="ENSG00000081189"),]
MEF2C

MEF2C seems higher in the healthy samples and about the same but with some lower in expression value in the myofascial pain samples. The foldchange of pathology over healthy for fibromyalgia compared to healthy is 47.7 % MEF2C in myofascial or fibromyalgia pain samples compared to healthy samples. The study injected the MEF2C into healthy patients and compared to the fibromyalgia patients. They found that the pain biomarkers including TNF-a were all elevated and a higher tolerance to pain or ability to handle more pain was noticed even though it increased biomarkers for pain other than TNF-a.

Lets see if we can find the Ensembl ID for each of the given pain inflammatory biomarkers of TNF-a, p-P65, MyoC, and MLCK. TNF-a is ENSG00000232810, p65 is ENSG00000173039, MyoC is ENSG00000034971, and MLCK as MYLK new name ENSG00000065534. And lets use our MEF2C of ENSG00000081189.

geneID <- c("MEF2C","TNF-a","p65","MyoC","MLCK")
EnsembleID <- c("ENSG00000081189","ENSG00000232810","ENSG00000173039","ENSG00000034971","ENSG00000065534")

painGenes <- foldchangeHealthyMyo[foldchangeHealthyMyo$Gene %in% EnsembleID,]

painGenes$geneCardID <- geneID

painMarkers <- painGenes[,c(1,19,2:18)]

painMarkers

Indeed everything is lowered for pain inflammatory biomarkers except for TNF-a. So these samples we have analyzed must be after DEX is applied to the MEF2C applied samples of healthy and fibromyalgia patient samples in skeletal muscle.

Thanks again!