This markdown file for R is about myofascial pain and taken from the GSE215154 series data that found high throughput sequencing fastp data of 5 healthy samples compared to 7 samples of patients with triggerpoint myofascial pain. A search of the GSE215154 series data at NCBI gene studies will bring you to the series.

I manually downloaded each text file of the samples and the 7 text files for trigger point myofascial pain. This was the most recent study evaluating fibromyalgia patient genomic data I found a month or so ago and am now reviewing it in searching for top genes in myofascial pain or chronic pain patterns, mononucleosis, multiple sclerosis, Hodgkins disease, and Epstein-Barr infections that relate back to the common denominator at some point of life for other pathologies listed. This data science project focuses on chronic pain in those with myofascial pain patterns similar to fibromyalgia that currently does not have any diagnostic gene or associated genes or special lab testing to confirm disease. Currently only a questionnaire is available and monitoring over about 3 months while ruling out autoimmune disorders to diagnose fibromyalgia as a source of pain out of normal threshold values. This disorder was previously an ‘in your head’ disease that females got mostly and doctors would tell them to take OTC for pain and now get lifestyle in check to improve diet, sleep, stress, exercise, massage, adjustments, and meditation.

Lets read in the sample text files separately and combine them into one data frame. The sample information says that the aquisition of the data used a count of the fragments of each gene, then normalized it, and also added in the FPKM values as a final step. These are tab delimited files. When reading in the file per sample you see the gene as well as the count (of fragments then normalized) and FPKM value per gene. Internet search says the normalization is the FPKM values for the counts of fragments in RNA-seq data of transcriptional gene data using a fragment length. The gene length is multiplied by 10 billion or 10^9= 10,000,000,000, then divided by the transcripts of gene X gene length. More Information on FPKM is fragments per kilo million and TPM is transcripts per million and CPM is counts per million.

Lets load in our packages first to use later.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
healthy1 <- read.table('GSM6623971_C1.txt', header=T, sep="")
summary(healthy1)
##    gene_id             C1_FPKM             C1_count       
##  Length:58302       Min.   :0.000e+00   Min.   :     0.0  
##  Class :character   1st Qu.:0.000e+00   1st Qu.:     0.0  
##  Mode  :character   Median :0.000e+00   Median :     0.0  
##                     Mean   :5.472e+00   Mean   :   197.1  
##                     3rd Qu.:3.379e-01   3rd Qu.:    10.0  
##                     Max.   :3.058e+04   Max.   :754337.0
head(healthy1)

Read in the other 4 healthy samples as tables

healthy2 <- read.table('GSM6623972_C2.txt', header=T, sep="")
healthy3 <- read.table('GSM6623973_C3.txt', header=T, sep="")
healthy4 <- read.table('GSM6623974_C4.txt', header=T, sep="")
healthy5 <- read.table('GSM6623975_C5.txt', header=T, sep="")

Lets column bind them as they all have the same number of rows and columns, but lets remove one of the variables, the counts, and keep FPKM or fragments per kilomillion of each gene. And remove the other gene columns other than the first table.

colnames(healthy1)[2] <- 'healthy1-FPKM'
colnames(healthy2)[2] <- 'healthy2-FPKM'
colnames(healthy3)[2] <- 'healthy3-FPKM'
colnames(healthy4)[2] <- 'healthy4-FPKM'
colnames(healthy5)[2] <- 'healthy5-FPKM'
healthy5samples <- cbind(healthy1[,-3], healthy2[,-c(1,3)],healthy3[,-c(1,3)], healthy4[,-c(1,3)], healthy5[,-c(1,3)])

colnames(healthy5samples) <- c('Gene','Healthy1','Healthy2','Healthy3','Healthy4','Healthy5')
head(healthy5samples)

Lets do the same with the myofacial 7 samples.

myofascial1 <- read.table('GSM6623976_M1.txt', header=T, sep="")
myofascial2 <- read.table('GSM6623977_M2.txt', header=T, sep="")
myofascial3 <- read.table('GSM6623978_M3.txt', header=T, sep="")
myofascial4 <- read.table('GSM6623979_M4.txt', header=T, sep="")
myofascial5 <- read.table('GSM6623980_M5.txt', header=T, sep="")
myofascial6 <- read.table('GSM6623981_M6.txt', header=T, sep="")
myofascial7 <- read.table('GSM6623982_M7.txt', header=T, sep="")
colnames(myofascial1)[2] <- 'myofascial1-FPKM'
colnames(myofascial2)[2] <- 'myofascial2-FPKM'
colnames(myofascial3)[2] <- 'myofascial3-FPKM'
colnames(myofascial4)[2] <- 'myofascial4-FPKM'
colnames(myofascial5)[2] <- 'myofascial5-FPKM'
colnames(myofascial6)[2] <- 'myofascial6-FPKM'
colnames(myofascial7)[2] <- 'myofascial7-FPKM'
myofascial7samples <- cbind(myofascial1[,-3], myofascial2[,-c(1,3)],myofascial3[,-c(1,3)], myofascial4[,-c(1,3)], myofascial5[,-c(1,3)], myofascial6[,-c(1,3)], myofascial7[,-c(1,3)])

colnames(myofascial7samples) <- c('Gene','myo1','myo2','myo3','myo4','myo5','myo6','myo7')
head(myofascial7samples)

Lets column bind the healthy 5 samples to the 7 myofascial pain samples with the Gene order they all have the same number of rows of.

dataHealthy5Myo7 <- cbind(healthy5samples,myofascial7samples[,-1])
head(dataHealthy5Myo7)

Summarize the data frame just made to find the mean of each sample’s normalized FPKM values within the 58,302 genes.

summary(dataHealthy5Myo7)
##      Gene              Healthy1            Healthy2            Healthy3        
##  Length:58302       Min.   :0.000e+00   Min.   :0.000e+00   Min.   :    0.000  
##  Class :character   1st Qu.:0.000e+00   1st Qu.:0.000e+00   1st Qu.:    0.000  
##  Mode  :character   Median :0.000e+00   Median :0.000e+00   Median :    0.000  
##                     Mean   :5.472e+00   Mean   :5.188e+00   Mean   :    5.447  
##                     3rd Qu.:3.379e-01   3rd Qu.:6.688e-01   3rd Qu.:    0.464  
##                     Max.   :3.058e+04   Max.   :2.376e+04   Max.   :50031.764  
##     Healthy4            Healthy5              myo1          
##  Min.   :    0.000   Min.   :    0.000   Min.   :0.000e+00  
##  1st Qu.:    0.000   1st Qu.:    0.000   1st Qu.:0.000e+00  
##  Median :    0.000   Median :    0.000   Median :0.000e+00  
##  Mean   :    5.764   Mean   :    5.556   Mean   :4.861e+00  
##  3rd Qu.:    0.426   3rd Qu.:    0.467   3rd Qu.:7.898e-01  
##  Max.   :37775.449   Max.   :39829.283   Max.   :1.542e+04  
##       myo2                myo3                myo4          
##  Min.   :    0.000   Min.   :    0.000   Min.   :    0.000  
##  1st Qu.:    0.000   1st Qu.:    0.000   1st Qu.:    0.000  
##  Median :    0.000   Median :    0.000   Median :    0.000  
##  Mean   :    5.766   Mean   :    5.738   Mean   :    5.113  
##  3rd Qu.:    0.443   3rd Qu.:    0.309   3rd Qu.:    0.376  
##  Max.   :43180.511   Max.   :40626.886   Max.   :37108.454  
##       myo5                myo6                myo7          
##  Min.   :    0.000   Min.   :    0.000   Min.   :    0.000  
##  1st Qu.:    0.000   1st Qu.:    0.000   1st Qu.:    0.000  
##  Median :    0.000   Median :    0.000   Median :    0.000  
##  Mean   :    5.415   Mean   :    5.175   Mean   :    5.864  
##  3rd Qu.:    0.341   3rd Qu.:    0.371   3rd Qu.:    0.377  
##  Max.   :41750.951   Max.   :37369.240   Max.   :41902.286
dim(dataHealthy5Myo7)
## [1] 58302    13
write.csv(dataHealthy5Myo7, 'fragmentsPerKMill_myofascial_5healthy_7myo_58kgenes.csv',row.names=F)

Get this file we made here file

Lets get the sample mean per gene in healthy and myofascial pain samples to compare their fold change values.

healthyMeans <- dataHealthy5Myo7 %>% group_by(Gene) %>%
mutate(
healthy_Mean = mean(Healthy1:Healthy2,na.rm=T),
myo_Mean = mean(myo1:myo7)
)

colnames(healthyMeans)
##  [1] "Gene"         "Healthy1"     "Healthy2"     "Healthy3"     "Healthy4"    
##  [6] "Healthy5"     "myo1"         "myo2"         "myo3"         "myo4"        
## [11] "myo5"         "myo6"         "myo7"         "healthy_Mean" "myo_Mean"

Now get the Median of each sample group.

healthyMeansMedians <- healthyMeans %>% group_by(Gene) %>% mutate(healthyMedian=median(Healthy1:Healthy5, na.rm=T), myoMedian = median(myo1:myo7))

colnames(healthyMeansMedians)
##  [1] "Gene"          "Healthy1"      "Healthy2"      "Healthy3"     
##  [5] "Healthy4"      "Healthy5"      "myo1"          "myo2"         
##  [9] "myo3"          "myo4"          "myo5"          "myo6"         
## [13] "myo7"          "healthy_Mean"  "myo_Mean"      "healthyMedian"
## [17] "myoMedian"
head(healthyMeansMedians)

Looks like the median is the same for mean in the two samples from the first 5 entries in the data. Lets get the fold change of myofascial pain to healthy gene expression FPKM normalized values.

foldchangeHealthyMyo <- healthyMeansMedians %>% mutate(foldchangeHealthyVsMyo=(myo_Mean/(healthy_Mean+0.0000000000001)))

head(foldchangeHealthyMyo)

To avoid the NaN values of healthy 0 in denominator, I added a very very small value of one trillionth of 1. Now lets order the data by fold change from most to least and take the top 20 and bottom 20 genes and make a data frame to do some machine learning on with our 2 classes of healthy and myofascial pain.

FC_ordered <- foldchangeHealthyMyo[order(foldchangeHealthyMyo$foldchangeHealthyVsMyo, decreasing=T),]

head(FC_ordered)
tail(FC_ordered)

There may be a very vast number of genes with a zero value, so the bottom 20 won’t be revealing anything for the gene expression down regulation in myofascial pain.

sum(FC_ordered$foldchangeHealthyVsMyo==0)
## [1] 32916

There are 32,916 genes equal to zero in fold change. Since these values are normalized fragments per kilomillion, we will have to look at the bottom values above 0 that are lowest.

sum(FC_ordered$foldchangeHealthyVsMyo < 0)
## [1] 0

There are not any negative values. but more than 32,000 genes with 0 fold change, won’t give much information if ordering in decreasing order and last 32,000 genes have no information for fold change.

minFC <- subset(FC_ordered, FC_ordered$foldchangeHealthyVsMyo > 0)

tail(minFC)

We removed those 32,000 genes that had no information to add in fold change values for this RNA-seq data, and still have 25,386 genes left. We can now take the bottom 20 and top 20 to make a data frame to use in training a model to predict the best in our 2 classes of healthy or myofascial pain.

minMax20_FC <- minFC[c(1:20,25367:25386),]
dim(minMax20_FC)
## [1] 40 18

That gives us 40 genes to work with. Lets see those genes.These are Ensemble genes. They may or may not have alternate Genecard.org gene names but should have a summary on Genecards.org that we can read all about once we get our top genes.

minMax20_FC$Gene
##  [1] "ENSG00000267727" "ENSG00000178372" "ENSG00000172867" "ENSG00000188508"
##  [5] "ENSG00000178934" "ENSG00000167768" "ENSG00000114771" "ENSG00000095970"
##  [9] "ENSG00000267368" "ENSG00000203782" "ENSG00000189001" "ENSG00000160883"
## [13] "ENSG00000271367" "ENSG00000284627" "ENSG00000204421" "ENSG00000248871"
## [17] "ENSG00000203618" "ENSG00000243466" "ENSG00000105427" "ENSG00000186458"
## [21] "ENSG00000281899" "ENSG00000235590" "ENSG00000125740" "ENSG00000180818"
## [25] "ENSG00000244193" "ENSG00000110848" "ENSG00000081041" "ENSG00000143816"
## [29] "ENSG00000136457" "ENSG00000134184" "ENSG00000163734" "ENSG00000242288"
## [33] "ENSG00000108342" "ENSG00000138135" "ENSG00000107518" "ENSG00000148677"
## [37] "ENSG00000205362" "ENSG00000007908" "ENSG00000255730" "ENSG00000136244"
write.csv(minMax20_FC, 'min20max20geneFoldchangeMyofascialPain.csv', row.names=F)

Get the file we just wrote here

colnames(minMax20_FC)
##  [1] "Gene"                   "Healthy1"               "Healthy2"              
##  [4] "Healthy3"               "Healthy4"               "Healthy5"              
##  [7] "myo1"                   "myo2"                   "myo3"                  
## [10] "myo4"                   "myo5"                   "myo6"                  
## [13] "myo7"                   "healthy_Mean"           "myo_Mean"              
## [16] "healthyMedian"          "myoMedian"              "foldchangeHealthyVsMyo"

Make the data frame but keep gene names in a character vector to make header in new transposed data frame to get samples as rows and the 40 genes as the features and add in a class feature for the sample class where we only have 12 samples of 5 healthy and 7 myofascial.

geneNames <- minMax20_FC$Gene

genes40_DF <- data.frame(t(minMax20_FC[,c(2:13)]))


colnames(genes40_DF) <- geneNames

genes40_DF$class <- c('healthy','healthy','healthy','healthy','healthy','myoPain','myoPain','myoPain','myoPain','myoPain','myoPain','myoPain')

head(genes40_DF)

Change the class from character to factor to allow confusionMatrix function to work later in this program.

genes40_DF$class <- as.factor(genes40_DF$class)
summary(genes40_DF)
##  ENSG00000267727     ENSG00000178372   ENSG00000172867    ENSG00000188508 
##  Min.   :  0.00000   Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.000  
##  1st Qu.:  0.00000   1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.000  
##  Median :  0.00000   Median : 0.0000   Median : 0.00000   Median : 0.000  
##  Mean   : 16.50957   Mean   : 7.4553   Mean   : 3.90308   Mean   : 3.192  
##  3rd Qu.:  0.09357   3rd Qu.: 0.1931   3rd Qu.: 0.01573   3rd Qu.: 0.000  
##  Max.   :196.66379   Max.   :79.0971   Max.   :41.40356   Max.   :30.425  
##  ENSG00000178934    ENSG00000167768   ENSG00000114771  ENSG00000095970  
##  Min.   : 0.00000   Min.   : 0.0000   Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.: 0.00000   1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.: 0.0000  
##  Median : 0.00000   Median : 0.0000   Median : 0.000   Median : 0.0000  
##  Mean   : 2.97886   Mean   : 2.6479   Mean   : 1.733   Mean   : 1.5420  
##  3rd Qu.: 0.04052   3rd Qu.: 0.3158   3rd Qu.: 0.000   3rd Qu.: 0.0117  
##  Max.   :26.20979   Max.   :24.8739   Max.   :19.873   Max.   :18.0198  
##  ENSG00000267368   ENSG00000203782 ENSG00000189001 ENSG00000160883  
##  Min.   : 0.0000   Min.   : 0.00   Min.   :0.000   Min.   :0.00000  
##  1st Qu.: 0.2323   1st Qu.: 0.00   1st Qu.:0.000   1st Qu.:0.07218  
##  Median : 3.9112   Median : 0.00   Median :0.000   Median :0.15182  
##  Mean   : 3.8415   Mean   : 1.25   Mean   :1.088   Mean   :1.04723  
##  3rd Qu.: 4.8437   3rd Qu.: 0.00   3rd Qu.:0.000   3rd Qu.:0.35011  
##  Max.   :12.8966   Max.   :12.68   Max.   :9.905   Max.   :9.61607  
##  ENSG00000271367  ENSG00000284627  ENSG00000204421  ENSG00000248871 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.3512   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.3838   Median :0.0000   Median :0.0000  
##  Mean   :0.8291   Mean   :0.9975   Mean   :0.6937   Mean   :0.9914  
##  3rd Qu.:0.1984   3rd Qu.:0.6804   3rd Qu.:0.0000   3rd Qu.:0.4922  
##  Max.   :8.9279   Max.   :6.4229   Max.   :7.2011   Max.   :7.5098  
##  ENSG00000203618  ENSG00000243466  ENSG00000105427   ENSG00000186458  
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.09635   Median :0.00000  
##  Mean   :0.6235   Mean   :0.6009   Mean   :0.62817   Mean   :0.55108  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.34734   3rd Qu.:0.01939  
##  Max.   :7.4821   Max.   :6.4375   Max.   :5.37288   Max.   :6.37223  
##  ENSG00000281899   ENSG00000235590   ENSG00000125740   ENSG00000180818 
##  Min.   : 0.2017   Min.   :0.01404   Min.   :0.00000   Min.   :0.1822  
##  1st Qu.: 1.2082   1st Qu.:0.34122   1st Qu.:0.04294   1st Qu.:0.5636  
##  Median : 2.1032   Median :0.46236   Median :0.06920   Median :2.0618  
##  Mean   : 3.2158   Mean   :0.58643   Mean   :0.49287   Mean   :2.7698  
##  3rd Qu.: 4.0849   3rd Qu.:0.83383   3rd Qu.:0.38365   3rd Qu.:4.1641  
##  Max.   :10.3186   Max.   :1.39007   Max.   :4.14229   Max.   :9.1767  
##  ENSG00000244193  ENSG00000110848   ENSG00000081041    ENSG00000143816  
##  Min.   :0.1126   Min.   :0.00000   Min.   :  0.4343   Min.   :0.03081  
##  1st Qu.:0.1698   1st Qu.:0.00000   1st Qu.:  1.5015   1st Qu.:0.31764  
##  Median :0.2284   Median :0.04472   Median :  3.0891   Median :0.41963  
##  Mean   :1.8863   Mean   :0.30633   Mean   : 16.2635   Mean   :0.50944  
##  3rd Qu.:4.3612   3rd Qu.:0.18548   3rd Qu.: 10.9835   3rd Qu.:0.56658  
##  Max.   :6.6017   Max.   :2.60310   Max.   :140.2944   Max.   :1.20174  
##  ENSG00000136457   ENSG00000134184  ENSG00000163734   ENSG00000242288  
##  Min.   : 0.1767   Min.   :0.2122   Min.   :0.05608   Min.   :0.01604  
##  1st Qu.: 0.4568   1st Qu.:0.2489   1st Qu.:0.07749   1st Qu.:0.20070  
##  Median : 0.7175   Median :0.3456   Median :0.13468   Median :0.47441  
##  Mean   : 3.8902   Mean   :3.1536   Mean   :0.66453   Mean   :0.66045  
##  3rd Qu.: 1.3671   3rd Qu.:7.1168   3rd Qu.:0.22441   3rd Qu.:0.94741  
##  Max.   :23.7187   Max.   :8.9442   Max.   :6.20963   Max.   :2.08076  
##  ENSG00000108342   ENSG00000138135    ENSG00000107518   ENSG00000148677   
##  Min.   : 0.0000   Min.   : 0.00000   Min.   :0.00000   Min.   :  0.8035  
##  1st Qu.: 0.1687   1st Qu.: 0.06229   1st Qu.:0.09542   1st Qu.:  3.2329  
##  Median : 0.3364   Median : 0.21163   Median :0.16875   Median :  4.4409  
##  Mean   : 2.8969   Mean   : 2.60864   Mean   :0.27688   Mean   : 28.7645  
##  3rd Qu.: 0.5461   3rd Qu.: 0.33091   3rd Qu.:0.44621   3rd Qu.: 11.0251  
##  Max.   :31.1028   Max.   :29.18270   Max.   :0.93217   Max.   :216.5660  
##  ENSG00000205362    ENSG00000007908    ENSG00000255730    ENSG00000136244  
##  Min.   :  0.0000   Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0000  
##  1st Qu.:  0.0000   1st Qu.: 0.00000   1st Qu.: 0.01814   1st Qu.: 0.0838  
##  Median :  0.4034   Median : 0.04708   Median : 0.03505   Median : 0.1816  
##  Mean   : 13.3179   Mean   : 2.81096   Mean   : 1.58185   Mean   : 3.7752  
##  3rd Qu.:  0.6260   3rd Qu.: 0.40651   3rd Qu.: 0.05590   3rd Qu.: 0.2399  
##  Max.   :155.6212   Max.   :31.90556   Max.   :18.62400   Max.   :43.7147  
##      class  
##  healthy:5  
##  myoPain:7  
##             
##             
##             
## 

Lets just make sure we have the same number in each class sample and only 2 classes say if a typo or extra space was added in making class factor vector.

genes40_DF %>% group_by(class) %>% count(class)

There are only 2 classes and same number in each class that can be inspected visually to see all healthy at top and all myoPain at bottom. Now we can start using some modeling algorithms in the caret package we installed earlier. Lets use boot strapping in this model instead of cross validation as we have to simulate randomness and more samples from independently and identically distributing with replacement. We will use a classifier for ensemble as well as linear modeling.

control_random <- trainControl(method = "boot", number = 50)  
metric = "Accuracy"
intrain <- sample(1:12, .8*12)
training <- genes40_DF[intrain,]
testing <- genes40_DF[-intrain,]
training %>% group_by(class) %>% count(class)
testing %>% group_by(class) %>% count(class)
set.seed(567)
fitRF <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF <- predict(fitRF, testing)
RF_df <- data.frame(predictRF, type=testing$class)
RF_df

We can see above the model predicted all the myoPain class and only half the healthy class correctly with 2/3 of data correctly predicted.

importance <- varImp(fitRF)
importance
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                 Overall
## ENSG00000081041  100.00
## ENSG00000163734   85.69
## ENSG00000148677   81.13
## ENSG00000235590   77.80
## ENSG00000138135   67.40
## ENSG00000108342   67.19
## ENSG00000107518   64.12
## ENSG00000205362   62.33
## ENSG00000143816   59.11
## ENSG00000180818   55.96
## ENSG00000255730   50.72
## ENSG00000125740   46.38
## ENSG00000284627   44.40
## ENSG00000136457   40.03
## ENSG00000242288   38.65
## ENSG00000134184   37.20
## ENSG00000281899   36.44
## ENSG00000105427   36.16
## ENSG00000178372   35.89
## ENSG00000267368   35.38
plot(importance)

lets compare the values in our top gene above that is ENSG00000081041. We can see with our small sample data of 12 samples that there is one sample that is an outlier and much higher than the rest of the samples.

topGene <- genes40_DF[order(genes40_DF$ENSG00000081041,decreasing =T),]
topGene$ENSG00000081041
##  [1] 140.2943774  15.3743775  13.3465208  10.1957661   4.1482004   3.6057617
##  [7]   2.5723394   1.7667014   1.6710862   0.9928994   0.7591185   0.4343249

Most all samples are under 20 except one and its in the healthy sample set. Maybe we should use the median values as those are realistic to the true population rather than outliers that can skew the data.That could be another R markdown project to do later.

confusionMatrix(predictRF, testing$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction healthy myoPain
##    healthy       1       0
##    myoPain       0       2
##                                      
##                Accuracy : 1          
##                  95% CI : (0.2924, 1)
##     No Information Rate : 0.6667     
##     P-Value [Acc > NIR] : 0.2963     
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.3333     
##          Detection Rate : 0.3333     
##    Detection Prevalence : 0.3333     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : healthy    
## 

Our model using bootstrap in random forest an ensemble classifier and regression algorithm scored pretty well at 67% accuracy. There were limited samples here but bootstrapping helps to get a natural set of samples using the Central limit Theorem and other principles from econometrics.

Lets quickly test how increasing to 100 bootstrap samples will change the metric results.

control_random <- trainControl(method = "boot", number = 100)  
metric = "Accuracy"

set.seed(567)

fitRF2 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF2 <- predict(fitRF2, testing)
RF_df2 <- data.frame(predictRF2, type=testing$class)
RF_df2
importance2 <- varImp(fitRF2)
importance2
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                 Overall
## ENSG00000081041  100.00
## ENSG00000235590   99.47
## ENSG00000163734   88.73
## ENSG00000148677   77.81
## ENSG00000138135   65.94
## ENSG00000136244   63.83
## ENSG00000180818   61.24
## ENSG00000143816   59.06
## ENSG00000107518   55.11
## ENSG00000125740   54.06
## ENSG00000205362   53.45
## ENSG00000108342   52.13
## ENSG00000255730   47.54
## ENSG00000134184   47.15
## ENSG00000267368   45.92
## ENSG00000007908   40.99
## ENSG00000284627   40.33
## ENSG00000281899   38.47
## ENSG00000136457   37.83
## ENSG00000271367   36.46
plot(importance2)

confusionMatrix(predictRF2, testing$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction healthy myoPain
##    healthy       1       0
##    myoPain       0       2
##                                      
##                Accuracy : 1          
##                  95% CI : (0.2924, 1)
##     No Information Rate : 0.6667     
##     P-Value [Acc > NIR] : 0.2963     
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.3333     
##          Detection Rate : 0.3333     
##    Detection Prevalence : 0.3333     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : healthy    
## 

The results were the same using 100 bootstrapped samples vs 50. Lets try 1000 bootstrap samples and see if better.

control_random <- trainControl(method = "boot", number = 1000)  
metric = "Accuracy"

set.seed(567)

fitRF3 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF3 <- predict(fitRF3, testing)
RF_df3 <- data.frame(predictRF3, type=testing$class)
RF_df3
importance3 <- varImp(fitRF3)
importance3
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                 Overall
## ENSG00000081041  100.00
## ENSG00000235590   74.90
## ENSG00000163734   73.40
## ENSG00000136244   67.52
## ENSG00000205362   59.31
## ENSG00000136457   52.55
## ENSG00000148677   52.10
## ENSG00000180818   50.69
## ENSG00000281899   50.43
## ENSG00000138135   45.77
## ENSG00000108342   44.41
## ENSG00000125740   44.36
## ENSG00000255730   43.99
## ENSG00000107518   43.85
## ENSG00000267368   39.26
## ENSG00000143816   38.08
## ENSG00000284627   36.08
## ENSG00000134184   35.98
## ENSG00000110848   35.12
## ENSG00000178372   34.17
plot(importance3)

confusionMatrix(predictRF3, testing$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction healthy myoPain
##    healthy       1       0
##    myoPain       0       2
##                                      
##                Accuracy : 1          
##                  95% CI : (0.2924, 1)
##     No Information Rate : 0.6667     
##     P-Value [Acc > NIR] : 0.2963     
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.3333     
##          Detection Rate : 0.3333     
##    Detection Prevalence : 0.3333     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : healthy    
## 

The same results in accuracy, sensitivity, and specificity with 1000 bootstrapped samples but is the sample being correctly identified changing with each run from 50, 100, and 1000 resamples? Lets see the data frames we have of predicted versus the healthy class that stays the same.

RF_df
RF_df2
RF_df3

Looks like the predicted class correctly identified the same sample in each run and misclassified the same sample in each run. The random forest runs had some warnings produced and more with the 1000 bootstrapped samples.

Lets get those overall genes from the importance of variables output and write it to csv to add the genomic data to the top genes.

DF <- data.frame(importance[1])

DF

Order in descending order.

DF$Gene <- row.names(DF)
DataTop5 <- DF[,c(2,1)]
row.names(DataTop5) <- NULL
DataTop5genes <- DataTop5[order(DataTop5$Overall, decreasing=T),]
DataTop5genes
GenesTop5 <- DataTop5genes[1:5,]

write.csv(GenesTop5, "top5GenesMyalgia.csv", row.names=F)

GenesTop5

Get the file we just wrote here and we will add the genecards.org summaries to those genes with alternate names if available to understand these top genes that change the most in myalgia chronic pain.

Thank you for viewing this file as we understand more about data science and genes’ RNA-seq data that identifies chronic pain in myalgia sufferers.

We have more to do, like get summary information to at least the top importance genes in this RF bootstrapped model, try the fold change based on median instead to weed out the outliers skewing data in our top importance gene, and try out some other models and algorithms and parameter tuning. We can use a bunch of features in the caret package and get better at analyzing and exploring our data. Plots could be useful with the package ggplot2 that is a good package to use to visualize the data as we work with it.

Note to all, that the results in the general user interface (gui) were much worse than how html knitr output the code chunks with 100% accuracy and perfect summary statistics. But in the rstudio gui, the percentages with the seed set to 567 all produced same results on 50,100, and 1000 bootstrap samples with 67% accuracy, 50% sensitivity, and 100% specificity, but when running the code in html knitr, the results showed a completely different output of 100%. There are some bugs to work out and that may be only in html and not in pdf, but will try now for pdf. The knitr wouldn’t output the file in pdf form due to the confusion matrix factors where ’the data cannot have more levels than the reference.’So, that was changed. Internet search showed that the reason the accuracy was 100% on html and pdf knitr and not in the Gui is because the r chunks have to be changed from cache=TRUE to cache=FALSE. The default is cache=TRUE. I am not sure why it showed 100% because none of my cached confusionMatrix results showed 100%. That actually didn’t work either for html but did work for pdf. The genes are also different on the html version, where here it is one ending in 81041 as most important, in html knitr version it is a different gene entirely and not one in this version’s top 5, it ends in 180818. This is not good. Not sure why that is yet. But the results cannot be reproducible if a bug in the programming interface is preventing reproducibility.

Thanks again!