knitr::opts_chunk$set(cache = FALSE)

This markdown file for R is about myofascial pain and taken from the GSE215154 series data that found high throughput sequencing fastp data of 5 healthy samples compared to 7 samples of patients with triggerpoint myofascial pain. A search of the GSE215154 series data at NCBI gene studies will bring you to the series.

The series is published under Homo Sapiens or humans for the gene expression profiles, but the study says it was done on rats in the text description. They injected a known pain inflammatory biomarker MEF2C or myocyte enhancer factor 2 into rats and saw increased pain AKA lowered pain threshold but then injected one of those inflammatory biomarkers in healthy samples and saw TNF-a create increased pain or lowered thresholds to pain. They then administered the inflammatory biomarker inhibitor, dexmedetomidine or DEX into both samples and saw pain threshold increase and inflammatory biomarkers decreased. These samples are what was available so uncertain if these are samples before or after injection of MEF2C that causes musculoskeletal pain or after injection of DEX to inhibit pain. But there are clear differences in the samples as the genes have clear separation in levels of gene expression in healthy vs. the myofascial pain groups.

I manually downloaded each text file of the samples and the 7 text files for trigger point myofascial pain. This was the most recent study evaluating fibromyalgia patient genomic data I found a month or so ago and am now reviewing it in searching for top genes in myofascial pain or chronic pain patterns, mononucleosis, multiple sclerosis, Hodgkins disease, and Epstein-Barr infections that relate back to the common denominator at some point of life for other pathologies listed. This data science project focuses on chronic pain in those with myofascial pain patterns similar to fibromyalgia that currently does not have any diagnostic gene or associated genes or special lab testing to confirm disease. Currently only a questionnaire is available and monitoring over about 3 months while ruling out autoimmune disorders to diagnose fibromyalgia as a source of pain out of normal threshold values. This disorder was previously an ‘in your head’ disease that females got mostly and doctors would tell them to take OTC for pain and now get lifestyle in check to improve diet, sleep, stress, exercise, massage, adjustments, and meditation.

Lets read in the sample text files separately and combine them into one data frame. The sample information says that the aquisition of the data used a count of the fragments of each gene, then normalized it, and also added in the FPKM values as a final step. These are tab delimited files. When reading in the file per sample you see the gene as well as the count (of fragments then normalized) and FPKM value per gene. Internet search says the normalization is the FPKM values for the counts of fragments in RNA-seq data of transcriptional gene data using a fragment length. The gene length is multiplied by 10 billion or 10^9= 10,000,000,000, then divided by the transcripts of gene X gene length. More Information on FPKM is fragments per kilo million and TPM is transcripts per million and CPM is counts per million.

Lets load in our packages first to use later.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
healthy1 <- read.table('GSM6623971_C1.txt', header=T, sep="")
summary(healthy1)
##    gene_id             C1_FPKM             C1_count       
##  Length:58302       Min.   :0.000e+00   Min.   :     0.0  
##  Class :character   1st Qu.:0.000e+00   1st Qu.:     0.0  
##  Mode  :character   Median :0.000e+00   Median :     0.0  
##                     Mean   :5.472e+00   Mean   :   197.1  
##                     3rd Qu.:3.379e-01   3rd Qu.:    10.0  
##                     Max.   :3.058e+04   Max.   :754337.0
head(healthy1)

Read in the other 4 healthy samples as tables

healthy2 <- read.table('GSM6623972_C2.txt', header=T, sep="")
healthy3 <- read.table('GSM6623973_C3.txt', header=T, sep="")
healthy4 <- read.table('GSM6623974_C4.txt', header=T, sep="")
healthy5 <- read.table('GSM6623975_C5.txt', header=T, sep="")

Lets column bind them as they all have the same number of rows and columns, but lets remove one of the variables, the counts, and keep FPKM or fragments per kilomillion of each gene. And remove the other gene columns other than the first table.

colnames(healthy1)[2] <- 'healthy1-FPKM'
colnames(healthy2)[2] <- 'healthy2-FPKM'
colnames(healthy3)[2] <- 'healthy3-FPKM'
colnames(healthy4)[2] <- 'healthy4-FPKM'
colnames(healthy5)[2] <- 'healthy5-FPKM'
healthy5samples <- cbind(healthy1[,-3], healthy2[,-c(1,3)],healthy3[,-c(1,3)], healthy4[,-c(1,3)], healthy5[,-c(1,3)])

colnames(healthy5samples) <- c('Gene','Healthy1','Healthy2','Healthy3','Healthy4','Healthy5')
head(healthy5samples)

Lets do the same with the myofacial 7 samples.

myofascial1 <- read.table('GSM6623976_M1.txt', header=T, sep="")
myofascial2 <- read.table('GSM6623977_M2.txt', header=T, sep="")
myofascial3 <- read.table('GSM6623978_M3.txt', header=T, sep="")
myofascial4 <- read.table('GSM6623979_M4.txt', header=T, sep="")
myofascial5 <- read.table('GSM6623980_M5.txt', header=T, sep="")
myofascial6 <- read.table('GSM6623981_M6.txt', header=T, sep="")
myofascial7 <- read.table('GSM6623982_M7.txt', header=T, sep="")
colnames(myofascial1)[2] <- 'myofascial1-FPKM'
colnames(myofascial2)[2] <- 'myofascial2-FPKM'
colnames(myofascial3)[2] <- 'myofascial3-FPKM'
colnames(myofascial4)[2] <- 'myofascial4-FPKM'
colnames(myofascial5)[2] <- 'myofascial5-FPKM'
colnames(myofascial6)[2] <- 'myofascial6-FPKM'
colnames(myofascial7)[2] <- 'myofascial7-FPKM'
myofascial7samples <- cbind(myofascial1[,-3], myofascial2[,-c(1,3)],myofascial3[,-c(1,3)], myofascial4[,-c(1,3)], myofascial5[,-c(1,3)], myofascial6[,-c(1,3)], myofascial7[,-c(1,3)])

colnames(myofascial7samples) <- c('Gene','myo1','myo2','myo3','myo4','myo5','myo6','myo7')
head(myofascial7samples)

Lets column bind the healthy 5 samples to the 7 myofascial pain samples with the Gene order they all have the same number of rows of.

dataHealthy5Myo7 <- cbind(healthy5samples,myofascial7samples[,-1])
head(dataHealthy5Myo7)

Summarize the data frame just made to find the mean of each sample’s normalized FPKM values within the 58,302 genes.

summary(dataHealthy5Myo7)
##      Gene              Healthy1            Healthy2            Healthy3        
##  Length:58302       Min.   :0.000e+00   Min.   :0.000e+00   Min.   :    0.000  
##  Class :character   1st Qu.:0.000e+00   1st Qu.:0.000e+00   1st Qu.:    0.000  
##  Mode  :character   Median :0.000e+00   Median :0.000e+00   Median :    0.000  
##                     Mean   :5.472e+00   Mean   :5.188e+00   Mean   :    5.447  
##                     3rd Qu.:3.379e-01   3rd Qu.:6.688e-01   3rd Qu.:    0.464  
##                     Max.   :3.058e+04   Max.   :2.376e+04   Max.   :50031.764  
##     Healthy4            Healthy5              myo1          
##  Min.   :    0.000   Min.   :    0.000   Min.   :0.000e+00  
##  1st Qu.:    0.000   1st Qu.:    0.000   1st Qu.:0.000e+00  
##  Median :    0.000   Median :    0.000   Median :0.000e+00  
##  Mean   :    5.764   Mean   :    5.556   Mean   :4.861e+00  
##  3rd Qu.:    0.426   3rd Qu.:    0.467   3rd Qu.:7.898e-01  
##  Max.   :37775.449   Max.   :39829.283   Max.   :1.542e+04  
##       myo2                myo3                myo4          
##  Min.   :    0.000   Min.   :    0.000   Min.   :    0.000  
##  1st Qu.:    0.000   1st Qu.:    0.000   1st Qu.:    0.000  
##  Median :    0.000   Median :    0.000   Median :    0.000  
##  Mean   :    5.766   Mean   :    5.738   Mean   :    5.113  
##  3rd Qu.:    0.443   3rd Qu.:    0.309   3rd Qu.:    0.376  
##  Max.   :43180.511   Max.   :40626.886   Max.   :37108.454  
##       myo5                myo6                myo7          
##  Min.   :    0.000   Min.   :    0.000   Min.   :    0.000  
##  1st Qu.:    0.000   1st Qu.:    0.000   1st Qu.:    0.000  
##  Median :    0.000   Median :    0.000   Median :    0.000  
##  Mean   :    5.415   Mean   :    5.175   Mean   :    5.864  
##  3rd Qu.:    0.341   3rd Qu.:    0.371   3rd Qu.:    0.377  
##  Max.   :41750.951   Max.   :37369.240   Max.   :41902.286
dim(dataHealthy5Myo7)
## [1] 58302    13
write.csv(dataHealthy5Myo7, 'fragmentsPerKMill_myofascial_5healthy_7myo_58kgenes.csv',row.names=F)

Get this file we made here file

Lets get the sample mean per gene in healthy and myofascial pain samples to compare their fold change values.

healthyMeans <- dataHealthy5Myo7 %>% group_by(Gene) %>%
mutate(
healthy_Mean = mean(Healthy1:Healthy2,na.rm=T),
myo_Mean = mean(myo1:myo7)
)

colnames(healthyMeans)
##  [1] "Gene"         "Healthy1"     "Healthy2"     "Healthy3"     "Healthy4"    
##  [6] "Healthy5"     "myo1"         "myo2"         "myo3"         "myo4"        
## [11] "myo5"         "myo6"         "myo7"         "healthy_Mean" "myo_Mean"

Now get the Median of each sample group.

healthyMeansMedians <- healthyMeans %>% group_by(Gene) %>% mutate(healthyMedian=median(Healthy1:Healthy5, na.rm=T), myoMedian = median(myo1:myo7))

colnames(healthyMeansMedians)
##  [1] "Gene"          "Healthy1"      "Healthy2"      "Healthy3"     
##  [5] "Healthy4"      "Healthy5"      "myo1"          "myo2"         
##  [9] "myo3"          "myo4"          "myo5"          "myo6"         
## [13] "myo7"          "healthy_Mean"  "myo_Mean"      "healthyMedian"
## [17] "myoMedian"
head(healthyMeansMedians)

Looks like the median is the same for mean in the three samples from the first 5 entries in the data.

*** Means of Healthy Vs Myofascial Pain Data Frame Top 40 Genes ***

Lets get the fold change of myofascial pain to healthy gene expression FPKM normalized values.

foldchangeHealthyMyo <- healthyMeansMedians %>% mutate(foldchangeHealthyVsMyo=(myo_Mean/(healthy_Mean+0.0000000000001)))

head(foldchangeHealthyMyo)

To avoid the NaN values of healthy 0 in denominator, I added a very very small value of one trillionth of 1. Now lets order the data by fold change from most to least and take the top 20 and bottom 20 genes and make a data frame to do some machine learning on with our 2 classes of healthy and myofascial pain.

I want to add in a Median fold change column as well.

foldchangeHealthyMyoMedians <- foldchangeHealthyMyo %>% mutate(foldchangeHealthyVsMyoMedians=(myoMedian/(healthyMedian+0.0000000000001)))

head(foldchangeHealthyMyoMedians)
FC_ordered <- foldchangeHealthyMyo[order(foldchangeHealthyMyo$foldchangeHealthyVsMyo, decreasing=T),]

write.csv(FC_ordered, 'meansFC_healthyVsMyoAllGenes.csv', row.names=F)

Get the file of all genes ordered greatest to least in fold change by means of healthy versus the myofascial pain sample means here.

head(FC_ordered)
tail(FC_ordered)

There may be a very vast number of genes with a zero value, so the bottom 20 won’t be revealing anything for the gene expression down regulation in myofascial pain.

sum(FC_ordered$foldchangeHealthyVsMyo==0)
## [1] 32916

There are 32,916 genes equal to zero in fold change. Since these values are normalized fragments per kilomillion, we will have to look at the bottom values above 0 that are lowest.

sum(FC_ordered$foldchangeHealthyVsMyo < 0)
## [1] 0

There are not any negative values. but more than 32,000 genes with 0 fold change, won’t give much information if ordering in decreasing order and last 32,000 genes have no information for fold change.

minFC <- subset(FC_ordered, FC_ordered$foldchangeHealthyVsMyo > 0)
write.csv(minFC, 'FC_ordered_means_zeroFCs_removed.csv', row.names=F)
tail(minFC)

We removed those 32,000 genes that had no information to add in fold change values for this RNA-seq data, and still have 25,386 genes left. We can now take the bottom 20 and top 20 to make a data frame to use in training a model to predict the best in our 2 classes of healthy or myofascial pain. You can get the means table of all genes with values in fold change by means of samples healthy versus myofascial pain here.

minMax20_FC <- minFC[c(1:20,25367:25386),]
dim(minMax20_FC)
## [1] 40 18

That gives us 40 genes to work with. Lets see those genes.These are Ensemble genes. They may or may not have alternate Genecard.org gene names but should have a summary on Genecards.org that we can read all about once we get our top genes.

minMax20_FC$Gene
##  [1] "ENSG00000267727" "ENSG00000178372" "ENSG00000172867" "ENSG00000188508"
##  [5] "ENSG00000178934" "ENSG00000167768" "ENSG00000114771" "ENSG00000095970"
##  [9] "ENSG00000267368" "ENSG00000203782" "ENSG00000189001" "ENSG00000160883"
## [13] "ENSG00000271367" "ENSG00000284627" "ENSG00000204421" "ENSG00000248871"
## [17] "ENSG00000203618" "ENSG00000243466" "ENSG00000105427" "ENSG00000186458"
## [21] "ENSG00000281899" "ENSG00000235590" "ENSG00000125740" "ENSG00000180818"
## [25] "ENSG00000244193" "ENSG00000110848" "ENSG00000081041" "ENSG00000143816"
## [29] "ENSG00000136457" "ENSG00000134184" "ENSG00000163734" "ENSG00000242288"
## [33] "ENSG00000108342" "ENSG00000138135" "ENSG00000107518" "ENSG00000148677"
## [37] "ENSG00000205362" "ENSG00000007908" "ENSG00000255730" "ENSG00000136244"
write.csv(minMax20_FC, 'min20max20geneFoldchangeMyofascialPain.csv', row.names=F)

Get the file we just wrote here

colnames(minMax20_FC)
##  [1] "Gene"                   "Healthy1"               "Healthy2"              
##  [4] "Healthy3"               "Healthy4"               "Healthy5"              
##  [7] "myo1"                   "myo2"                   "myo3"                  
## [10] "myo4"                   "myo5"                   "myo6"                  
## [13] "myo7"                   "healthy_Mean"           "myo_Mean"              
## [16] "healthyMedian"          "myoMedian"              "foldchangeHealthyVsMyo"

Make the data frame but keep gene names in a character vector to make header in new transposed data frame to get samples as rows and the 40 genes as the features and add in a class feature for the sample class where we only have 12 samples of 5 healthy and 7 myofascial.

geneNames <- minMax20_FC$Gene

genes40_DF <- data.frame(t(minMax20_FC[,c(2:13)]))


colnames(genes40_DF) <- geneNames

genes40_DF$class <- c('healthy','healthy','healthy','healthy','healthy','myoPain','myoPain','myoPain','myoPain','myoPain','myoPain','myoPain')

head(genes40_DF)

Change the class from character to factor to allow confusionMatrix function to work later in this program.

genes40_DF$class <- as.factor(genes40_DF$class)
summary(genes40_DF)
##  ENSG00000267727     ENSG00000178372   ENSG00000172867    ENSG00000188508 
##  Min.   :  0.00000   Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.000  
##  1st Qu.:  0.00000   1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.: 0.000  
##  Median :  0.00000   Median : 0.0000   Median : 0.00000   Median : 0.000  
##  Mean   : 16.50957   Mean   : 7.4553   Mean   : 3.90308   Mean   : 3.192  
##  3rd Qu.:  0.09357   3rd Qu.: 0.1931   3rd Qu.: 0.01573   3rd Qu.: 0.000  
##  Max.   :196.66379   Max.   :79.0971   Max.   :41.40356   Max.   :30.425  
##  ENSG00000178934    ENSG00000167768   ENSG00000114771  ENSG00000095970  
##  Min.   : 0.00000   Min.   : 0.0000   Min.   : 0.000   Min.   : 0.0000  
##  1st Qu.: 0.00000   1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.: 0.0000  
##  Median : 0.00000   Median : 0.0000   Median : 0.000   Median : 0.0000  
##  Mean   : 2.97886   Mean   : 2.6479   Mean   : 1.733   Mean   : 1.5420  
##  3rd Qu.: 0.04052   3rd Qu.: 0.3158   3rd Qu.: 0.000   3rd Qu.: 0.0117  
##  Max.   :26.20979   Max.   :24.8739   Max.   :19.873   Max.   :18.0198  
##  ENSG00000267368   ENSG00000203782 ENSG00000189001 ENSG00000160883  
##  Min.   : 0.0000   Min.   : 0.00   Min.   :0.000   Min.   :0.00000  
##  1st Qu.: 0.2323   1st Qu.: 0.00   1st Qu.:0.000   1st Qu.:0.07218  
##  Median : 3.9112   Median : 0.00   Median :0.000   Median :0.15182  
##  Mean   : 3.8415   Mean   : 1.25   Mean   :1.088   Mean   :1.04723  
##  3rd Qu.: 4.8437   3rd Qu.: 0.00   3rd Qu.:0.000   3rd Qu.:0.35011  
##  Max.   :12.8966   Max.   :12.68   Max.   :9.905   Max.   :9.61607  
##  ENSG00000271367  ENSG00000284627  ENSG00000204421  ENSG00000248871 
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.3512   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.3838   Median :0.0000   Median :0.0000  
##  Mean   :0.8291   Mean   :0.9975   Mean   :0.6937   Mean   :0.9914  
##  3rd Qu.:0.1984   3rd Qu.:0.6804   3rd Qu.:0.0000   3rd Qu.:0.4922  
##  Max.   :8.9279   Max.   :6.4229   Max.   :7.2011   Max.   :7.5098  
##  ENSG00000203618  ENSG00000243466  ENSG00000105427   ENSG00000186458  
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.09635   Median :0.00000  
##  Mean   :0.6235   Mean   :0.6009   Mean   :0.62817   Mean   :0.55108  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.34734   3rd Qu.:0.01939  
##  Max.   :7.4821   Max.   :6.4375   Max.   :5.37288   Max.   :6.37223  
##  ENSG00000281899   ENSG00000235590   ENSG00000125740   ENSG00000180818 
##  Min.   : 0.2017   Min.   :0.01404   Min.   :0.00000   Min.   :0.1822  
##  1st Qu.: 1.2082   1st Qu.:0.34122   1st Qu.:0.04294   1st Qu.:0.5636  
##  Median : 2.1032   Median :0.46236   Median :0.06920   Median :2.0618  
##  Mean   : 3.2158   Mean   :0.58643   Mean   :0.49287   Mean   :2.7698  
##  3rd Qu.: 4.0849   3rd Qu.:0.83383   3rd Qu.:0.38365   3rd Qu.:4.1641  
##  Max.   :10.3186   Max.   :1.39007   Max.   :4.14229   Max.   :9.1767  
##  ENSG00000244193  ENSG00000110848   ENSG00000081041    ENSG00000143816  
##  Min.   :0.1126   Min.   :0.00000   Min.   :  0.4343   Min.   :0.03081  
##  1st Qu.:0.1698   1st Qu.:0.00000   1st Qu.:  1.5015   1st Qu.:0.31764  
##  Median :0.2284   Median :0.04472   Median :  3.0891   Median :0.41963  
##  Mean   :1.8863   Mean   :0.30633   Mean   : 16.2635   Mean   :0.50944  
##  3rd Qu.:4.3612   3rd Qu.:0.18548   3rd Qu.: 10.9835   3rd Qu.:0.56658  
##  Max.   :6.6017   Max.   :2.60310   Max.   :140.2944   Max.   :1.20174  
##  ENSG00000136457   ENSG00000134184  ENSG00000163734   ENSG00000242288  
##  Min.   : 0.1767   Min.   :0.2122   Min.   :0.05608   Min.   :0.01604  
##  1st Qu.: 0.4568   1st Qu.:0.2489   1st Qu.:0.07749   1st Qu.:0.20070  
##  Median : 0.7175   Median :0.3456   Median :0.13468   Median :0.47441  
##  Mean   : 3.8902   Mean   :3.1536   Mean   :0.66453   Mean   :0.66045  
##  3rd Qu.: 1.3671   3rd Qu.:7.1168   3rd Qu.:0.22441   3rd Qu.:0.94741  
##  Max.   :23.7187   Max.   :8.9442   Max.   :6.20963   Max.   :2.08076  
##  ENSG00000108342   ENSG00000138135    ENSG00000107518   ENSG00000148677   
##  Min.   : 0.0000   Min.   : 0.00000   Min.   :0.00000   Min.   :  0.8035  
##  1st Qu.: 0.1687   1st Qu.: 0.06229   1st Qu.:0.09542   1st Qu.:  3.2329  
##  Median : 0.3364   Median : 0.21163   Median :0.16875   Median :  4.4409  
##  Mean   : 2.8969   Mean   : 2.60864   Mean   :0.27688   Mean   : 28.7645  
##  3rd Qu.: 0.5461   3rd Qu.: 0.33091   3rd Qu.:0.44621   3rd Qu.: 11.0251  
##  Max.   :31.1028   Max.   :29.18270   Max.   :0.93217   Max.   :216.5660  
##  ENSG00000205362    ENSG00000007908    ENSG00000255730    ENSG00000136244  
##  Min.   :  0.0000   Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0000  
##  1st Qu.:  0.0000   1st Qu.: 0.00000   1st Qu.: 0.01814   1st Qu.: 0.0838  
##  Median :  0.4034   Median : 0.04708   Median : 0.03505   Median : 0.1816  
##  Mean   : 13.3179   Mean   : 2.81096   Mean   : 1.58185   Mean   : 3.7752  
##  3rd Qu.:  0.6260   3rd Qu.: 0.40651   3rd Qu.: 0.05590   3rd Qu.: 0.2399  
##  Max.   :155.6212   Max.   :31.90556   Max.   :18.62400   Max.   :43.7147  
##      class  
##  healthy:5  
##  myoPain:7  
##             
##             
##             
## 

Lets just make sure we have the same number in each class sample and only 2 classes say if a typo or extra space was added in making class factor vector.

genes40_DF %>% group_by(class) %>% count(class)

There are only 2 classes and same number in each class that can be inspected visually to see all healthy at top and all myoPain at bottom.

*** Medians of Fold change samples Healthy vs Myofascial Pain Top 40 Genes ***

Lets now make the Medians fold change ordered and get top 40 of the bottom 20 and top 20 genes.

FC_ordered_Medians <- foldchangeHealthyMyoMedians[order(foldchangeHealthyMyoMedians$foldchangeHealthyVsMyoMedians, decreasing=T),]

head(FC_ordered_Medians)
tail(FC_ordered_Medians)
write.csv(FC_ordered_Medians, 'Medians_healthyVsMyo_Foldchanges.csv', row.names=F)

You can get the huge file ordered greatest to least by Median fold change of healthy vs myofascial pain samples here.

sum(FC_ordered_Medians$foldchangeHealthyVsMyoMedians==0)
## [1] 32916

There are 32,916 zeros in the fold change by median values of healthy vs. myofascial pain. Lets remove those and take the top and bottom 20 genes to get a table of 40 genes as our top genes to work with and run machine learning algorithms on.

FC_MediansNoZeros <- subset(FC_ordered_Medians[FC_ordered_Medians$foldchangeHealthyVsMyoMedians>0,])

write.csv(FC_MediansNoZeros,'FCs_Medians_NoZeros.csv', row.names=F)
dim(FC_MediansNoZeros)
## [1] 25386    19

We removed those genes with zero fold change values from the medians of samples healthy vs myofascial pain to get a data frame of 25,386 genes. You can get the table of non zero value fold change by Median genes here

Lets now get those top 20 and bottom 20 genes from the list of genes.

top20bottom20MedianGenes <- FC_MediansNoZeros[c(1:20,25367:25386),]
summary(top20bottom20MedianGenes)
##      Gene              Healthy1         Healthy2            Healthy3      
##  Length:40          Min.   : 0.000   Min.   :  0.00000   Min.   : 0.0000  
##  Class :character   1st Qu.: 0.000   1st Qu.:  0.05743   1st Qu.: 0.0000  
##  Mode  :character   Median : 0.000   Median :  0.50037   Median : 0.1177  
##                     Mean   : 2.766   Mean   :  8.30262   Mean   : 2.7103  
##                     3rd Qu.: 0.693   3rd Qu.:  2.27550   3rd Qu.: 0.3678  
##                     Max.   :72.463   Max.   :216.56605   Max.   :64.4520  
##     Healthy4           Healthy5            myo1                myo2         
##  Min.   : 0.00000   Min.   : 0.0000   Min.   :8.795e-03   Min.   : 0.00000  
##  1st Qu.: 0.00000   1st Qu.: 0.0000   1st Qu.:3.585e-02   1st Qu.: 0.04465  
##  Median : 0.08485   Median : 0.1202   Median :6.316e+00   Median : 0.33300  
##  Mean   : 1.85268   Mean   : 4.3458   Mean   :1.608e+01   Mean   : 1.54796  
##  3rd Qu.: 0.49532   3rd Qu.: 0.4465   3rd Qu.:1.984e+01   3rd Qu.: 1.08150  
##  Max.   :39.58015   Max.   :69.9603   Max.   :1.967e+02   Max.   :10.75277  
##       myo3              myo4              myo5              myo6        
##  Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median : 0.0946   Median : 0.1029   Median : 0.1156   Median : 0.0949  
##  Mean   : 1.4023   Mean   : 2.2102   Mean   : 1.2225   Mean   : 2.2560  
##  3rd Qu.: 0.3863   3rd Qu.: 0.5392   3rd Qu.: 0.5261   3rd Qu.: 0.7664  
##  Max.   :20.7971   Max.   :41.7730   Max.   :21.5034   Max.   :47.1921  
##       myo7          healthy_Mean         myo_Mean         healthyMedian     
##  Min.   : 0.0000   Min.   :  0.0000   Min.   : 0.008795   Min.   : 0.00000  
##  1st Qu.: 0.0424   1st Qu.:  0.0000   1st Qu.: 0.035852   1st Qu.: 0.00000  
##  Median : 0.3253   Median :  0.4684   Median : 4.316198   Median : 0.08199  
##  Mean   : 1.9863   Mean   :  5.4656   Mean   : 9.134233   Mean   : 3.56562  
##  3rd Qu.: 1.6686   3rd Qu.:  1.2626   3rd Qu.:10.467134   3rd Qu.: 0.77035  
##  Max.   :20.6234   Max.   :110.5132   Max.   :99.163787   Max.   :69.46271  
##    myoMedian         foldchangeHealthyVsMyo foldchangeHealthyVsMyoMedians
##  Min.   : 0.008795   Min.   :0.000e+00      Min.   :0.000e+00            
##  1st Qu.: 0.035852   1st Qu.:0.000e+00      1st Qu.:0.000e+00            
##  Median : 4.316198   Median :5.000e+00      Median :2.464e+13            
##  Mean   : 9.134233   Mean   :6.653e+13      Mean   :8.954e+13            
##  3rd Qu.:10.467134   3rd Qu.:5.938e+13      3rd Qu.:1.047e+14            
##  Max.   :99.163787   Max.   :9.916e+14      Max.   :9.916e+14
dim(top20bottom20MedianGenes)
## [1] 40 19
colnames(top20bottom20MedianGenes)
##  [1] "Gene"                          "Healthy1"                     
##  [3] "Healthy2"                      "Healthy3"                     
##  [5] "Healthy4"                      "Healthy5"                     
##  [7] "myo1"                          "myo2"                         
##  [9] "myo3"                          "myo4"                         
## [11] "myo5"                          "myo6"                         
## [13] "myo7"                          "healthy_Mean"                 
## [15] "myo_Mean"                      "healthyMedian"                
## [17] "myoMedian"                     "foldchangeHealthyVsMyo"       
## [19] "foldchangeHealthyVsMyoMedians"

Lets make the data frame of the median fold change values top and bottom 20 genes.

geneNames2 <- top20bottom20MedianGenes$Gene

genes40_DF2 <- data.frame(t(top20bottom20MedianGenes[,c(2:13)]))


colnames(genes40_DF2) <- geneNames2

genes40_DF2$class <- c('healthy','healthy','healthy','healthy','healthy','myoPain','myoPain','myoPain','myoPain','myoPain','myoPain','myoPain')

head(genes40_DF2)

Change the class from character to factor to allow confusionMatrix function to work later in this program.

genes40_DF2$class <- as.factor(genes40_DF2$class)
summary(genes40_DF2)
##  ENSG00000267727     ENSG00000178372   ENSG00000172867    ENSG00000268292
##  Min.   :  0.00000   Min.   : 0.0000   Min.   : 0.00000   Min.   : 0.00  
##  1st Qu.:  0.00000   1st Qu.: 0.0000   1st Qu.: 0.00000   1st Qu.:10.41  
##  Median :  0.00000   Median : 0.0000   Median : 0.00000   Median :17.72  
##  Mean   : 16.50957   Mean   : 7.4553   Mean   : 3.90308   Mean   :18.63  
##  3rd Qu.:  0.09357   3rd Qu.: 0.1931   3rd Qu.: 0.01573   3rd Qu.:22.45  
##  Max.   :196.66379   Max.   :79.0971   Max.   :41.40356   Max.   :47.19  
##  ENSG00000188508  ENSG00000178934    ENSG00000221932    ENSG00000167768  
##  Min.   : 0.000   Min.   : 0.00000   Min.   : 0.00000   Min.   : 0.0000  
##  1st Qu.: 0.000   1st Qu.: 0.00000   1st Qu.: 0.06286   1st Qu.: 0.0000  
##  Median : 0.000   Median : 0.00000   Median : 0.18006   Median : 0.0000  
##  Mean   : 3.192   Mean   : 2.97886   Mean   : 3.37609   Mean   : 2.6479  
##  3rd Qu.: 0.000   3rd Qu.: 0.04052   3rd Qu.: 0.89296   3rd Qu.: 0.3158  
##  Max.   :30.425   Max.   :26.20979   Max.   :33.05159   Max.   :24.8739  
##  ENSG00000088882    ENSG00000114771  ENSG00000170509   ENSG00000095970  
##  Min.   : 0.00000   Min.   : 0.000   Min.   : 0.0000   Min.   : 0.0000  
##  1st Qu.: 0.00000   1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.0000  
##  Median : 0.07849   Median : 0.000   Median : 0.1921   Median : 0.0000  
##  Mean   : 2.38692   Mean   : 1.733   Mean   : 1.8987   Mean   : 1.5420  
##  3rd Qu.: 0.24971   3rd Qu.: 0.000   3rd Qu.: 0.2760   3rd Qu.: 0.0117  
##  Max.   :26.57517   Max.   :19.873   Max.   :19.8319   Max.   :18.0198  
##  ENSG00000150594    ENSG00000215472  ENSG00000203782 ENSG00000189001
##  Min.   : 0.00000   Min.   : 0.000   Min.   : 0.00   Min.   :0.000  
##  1st Qu.: 0.01805   1st Qu.: 0.000   1st Qu.: 0.00   1st Qu.:0.000  
##  Median : 0.08601   Median : 0.000   Median : 0.00   Median :0.000  
##  Mean   : 1.57281   Mean   : 2.637   Mean   : 1.25   Mean   :1.088  
##  3rd Qu.: 0.21670   3rd Qu.: 0.000   3rd Qu.: 0.00   3rd Qu.:0.000  
##  Max.   :16.39539   Max.   :16.261   Max.   :12.68   Max.   :9.905  
##  ENSG00000261011   ENSG00000254418   ENSG00000160883   ENSG00000271367 
##  Min.   : 0.0000   Min.   : 0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.: 0.0000   1st Qu.: 0.0000   1st Qu.:0.07218   1st Qu.:0.0000  
##  Median : 0.1714   Median : 0.0000   Median :0.15182   Median :0.0000  
##  Mean   : 1.3674   Mean   : 1.1150   Mean   :1.04723   Mean   :0.8291  
##  3rd Qu.: 0.5330   3rd Qu.: 0.1554   3rd Qu.:0.35011   3rd Qu.:0.1984  
##  Max.   :11.3877   Max.   :10.7022   Max.   :9.61607   Max.   :8.9279  
##  ENSG00000223343   ENSG00000128242    ENSG00000284686   ENSG00000281899  
##  Min.   : 0.0000   Min.   :0.009605   Min.   :0.00000   Min.   : 0.2017  
##  1st Qu.: 0.0000   1st Qu.:0.047588   1st Qu.:0.01057   1st Qu.: 1.2082  
##  Median : 0.2482   Median :0.079693   Median :0.09984   Median : 2.1032  
##  Mean   : 4.2921   Mean   :0.072940   Mean   :0.23689   Mean   : 3.2158  
##  3rd Qu.: 8.9518   3rd Qu.:0.087743   3rd Qu.:0.44291   3rd Qu.: 4.0849  
##  Max.   :13.2922   Max.   :0.163985   Max.   :0.75737   Max.   :10.3186  
##  ENSG00000244193  ENSG00000250978  ENSG00000158445   ENSG00000187634  
##  Min.   :0.1126   Min.   : 3.108   Min.   :0.03338   Min.   :0.01168  
##  1st Qu.:0.1698   1st Qu.: 3.843   1st Qu.:0.32956   1st Qu.:0.03209  
##  Median :0.2284   Median : 9.829   Median :0.52626   Median :0.10846  
##  Mean   :1.8863   Mean   :26.484   Mean   :0.57346   Mean   :0.51485  
##  3rd Qu.:4.3612   3rd Qu.:45.798   3rd Qu.:0.84399   3rd Qu.:1.10092  
##  Max.   :6.6017   Max.   :72.463   Max.   :1.17963   Max.   :1.43221  
##  ENSG00000166676   ENSG00000148677    ENSG00000008056   ENSG00000182324   
##  Min.   :0.01007   Min.   :  0.8035   Min.   :0.01939   Min.   :0.008795  
##  1st Qu.:0.05938   1st Qu.:  3.2329   1st Qu.:0.18026   1st Qu.:0.060531  
##  Median :0.19850   Median :  4.4409   Median :0.28265   Median :0.127856  
##  Mean   :0.15357   Mean   : 28.7645   Mean   :0.30978   Mean   :0.172201  
##  3rd Qu.:0.21901   3rd Qu.: 11.0251   3rd Qu.:0.39815   3rd Qu.:0.217277  
##  Max.   :0.24314   Max.   :216.5660   Max.   :0.86260   Max.   :0.611513  
##  ENSG00000134184  ENSG00000180818  ENSG00000224114   ENSG00000273000  
##  Min.   :0.2122   Min.   :0.1822   Min.   : 0.0000   Min.   :0.00000  
##  1st Qu.:0.2489   1st Qu.:0.5636   1st Qu.: 0.2082   1st Qu.:0.00000  
##  Median :0.3456   Median :2.0618   Median : 0.7615   Median :0.06005  
##  Mean   :3.1536   Mean   :2.7698   Mean   : 3.6193   Mean   :0.21045  
##  3rd Qu.:7.1168   3rd Qu.:4.1641   3rd Qu.: 6.3297   3rd Qu.:0.34091  
##  Max.   :8.9442   Max.   :9.1767   Max.   :11.8816   Max.   :1.15551  
##  ENSG00000235590   ENSG00000143816   ENSG00000242288   ENSG00000107518  
##  Min.   :0.01404   Min.   :0.03081   Min.   :0.01604   Min.   :0.00000  
##  1st Qu.:0.34122   1st Qu.:0.31764   1st Qu.:0.20070   1st Qu.:0.09542  
##  Median :0.46236   Median :0.41963   Median :0.47441   Median :0.16875  
##  Mean   :0.58643   Mean   :0.50944   Mean   :0.66045   Mean   :0.27688  
##  3rd Qu.:0.83383   3rd Qu.:0.56658   3rd Qu.:0.94741   3rd Qu.:0.44621  
##  Max.   :1.39007   Max.   :1.20174   Max.   :2.08076   Max.   :0.93217  
##      class  
##  healthy:5  
##  myoPain:7  
##             
##             
##             
## 

Lets just make sure we have the same number in each class sample and only 2 classes say if a typo or extra space was added in making class factor vector.

genes40_DF2 %>% group_by(class) %>% count(class)

*** Compare the top 40 genes of fold change in Means Vs Medians of samples in Healthy Vs. Myofascial Pain ***

This is the 2nd part to the fibromyalgia markdown in R project and we added the medians to compare to the top genes by medians to weed out the outliers skewing results. But kept the means data in this file to compare with the medians results for top genes.

means40genes <- colnames(genes40_DF)[1:40]
medians40genes <- colnames(genes40_DF2)[1:40]
means40genesOrdered <- means40genes[order(means40genes, decreasing=T)]
means40genesOrdered
##  [1] "ENSG00000284627" "ENSG00000281899" "ENSG00000271367" "ENSG00000267727"
##  [5] "ENSG00000267368" "ENSG00000255730" "ENSG00000248871" "ENSG00000244193"
##  [9] "ENSG00000243466" "ENSG00000242288" "ENSG00000235590" "ENSG00000205362"
## [13] "ENSG00000204421" "ENSG00000203782" "ENSG00000203618" "ENSG00000189001"
## [17] "ENSG00000188508" "ENSG00000186458" "ENSG00000180818" "ENSG00000178934"
## [21] "ENSG00000178372" "ENSG00000172867" "ENSG00000167768" "ENSG00000163734"
## [25] "ENSG00000160883" "ENSG00000148677" "ENSG00000143816" "ENSG00000138135"
## [29] "ENSG00000136457" "ENSG00000136244" "ENSG00000134184" "ENSG00000125740"
## [33] "ENSG00000114771" "ENSG00000110848" "ENSG00000108342" "ENSG00000107518"
## [37] "ENSG00000105427" "ENSG00000095970" "ENSG00000081041" "ENSG00000007908"
medians40genesOrdered <- medians40genes[order(medians40genes, decreasing=T)]
medians40genesOrdered
##  [1] "ENSG00000284686" "ENSG00000281899" "ENSG00000273000" "ENSG00000271367"
##  [5] "ENSG00000268292" "ENSG00000267727" "ENSG00000261011" "ENSG00000254418"
##  [9] "ENSG00000250978" "ENSG00000244193" "ENSG00000242288" "ENSG00000235590"
## [13] "ENSG00000224114" "ENSG00000223343" "ENSG00000221932" "ENSG00000215472"
## [17] "ENSG00000203782" "ENSG00000189001" "ENSG00000188508" "ENSG00000187634"
## [21] "ENSG00000182324" "ENSG00000180818" "ENSG00000178934" "ENSG00000178372"
## [25] "ENSG00000172867" "ENSG00000170509" "ENSG00000167768" "ENSG00000166676"
## [29] "ENSG00000160883" "ENSG00000158445" "ENSG00000150594" "ENSG00000148677"
## [33] "ENSG00000143816" "ENSG00000134184" "ENSG00000128242" "ENSG00000114771"
## [37] "ENSG00000107518" "ENSG00000095970" "ENSG00000088882" "ENSG00000008056"
meansMediansSame <- subset(means40genes, means40genes==medians40genes)
meansMediansSame
## [1] "ENSG00000267727" "ENSG00000178372" "ENSG00000172867" "ENSG00000244193"

There are only 4 genes in the top 40 genes of the means of samples healthy vs. myofascial pain that are the same as the genes top 40 of medians of those samples. Lets get these genes in their own file data frame to get summaries of and continue to run machine learning on the means and then the medians of the top 40 genes that are actually the top 20 over expressed and top 20 under expressed genes in a fold change comparison of myofascial pain to healthy by mean of samples and then median of samples.

common4genes <- top20bottom20MedianGenes[top20bottom20MedianGenes$Gene %in% meansMediansSame,]
write.csv(common4genes, 'genes4CommonMediansMeansFCs.csv', row.names=F)
common4genes

You can get the file that we just wrote out here. I will add the alternate Genecards symbol and summaries with summaries of 3 main summaries paraphrased to generalize what the gene does or its associations in the human body.

We are going to do machine learning on the means data frame of top 40 genes with 3 runs for bootstrap with 50, 100, and 1000 samples. Then we will do the same computations on the data of median genes to see which classifies the best the realistic unskewed data (medians) or the means data that doesn’t assume proportions are incrementally the same and outliers are allowed.

*** Machine Learning on Means of Samples Fold Change top 40 genes ***

Now we can start using some modeling algorithms in the caret package we installed earlier. Lets use boot strapping in this model instead of cross validation as we have to simulate randomness and more samples from independently and identically distributing with replacement. We will use a classifier for ensemble as well as linear modeling.

control_random <- trainControl(method = "boot", number = 50)  
metric = "Accuracy"

set.seed(12345)
intrain <- sample(1:12, .8*12)
training <- genes40_DF[intrain,]
testing <- genes40_DF[-intrain,]
training %>% group_by(class) %>% count(class)
testing %>% group_by(class) %>% count(class)
set.seed(567)
fitRF <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF <- predict(fitRF, testing)
RF_df <- data.frame(predictRF, type=testing$class)
RF_df

We can see above the model predicted all the myoPain class and only half the healthy class correctly with 2/3 of data correctly predicted.

importance <- varImp(fitRF)
importance
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                 Overall
## ENSG00000138135 100.000
## ENSG00000108342  92.336
## ENSG00000180818  91.058
## ENSG00000267368  86.496
## ENSG00000081041  84.854
## ENSG00000136457  49.088
## ENSG00000235590  38.139
## ENSG00000163734  20.803
## ENSG00000007908  20.073
## ENSG00000143816  19.526
## ENSG00000148677  18.978
## ENSG00000136244  18.248
## ENSG00000134184  16.241
## ENSG00000110848  16.241
## ENSG00000255730  13.869
## ENSG00000205362  12.774
## ENSG00000281899  10.584
## ENSG00000125740  10.219
## ENSG00000244193   8.759
## ENSG00000107518   6.569
plot(importance)

lets compare the values in our top gene above that is ENSG00000081041. We can see with our small sample data of 12 samples that there is one sample that is an outlier and much higher than the rest of the samples.

topGene <- genes40_DF[order(genes40_DF$ENSG00000081041,decreasing =T),]
topGene$ENSG00000081041
##  [1] 140.2943774  15.3743775  13.3465208  10.1957661   4.1482004   3.6057617
##  [7]   2.5723394   1.7667014   1.6710862   0.9928994   0.7591185   0.4343249

Most all samples are under 20 except one and its in the healthy sample set. Maybe we should use the median values as those are realistic to the true population rather than outliers that can skew the data.That could be another R markdown project to do later.

confusionMatrix(predictRF, testing$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction healthy myoPain
##    healthy       0       0
##    myoPain       2       1
##                                           
##                Accuracy : 0.3333          
##                  95% CI : (0.0084, 0.9057)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : 0.9630          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.3333          
##              Prevalence : 0.6667          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : healthy         
## 

Our model using bootstrap in random forest an ensemble classifier and regression algorithm scored pretty well in the 1st part of this project at 67% accuracy. There were limited samples here but bootstrapping helps to get a natural set of samples using the Central limit Theorem and other principles from econometrics. But here only 1 of the 3 samples was corretly identified with an accuracy of 33% which is not very good at all.

Lets quickly test how increasing to 100 bootstrap samples will change the metric results.

control_random <- trainControl(method = "boot", number = 100)  
metric = "Accuracy"

set.seed(567)

fitRF2 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF2 <- predict(fitRF2, testing)
RF_df2 <- data.frame(predictRF2, type=testing$class)
RF_df2
importance2 <- varImp(fitRF2)
importance2
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                 Overall
## ENSG00000108342 100.000
## ENSG00000081041  85.236
## ENSG00000138135  78.691
## ENSG00000180818  75.951
## ENSG00000267368  63.927
## ENSG00000136457  44.140
## ENSG00000235590  31.507
## ENSG00000143816  25.266
## ENSG00000148677  18.113
## ENSG00000134184  17.504
## ENSG00000163734  14.003
## ENSG00000255730  12.024
## ENSG00000136244  11.720
## ENSG00000110848  11.263
## ENSG00000125740   7.915
## ENSG00000244193   6.697
## ENSG00000205362   5.936
## ENSG00000007908   5.936
## ENSG00000248871   3.044
## ENSG00000107518   2.740
plot(importance2)

confusionMatrix(predictRF2, testing$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction healthy myoPain
##    healthy       1       0
##    myoPain       1       1
##                                           
##                Accuracy : 0.6667          
##                  95% CI : (0.0943, 0.9916)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : 0.7407          
##                                           
##                   Kappa : 0.4             
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5000          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.6667          
##          Detection Rate : 0.3333          
##    Detection Prevalence : 0.3333          
##       Balanced Accuracy : 0.7500          
##                                           
##        'Positive' Class : healthy         
## 

The results were the same using 100 bootstrapped samples vs 50 as the first part of this project’s separate earlier posting of 67%, and better here on 2nd run with 100 bootstrapped samples rather than 50 as last run done. Lets try 1000 bootstrap samples and see if better.

control_random <- trainControl(method = "boot", number = 1000)  
metric = "Accuracy"

set.seed(567)

fitRF3 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF3 <- predict(fitRF3, testing)
RF_df3 <- data.frame(predictRF3, type=testing$class)
RF_df3
importance3 <- varImp(fitRF3)
importance3
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                 Overall
## ENSG00000138135 100.000
## ENSG00000267368  80.899
## ENSG00000081041  76.124
## ENSG00000108342  61.096
## ENSG00000180818  59.831
## ENSG00000136457  32.163
## ENSG00000235590  30.899
## ENSG00000148677  21.208
## ENSG00000007908  15.730
## ENSG00000134184  14.045
## ENSG00000143816  13.062
## ENSG00000110848  11.798
## ENSG00000255730   9.270
## ENSG00000163734   8.567
## ENSG00000136244   7.725
## ENSG00000205362   7.022
## ENSG00000281899   4.073
## ENSG00000172867   3.933
## ENSG00000107518   2.949
## ENSG00000105427   2.669
plot(importance3)

confusionMatrix(predictRF3, testing$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction healthy myoPain
##    healthy       0       0
##    myoPain       2       1
##                                           
##                Accuracy : 0.3333          
##                  95% CI : (0.0084, 0.9057)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : 0.9630          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.3333          
##              Prevalence : 0.6667          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : healthy         
## 

The results with 1000 bootstrapped samples was poor on this 3rd run at 33% the same as the 1st run of 50 bootstrapped samples. Lets look again at the data frame of predicted results for the 50, 100, and then 1000 bootstrapped samples side by side to the test classification that was supposed to be identical to or predicted correctly.

The 50 bootstrapped samples below.

RF_df

The 100 bootstrapped samples below.

RF_df2

The 1000 bootstrapped samples below

RF_df3

In this version the 3rd sample is correctly identified as myoPain in all runs but only the 100 bootstrapped run correctly identified one of the healthy classes that was the first test observation.

Lets get those overall genes from the importance of variables output and write it to csv to add the genomic data to the top genes.

DF <- data.frame(importance[1])

DF

Order in descending order.

DF$Gene <- row.names(DF)
DataTop5 <- DF[,c(2,1)]
row.names(DataTop5) <- NULL
DataTop5genes <- DataTop5[order(DataTop5$Overall, decreasing=T),]
DataTop5genes
GenesTop5 <- DataTop5genes[1:5,]

write.csv(GenesTop5, "top5GenesMyalgia_part2.csv", row.names=F)

GenesTop5

Get the file we just wrote for part2 of means of samples top 5 genes here the alternate genecards names, summaries, and my own worded summary based on those summaries by genecards, NCBI, and UniProt will be added later.

The top5 genes of part 1 by means of samples is here and each one had an alternate Genecard name so that as well as the genecards.org, NCBI, and UniProt summaries as well as my own paraphrased summary of the gene based on those 3 summaries are added to this file.

Lets see if cross validation in the parameter tuning for random forest in the caret package of R has similar results in classification

control_random <- trainControl(method = "cv", number = 100)  
metric = "Accuracy"

set.seed(567)

fitRF4 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF4 <- predict(fitRF4, testing)
RF_df4 <- data.frame(predictRF4, type=testing$class)
RF_df4
importance4 <- varImp(fitRF4)
importance4
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                 Overall
## ENSG00000108342  100.00
## ENSG00000180818   97.13
## ENSG00000267368   93.22
## ENSG00000138135   84.87
## ENSG00000081041   70.93
## ENSG00000136457   55.24
## ENSG00000148677   54.27
## ENSG00000235590   50.92
## ENSG00000007908   43.90
## ENSG00000107518   43.32
## ENSG00000143816   39.55
## ENSG00000163734   38.46
## ENSG00000255730   36.22
## ENSG00000136244   32.42
## ENSG00000110848   30.19
## ENSG00000281899   26.41
## ENSG00000134184   26.07
## ENSG00000125740   24.65
## ENSG00000244193   24.41
## ENSG00000267727   24.08
plot(importance4)

confusionMatrix(predictRF4, testing$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction healthy myoPain
##    healthy       0       0
##    myoPain       2       1
##                                           
##                Accuracy : 0.3333          
##                  95% CI : (0.0084, 0.9057)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : 0.9630          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 0.0000          
##             Specificity : 1.0000          
##          Pos Pred Value :    NaN          
##          Neg Pred Value : 0.3333          
##              Prevalence : 0.6667          
##          Detection Rate : 0.0000          
##    Detection Prevalence : 0.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : healthy         
## 

Using cross validation with 100 folds, the same number of bootstrap samples in parameters of random forest classifier, returned poor results here at 33% by predicting the only healthy class the same as all the runs of the bootstrapped means of samples in earlier three runs. Good to compare. Will see what is output in published version to compare as well.

So, there were problems in knitr not running the same resulting outputs and classifying much better in the knitr html and pdf version and it will probably be the same for this run when published, but just know that the chunk run outputs have a different cache to empty and clear that could be related to the latex software using that runs its own memory store and it was not corrected because not solved and will likely produce outputs not the same as in markdown runs inside Rstudio.

*** Machine running on the medians of the top 40 genes in healthy vs myofascial pain samples using 50, 100, and 1000 bootstrapped samples of random forest classifier in R ***

Now we run the same algorithms with the medians data sets.

Bootstrapping 50 samples in parameters of random forest classifier with the medians’ top 40 genes.

control_random <- trainControl(method = "boot", number = 50)  
metric = "Accuracy"

set.seed(12345)
intrain <- sample(1:12, .8*12)
training <- genes40_DF2[intrain,]
testing <- genes40_DF2[-intrain,]
training %>% group_by(class) %>% count(class)
testing %>% group_by(class) %>% count(class)
set.seed(567)
fitMedians1 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRFmedians <- predict(fitMedians1, testing)
RF_df_medians <- data.frame(predictRFmedians, type=testing$class)
RF_df_medians

We can see above the model predicted all the myopain and healthy classes which were 1 and 2 respectively with 100% accuracy using the medians of the samples.

importanceMedians <- varImp(fitMedians1)
importanceMedians
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                 Overall
## ENSG00000180818 100.000
## ENSG00000250978  91.475
## ENSG00000268292  79.637
## ENSG00000235590  49.134
## ENSG00000166676  38.204
## ENSG00000148677  34.384
## ENSG00000128242  31.532
## ENSG00000134184  28.788
## ENSG00000143816  23.757
## ENSG00000284686  16.381
## ENSG00000244193   7.130
## ENSG00000008056   6.403
## ENSG00000281899   5.381
## ENSG00000187634   4.843
## ENSG00000221932   4.305
## ENSG00000273000   3.313
## ENSG00000224114   3.094
## ENSG00000172867   3.010
## ENSG00000160883   2.421
## ENSG00000107518   2.287
plot(importanceMedians)

lets compare the values in our top gene above that is ENSG00000180818.

topGene <- genes40_DF2[order(genes40_DF2$ENSG00000180818,decreasing =T),]
topGene$ENSG00000180818
##  [1] 9.1766551 5.5757884 5.3240499 3.7774643 2.9742342 2.9142322 1.2094433
##  [8] 0.7935440 0.6286581 0.3682484 0.3130012 0.1822497

These median values above are realistic to the true population rather than outliers that can skew the data. That is a tangent that came out earlier in the first part and what we explore now in this 2nd part of our machine learning of top genes in healthy vs myofascial pain samples. It appears the values are between 0.18 and 9.17 for the top gene produced on medians of the samples in bootstrapped 50 sample random forest modeling parameter.

confusionMatrix(predictRFmedians, testing$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction healthy myoPain
##    healthy       2       0
##    myoPain       0       1
##                                      
##                Accuracy : 1          
##                  95% CI : (0.2924, 1)
##     No Information Rate : 0.6667     
##     P-Value [Acc > NIR] : 0.2963     
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6667     
##          Detection Rate : 0.6667     
##    Detection Prevalence : 0.6667     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : healthy    
## 

In Rstudio the results are 100% accuracy and same for sensitivity and specificity. Two healthy samples were correctly identified and the one myopain sample was correctly identified.

Lets see if there is consistency with 100 and 1000 bootstrapped samples on the medians top 40 genes of healthy vs myofascial pain samples.

Bootstrap of 100 samples in Random Forest classifier on the top 40 by medians of the fold change comparison of myofascial pain vs. healthy samples.

control_random <- trainControl(method = "boot", number = 100)  
metric = "Accuracy"

set.seed(12345)
intrain <- sample(1:12, .8*12)
training <- genes40_DF2[intrain,]
testing <- genes40_DF2[-intrain,]
training %>% group_by(class) %>% count(class)
testing %>% group_by(class) %>% count(class)
set.seed(567)
fitMedians2 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRFmedians2 <- predict(fitMedians2, testing)
RF_df_medians2 <- data.frame(predictRFmedians2, type=testing$class)
RF_df_medians2

We can see above the model predicted only the healthy classes correctly and not the last class of myoPain.

importanceMedians2 <- varImp(fitMedians2)
importanceMedians2
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                 Overall
## ENSG00000180818 100.000
## ENSG00000268292  88.257
## ENSG00000250978  81.335
## ENSG00000235590  44.005
## ENSG00000166676  37.577
## ENSG00000148677  28.677
## ENSG00000128242  26.452
## ENSG00000134184  17.923
## ENSG00000143816  16.934
## ENSG00000284686  12.608
## ENSG00000221932   8.405
## ENSG00000244193   7.664
## ENSG00000172867   4.944
## ENSG00000187634   4.574
## ENSG00000224114   3.585
## ENSG00000182324   3.585
## ENSG00000281899   3.461
## ENSG00000267727   2.472
## ENSG00000008056   2.472
## ENSG00000242288   1.978
plot(importanceMedians2)

lets compare the values in our top gene above that is ENSG00000180818 that is the same top gene in earlier run of 50 bootstrap samples.

topGene <- genes40_DF2[order(genes40_DF2$ENSG00000180818,decreasing =T),]
topGene$ENSG00000180818
##  [1] 9.1766551 5.5757884 5.3240499 3.7774643 2.9742342 2.9142322 1.2094433
##  [8] 0.7935440 0.6286581 0.3682484 0.3130012 0.1822497

These median values above are realistic to the true population rather than outliers that can skew the data.

confusionMatrix(predictRFmedians2, testing$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction healthy myoPain
##    healthy       2       1
##    myoPain       0       0
##                                           
##                Accuracy : 0.6667          
##                  95% CI : (0.0943, 0.9916)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : 0.7407          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.6667          
##          Detection Rate : 0.6667          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : healthy         
## 

Seems to be inconsistent, not the same accuracy of excellence in using more bootstrapped samples of 100 compared to 50. Lets see if even more samples of 1000 bootstrapped samples will perform well.

Bootstrap of 1000 samples in Random Forest classifier on the top 40 by medians of the fold change comparison of myofascial pain vs. healthy samples.

control_random <- trainControl(method = "boot", number = 1000)  
metric = "Accuracy"

set.seed(12345)
intrain <- sample(1:12, .8*12)
training <- genes40_DF2[intrain,]
testing <- genes40_DF2[-intrain,]
training %>% group_by(class) %>% count(class)
testing %>% group_by(class) %>% count(class)
set.seed(567)
fitMedians3 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRFmedians3 <- predict(fitMedians3, testing)
RF_df_medians3 <- data.frame(predictRFmedians3, type=testing$class)
RF_df_medians3

We can see above the model predicted all the myoPain incorrectly and healthy classes correctly, which were 1 and 2 respectively with 100% accuracy for healthy classes but 0% accuracy for myopain class using the medians of the samples and 1000 bootstrap samples.There is a reason for this, in using bootstrap of samples the majority of my econometric courses said having 40 samples is needed to get more realistic results in bootstrapping a large population out of a small population using the Central Limit Theorem and identically and independently distributing the data with replacement. This could also be a case of garbage in and garbage out as data scientists say. But many algorithms to use to see if a better result can be found in K-nearest neighbors, support vector machines in radial or concentric levels or linearly separable data, decision trees of rcart in caret for boosting models using learning rates, where bagging its bootstrap aggregating, k-means, bayesian classification models, linear discriminate analysis, and maybe a few more.

importanceMedians3 <- varImp(fitMedians3)
importanceMedians3
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                 Overall
## ENSG00000180818 100.000
## ENSG00000268292  99.099
## ENSG00000250978  80.566
## ENSG00000235590  45.946
## ENSG00000148677  32.046
## ENSG00000134184  31.145
## ENSG00000128242  23.166
## ENSG00000143816  22.136
## ENSG00000166676  19.562
## ENSG00000284686  14.414
## ENSG00000221932   8.623
## ENSG00000244193   8.366
## ENSG00000281899   6.821
## ENSG00000224114   5.920
## ENSG00000187634   5.148
## ENSG00000160883   3.089
## ENSG00000107518   2.574
## ENSG00000267727   2.574
## ENSG00000223343   2.445
## ENSG00000170509   2.445
plot(importanceMedians3)

lets compare the values in our top gene above that is ENSG00000180818 that is the same top gene in earlier run of 50 and 100 bootstrap samples.

topGene <- genes40_DF2[order(genes40_DF2$ENSG00000180818,decreasing =T),]
topGene$ENSG00000180818
##  [1] 9.1766551 5.5757884 5.3240499 3.7774643 2.9742342 2.9142322 1.2094433
##  [8] 0.7935440 0.6286581 0.3682484 0.3130012 0.1822497

These median values above are realistic to the true population rather than outliers that can skew the data.

confusionMatrix(predictRFmedians3, testing$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction healthy myoPain
##    healthy       2       1
##    myoPain       0       0
##                                           
##                Accuracy : 0.6667          
##                  95% CI : (0.0943, 0.9916)
##     No Information Rate : 0.6667          
##     P-Value [Acc > NIR] : 0.7407          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value :    NaN          
##              Prevalence : 0.6667          
##          Detection Rate : 0.6667          
##    Detection Prevalence : 1.0000          
##       Balanced Accuracy : 0.5000          
##                                           
##        'Positive' Class : healthy         
## 

Seems that with 1000 bootstrap samples the top gene is the same but the last class was incorrectly identified as healthy when it was myofascial pain. The accuracy in Rstudio output with these parameters was 67% with 2 healthy classes correctly identified but one false positive of healthy class identified making the sensitivity 100% but the specificity 0% for not correctly identifying any of the myofascial classes. There was only 1 sample of myofascial and 2 of healthy so not the best population to form samples hence why bootstrap was done to simulate more randomness in a larger population with more samples.

Lets write out the top 5 features of our top model in the median top gene fold change samples of healthy vs myofascial pain.

importance50bags <- data.frame(importanceMedians[1])

genesImportant <- row.names(importance50bags)
importance50bags$gene <- genesImportant

importance50baggMedians <- importance50bags[order(importance50bags$Overall, decreasing=T),]
importance50baggMedians1 <- importance50baggMedians[,c(2,1)]
importance50baggMedians1
write.csv(importance50baggMedians1[1:5,], 'mediansImportanceFeatures50baggingTop5.csv',row.names=F)

You can find the above top 5 genes of importance in our best performing model of the median fold change values of healthy to myofascial pain here.

Now lets see how the 100 folds of cross validation parameter will do in the median samples compared to the bootstrapped samples.

control_random <- trainControl(method = "cv", number = 100)  
metric = "Accuracy"

set.seed(12345)
intrain <- sample(1:12, .8*12)
training <- genes40_DF2[intrain,]
testing <- genes40_DF2[-intrain,]
training %>% group_by(class) %>% count(class)
testing %>% group_by(class) %>% count(class)
set.seed(567)
fitMedians4 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRFmedians4 <- predict(fitMedians4, testing)
RF_df_medians4 <- data.frame(predictRFmedians4, type=testing$class)
RF_df_medians4

Using 100 folds of cross validation parameter instead of bootstrapping in Rstudio gave an output of 100% accuracy with all healthy and myofascial pain classes correctly identified.

importanceMedians4 <- varImp(fitMedians4)
importanceMedians4
## rf variable importance
## 
##   only 20 most important variables shown (out of 40)
## 
##                 Overall
## ENSG00000250978 100.000
## ENSG00000268292  85.807
## ENSG00000180818  82.284
## ENSG00000235590  43.153
## ENSG00000166676  37.416
## ENSG00000128242  31.554
## ENSG00000148677  26.269
## ENSG00000134184  23.073
## ENSG00000143816  18.334
## ENSG00000224114  10.900
## ENSG00000284686   6.909
## ENSG00000242288   5.987
## ENSG00000244193   5.912
## ENSG00000221932   4.615
## ENSG00000223343   4.116
## ENSG00000281899   3.866
## ENSG00000160883   3.866
## ENSG00000008056   3.492
## ENSG00000107518   3.417
## ENSG00000182324   2.744
plot(importanceMedians4)

Here the top gene is not ending in 180818 but instead is ENSG00000250978 and lets look at the values in this gene per sample.

topGene <- genes40_DF2[order(genes40_DF2$ENSG00000250978,decreasing =T),]
topGene$ENSG00000250978
##  [1] 72.462711 65.868877 64.451998 39.580146 36.070355 13.645273  6.011879
##  [8]  5.645763  3.889121  3.704525  3.368245  3.108345

The range is 3.1 to 72.46 with the median 13.64 between a higher value of 36.07 and lower value of 6.01.

plot(topGene$ENSG00000250978)

The healthy samples by index are the first 5 samples and the latter 7 samples are the myofascial pain samples. This gene has higher gene expression values for the healthy class and lower expression values for the myofascial pain class. It is the top feature of importance in classifying a healthy sample from a myofascial pain sample using caret’s random forest set to parameter of 100 folds of cross validation with Accuracy the metric.

confusionMatrix(predictRFmedians4, testing$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction healthy myoPain
##    healthy       2       0
##    myoPain       0       1
##                                      
##                Accuracy : 1          
##                  95% CI : (0.2924, 1)
##     No Information Rate : 0.6667     
##     P-Value [Acc > NIR] : 0.2963     
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6667     
##          Detection Rate : 0.6667     
##    Detection Prevalence : 0.6667     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : healthy    
## 

Seems like these genes could be useful in determining prediction classifiers for healthy and myofascial pain so lets save these top 5 genes to a file and get gene summaries for later.

samples of healthy vs myofascial pain.

importance100cv <- data.frame(importanceMedians4[1])

genesImportant_cv <- row.names(importance100cv)
importance100cv$gene <- genesImportant_cv

importance100cv_Medians <- importance100cv[order(importance100cv$Overall, decreasing=T),]
importance100cv_Medians1 <- importance100cv_Medians[,c(2,1)]
importance100cv_Medians1
write.csv(importance100cv_Medians1[1:5,], 'mediansImportanceFeatures100cvTop5.csv',row.names=F)

You can find the file here, that we just uploaded of top 5 genes with 100% accuracy in 100 folds of cross validation in predicting healthy or myofascial pain classes in our samples by top genes of fold change values of the medians of healthy vs medians of myofascial pain samples.

Thanks for viewing this document. Still some things to do as we found 100% accurate models in tuning parameters in 50 baggs of fold change means, 100 folds of cross validation and 100 baggs of the medians of our fold change values of healthy vs. myofascial pain in Rstudio. However, when this gets ran through knitr to publish and upload as html to rpubs it may have completely different results. Feel free to grab the google sheets data and explore for yourself with the same set.seed settings and parameters and files to see what you find.

Summaries have to be added to the genes of top 5 from best models as well to get a general idea of the features most important in separating healthy from myofascial pain samples of real people.

I have updated summaries on New Year’s Eve 2025. This part 2 of project was published with part 1 date but was done on 12-30-2025. Today the 1st of January in the year 2026 I am going to add to this file a review of the summaries of the top performing models’ top 5 genes.

Our top performing models were the means of the myofascial pain in ratio to means of healthy sample to get fold changes per gene per this ratio. We used 50 bootstrap aggregated samples in the model of random forest in caret. The 2nd model was the medians of the sample fold change per gene top 5 using 50 bootstap aggregates in the random forest model. And the third was the cross validation instead of bootstrapped samples with 100 folds.

Lets read in those models that the summaries have been added to.

means50baggsRF_top5 <- read.csv('top5GenesMyalgia_part2_summariesAdded_means50Baggs.csv', header=T, na.strings=c('',' ','na','NA'))

                                                                          medians50baggsRF_top5 <- read.csv('mediansImportanceFeatures50baggingTop5_summariesAdded.csv', header=T, na.strings=c('',' ','na','NA'))


medians100foldsCV_RF_top5 <- read.csv('mediansImportanceFeatures100cvTop5_summariesAdded.csv', header=T, na.strings=c('',' ','na','NA'))

Lets add a model identification feature to each model so we can see side by side importance of the top 5 features by which model selected it.

means50baggsRF_top5$model <- as.factor('means50baggsRF')
medians50baggsRF_top5$model <- as.factor('medians50baggsRF')
medians100foldsCV_RF_top5$model <- as.factor('medians100foldsRF')
colnames3 <- colnames(means50baggsRF_top5)

colnames(medians50baggsRF_top5) <- colnames3
colnames(medians100foldsCV_RF_top5) <- colnames3

allDF3models <- rbind(means50baggsRF_top5,medians50baggsRF_top5,medians100foldsCV_RF_top5)

allDF3models
means50RF_baggs <- means50baggsRF_top5$Gene
medians50RF_baggs <- medians50baggsRF_top5$Gene
medians100RF_cv <- medians100foldsCV_RF_top5$Gene



allGenes <- c(means50RF_baggs, medians50RF_baggs, medians100RF_cv)
allGenes
##  [1] "ENSG00000138135" "ENSG00000108342" "ENSG00000180818" "ENSG00000267368"
##  [5] "ENSG00000081041" "ENSG00000180818" "ENSG00000250978" "ENSG00000268292"
##  [9] "ENSG00000235590" "ENSG00000166676" "ENSG00000250978" "ENSG00000268292"
## [13] "ENSG00000180818" "ENSG00000235590" "ENSG00000166676"

Lets see these 15 ensemble genes by their alternate name if they have it, not all had a genecards name and so the ensemble or other name from Hugo Gene Nomenclature Committee or HGNC was used if available.

altGeneNames <- allDF3models$GeneCardsName

altGeneNames
##  [1] "CH25H"           "CSF3"            "HOXC10"          "UPK3BL1"        
##  [5] "CXCL2"           "HOXC10"          "LOC101928819"    "ENSG00000268292"
##  [9] "GNAS-AS1"        "TVP23A"          "LOC101928819"    "ENSG00000268292"
## [13] "HOXC10"          "GNAS-AS1"        "TVP23A"

The above list of 15 genes is a list of all 15 top features of the 3 models top 5 importance in classifying a healthy or fibromyalgia class. There are repeats or duplicates. Lets see those duplicate repeats.

shared <- duplicated(altGeneNames)
totalShared <- altGeneNames[shared]
totalShared
## [1] "HOXC10"          "LOC101928819"    "ENSG00000268292" "HOXC10"         
## [5] "GNAS-AS1"        "TVP23A"

The genes that are seen in multiple models as top features are the HOXC10, LOC101928819, ENSG00000268292, GNAS-AS1, and TVP23A genes. The HOXC10 gene is duplicated in the list of totalShared.

Lets see this dataframe of the shared top genes and only make one HOXC10 in the list by removing the first instance of it in the list.

SharedGenesDF <- allDF3models[allDF3models$GeneCardsName %in% totalShared[2:6],]
SharedGenesDF

The common unique genes to all 3 models’ top 5 important features total five genes of: HOXC10, LOC101928819, ENSG00000268292, GNAS-AS1, and TVP23A.

These genes are transcription factor regulator gene, three long non-coding RNA genes that maintain DNA chromatin wrapped integrity and in transcription of DNA into RNA. Another gene is a membrane protein of the organelle in the cell cytoplasm for the golgi apparatus that helps package and transport products to other cell organelles of endosomes and lysosomes for product and toxin recycling or destroying and to the plasma membrane for transportation outside the cell for body to use. This makes sense since this is gene expression with RNA data that is inside the nucleus and needs trasportation of protein products outside the cell nucleus and cytoplasm. We didn’t really find any remarkable genes other than transcription regulators in the cell that make proteins. Seems like fibromyalgia sufferers have chronic pain from nuclear cell processes that are persisting beyond injury and are chronically inflamed systematically.

The following is supposed to isolate those genes that are not shared but it brings back all genes and their duplicates from original 15 genes.

totalNotShared <- altGeneNames[!(altGeneNames %in% shared)]
totalNotShared
##  [1] "CH25H"           "CSF3"            "HOXC10"          "UPK3BL1"        
##  [5] "CXCL2"           "HOXC10"          "LOC101928819"    "ENSG00000268292"
##  [9] "GNAS-AS1"        "TVP23A"          "LOC101928819"    "ENSG00000268292"
## [13] "HOXC10"          "GNAS-AS1"        "TVP23A"

If the output is different quite possibly, let me print out the results of above code. [1] “CH25H” “CSF3” “HOXC10”
[4] “UPK3BL1” “CXCL2” “HOXC10”
[7] “LOC101928819” “ENSG00000268292” “GNAS-AS1”
[10] “TVP23A” “LOC101928819” “ENSG00000268292” [13] “HOXC10” “GNAS-AS1” “TVP23A”

You can see all genes returned. The genes that are not shared are CH25H, CSF3, CXCL2, and UPK3BL1.

I will manually see what models selected these genes and compare the results and gene summary.

notSharedDF <- allDF3models[allDF3models$GeneCardsName %in% c("CH25H", "CSF3","UPK3BL1","CXCL2"),]
notSharedDF

The first model selected all these genes of the means of samples myofascial pain over healthy means both fold change values per gene. No other model selected these genes. In this model, Accuracy was the metric, and the method was random forest in caret with a trControl of bootstrap aggregating with 50 randomized samples.

Overall these 4 genes not shared among all 3 models show CH25H increases fat metabolism and protection against enveloped viruses, CSF3 increases neutrophil granulocytes and slows hematopoeisis in the process, UPK3BL1 has a membrane protein that may be able to do what membrane proteins do is guard the entry and exit of cell molecules and products or resources via a gate or receptor mechanism, and CXCL2 is a cytokine seen at the site of injury with inflammation where activated neutrophils and macrophages produce it. That says the uncommon genes between these models as top importance features in classifying fibromyalgia vs healthy, that inflammation is present, infection is a possibility protected against with a pain enducing cytokine, WBCs are increased, and cell proliferation and differentiation is slowed in hematopoiesis of lymphocytes and RBCs while increasing WBCs of neutrophils and macrophages. The plasma membrane of cells have an increased contender with unstated cell function, while the fatty components of the plasma membrane a lipid or fat is being destroyed by increased fat metabolism that also destroys enveloped viruses. Its like the host’s body is shut down and not allowing anything close to it while it looks to destroy any new antigens or persisting ones.

All this data was taken from the fragment portion of genes in the nuclear RNA sequence data and the count data wasn’t used but available of the number of fragments per kilo million called FPKM. In the cell fragments of DNA can be abundant as it is being translated or destroyed, so there could be information lacking in the sequencing that allows certain transcription factors that allow translation of RNA into a protein to continue while having the excess RNA that is old in circulation recycled hence the golgi apparatus protein and use of lysosome and endosome functions in the common gene of TVP23A.

All genes tell a story in the body suffering myalgia pain that there is persistant inflammation from cytokine activity, slowed hematopoeisis, increased neutrophil and macrophage or monocyte activity, transcription factor long non-coding RNA activity that regulate copying of DNA to be transcribed into RNA ready for translating a protein to use by the body. And the HOXC10 gene that is a embryo development gene responsible for giving cells their anterior and posterior cell position and seen in nerve disorders when defective. The HOXC10 is common to all3 models and does encode a gene as well as is a transcription factor. Myalgia pain if greater than injury and chronic is known to have nerve deficits and a low threshold to pain same as fibromyalgia. This last part focused on the importance features to determine the genes that are important to understanding myalgia pain.

Lets look back at the original data of samples and fold change values for all these genes to see the up or down regulation of fold change values for these 15 top importance features genes.

top15DF <- allDF3models[allDF3models$GeneCardsName %in% altGeneNames,]
top15DF
colnames(top15DF)
## [1] "Gene"               "Overall"            "GeneCardsName"     
## [4] "geneCardsSummary"   "NCBI_summary"       "UniProt_summary"   
## [7] "overallGeneSummary" "chromosomeLocation" "model"
top15DF_allsamplesMeta <- merge(allDF3models, foldchangeHealthyMyoMedians, by.x='Gene', by.y='Gene')

top15DF_allsamplesMeta
colnames(top15DF_allsamplesMeta)
##  [1] "Gene"                          "Overall"                      
##  [3] "GeneCardsName"                 "geneCardsSummary"             
##  [5] "NCBI_summary"                  "UniProt_summary"              
##  [7] "overallGeneSummary"            "chromosomeLocation"           
##  [9] "model"                         "Healthy1"                     
## [11] "Healthy2"                      "Healthy3"                     
## [13] "Healthy4"                      "Healthy5"                     
## [15] "myo1"                          "myo2"                         
## [17] "myo3"                          "myo4"                         
## [19] "myo5"                          "myo6"                         
## [21] "myo7"                          "healthy_Mean"                 
## [23] "myo_Mean"                      "healthyMedian"                
## [25] "myoMedian"                     "foldchangeHealthyVsMyo"       
## [27] "foldchangeHealthyVsMyoMedians"

Lets move the model feature next to Overall feature to show the model and rating of the gene in prediction of class.

top15DF_all <- top15DF_allsamplesMeta[,c(1,2,9,3,7,26,27,10:25,4:8)]
top15DF_all

Lets write this data frame to csv file and compare information.

write.csv(top15DF_all, 'top15DF_all.csv', row.names=F)

You can find that file here.

Just looking at the fold change or side by side comparisons of means and median values of each gene by class of myo pain or healthy, we know all these genes that are affected are down regulated in gene expression. So, we can take what we know of these inflammatory markes and say there is less of that going on in these sufferers of myofascial pain.

Less cytokine activity, more hematopoeisis, less toxin and fat metabolism, less plasma membrane protein of some sort that could be a gate or receptor to let in and out products and substrates to use by the cell or take cell products such as proteins made into the body. There is also less gene expression of the gene HOXC10 that regulates cell position from the anterior-posterior axis in embryo development that when defective shows neurological disorders in its host. Less of the transcriptio regulators of RNA being copied into pre-mRNA to make a protein and regulate DNA is occuring in the cell.

So it seems that the fibromyalgia sufferers don’t have too much inflammation but have cellular transcription deficits of protein and making too many lipids and fat, not digesting enough toxins, and are just very toxic and in need of detoxing to get their cellular functions working properly.

You can go to the beginning and read all about the study that created this gene expression data and shared on NCBI by visiting their GSE series and GPL platform. Those are found at GSE215154 series data

Part 1 is uploaded already to Rpubs and this has additions to that first published markdown file so that it is longer.

Since we got great results already using Random Forest classifier in bootstrap and cross validation parameter tuning we assumed these genes are those top genes influencing myofascial pain patterns. We investigated these genes produced and found many cellular processes were downregulated and a common denominator gene of HOXC10 was found under expressed in myofascial cases and a top feature in predicting a class as healthy or of having myofascial pain. HOXC10 when defective can have neurological deficits in the body. Many other top genes were transcription regulators, down regulation of the cellular processes that detox fats and toxins from the body, and down regulation of a plasma protein that may be a cell receptor or cell gate that allows entry and exit of products and substrates into the cell and products the body needs out of the cell. The series that created these samples of FPKM and counts of RNA molecules didn’t say if these are the RNA sequencing gene expression profiles before injection of MEF2C a known pain enhancer of musculoskeletal pain or after injection of DEX to inhibit pain after the samples were injected with MEF2C, or if these samples are before administering of any exogenous substance. So, we assume with this data science project that the samples are at some state during the experiment where they are separate classes or samples from individual hosts, not the same host at different points in time, and relate to each other in whatever process such as the healthy and myofascial pain samples are each at a point in time before administering of exogenous substances, at the time of administering MEF2C, or at the time of administering DEX done after administering MEF2C. The way the study was described, makes it possible that the myoPain class is actually the only one MEF2C was administered and also administered DEX to inhibit pain, and the healthy class is that which was injected with TNF-a a pain inflammatory biomarker, and then the healthy class was administered with DEX to inhibit pain.

Epstein-Barr is a virus with associations claimed as having with sufferers of fibromyalgia though not really enough to assert it into the gold standard testing and diagnostic pathologies of Merck Epstein-Barr is also associated loosely with multiple sclerosis and strongly with mononucleosis that is strongly associated with Epstein-Barr infection. Hodgkin Lymphoma is also associated loosely with an Epstein-Barr infection at one point in life.

This is a stepping stone research into finding out if those top genes are making appearances in samples gathered from online respectable resources to compare multiple sclerosis, fibromyalgia, mononucleosis, EBV infection, and Hodgkin disease.

Thanks so much and keep checking in to see this progress.