knitr::opts_chunk$set(cache = FALSE)
This markdown file for R is about myofascial pain and taken from the GSE215154 series data that found high throughput sequencing fastp data of 5 healthy samples compared to 7 samples of patients with triggerpoint myofascial pain. A search of the GSE215154 series data at NCBI gene studies will bring you to the series.
The series is published under Homo Sapiens or humans for the gene expression profiles, but the study says it was done on rats in the text description. They injected a known pain inflammatory biomarker MEF2C or myocyte enhancer factor 2 into rats and saw increased pain AKA lowered pain threshold but then injected one of those inflammatory biomarkers in healthy samples and saw TNF-a create increased pain or lowered thresholds to pain. They then administered the inflammatory biomarker inhibitor, dexmedetomidine or DEX into both samples and saw pain threshold increase and inflammatory biomarkers decreased. These samples are what was available so uncertain if these are samples before or after injection of MEF2C that causes musculoskeletal pain or after injection of DEX to inhibit pain. But there are clear differences in the samples as the genes have clear separation in levels of gene expression in healthy vs. the myofascial pain groups.
I manually downloaded each text file of the samples and the 7 text files for trigger point myofascial pain. This was the most recent study evaluating fibromyalgia patient genomic data I found a month or so ago and am now reviewing it in searching for top genes in myofascial pain or chronic pain patterns, mononucleosis, multiple sclerosis, Hodgkins disease, and Epstein-Barr infections that relate back to the common denominator at some point of life for other pathologies listed. This data science project focuses on chronic pain in those with myofascial pain patterns similar to fibromyalgia that currently does not have any diagnostic gene or associated genes or special lab testing to confirm disease. Currently only a questionnaire is available and monitoring over about 3 months while ruling out autoimmune disorders to diagnose fibromyalgia as a source of pain out of normal threshold values. This disorder was previously an ‘in your head’ disease that females got mostly and doctors would tell them to take OTC for pain and now get lifestyle in check to improve diet, sleep, stress, exercise, massage, adjustments, and meditation.
Lets read in the sample text files separately and combine them into one data frame. The sample information says that the aquisition of the data used a count of the fragments of each gene, then normalized it, and also added in the FPKM values as a final step. These are tab delimited files. When reading in the file per sample you see the gene as well as the count (of fragments then normalized) and FPKM value per gene. Internet search says the normalization is the FPKM values for the counts of fragments in RNA-seq data of transcriptional gene data using a fragment length. The gene length is multiplied by 10 billion or 10^9= 10,000,000,000, then divided by the transcripts of gene X gene length. More Information on FPKM is fragments per kilo million and TPM is transcripts per million and CPM is counts per million.
Lets load in our packages first to use later.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
healthy1 <- read.table('GSM6623971_C1.txt', header=T, sep="")
summary(healthy1)
## gene_id C1_FPKM C1_count
## Length:58302 Min. :0.000e+00 Min. : 0.0
## Class :character 1st Qu.:0.000e+00 1st Qu.: 0.0
## Mode :character Median :0.000e+00 Median : 0.0
## Mean :5.472e+00 Mean : 197.1
## 3rd Qu.:3.379e-01 3rd Qu.: 10.0
## Max. :3.058e+04 Max. :754337.0
head(healthy1)
Read in the other 4 healthy samples as tables
healthy2 <- read.table('GSM6623972_C2.txt', header=T, sep="")
healthy3 <- read.table('GSM6623973_C3.txt', header=T, sep="")
healthy4 <- read.table('GSM6623974_C4.txt', header=T, sep="")
healthy5 <- read.table('GSM6623975_C5.txt', header=T, sep="")
Lets column bind them as they all have the same number of rows and columns, but lets remove one of the variables, the counts, and keep FPKM or fragments per kilomillion of each gene. And remove the other gene columns other than the first table.
colnames(healthy1)[2] <- 'healthy1-FPKM'
colnames(healthy2)[2] <- 'healthy2-FPKM'
colnames(healthy3)[2] <- 'healthy3-FPKM'
colnames(healthy4)[2] <- 'healthy4-FPKM'
colnames(healthy5)[2] <- 'healthy5-FPKM'
healthy5samples <- cbind(healthy1[,-3], healthy2[,-c(1,3)],healthy3[,-c(1,3)], healthy4[,-c(1,3)], healthy5[,-c(1,3)])
colnames(healthy5samples) <- c('Gene','Healthy1','Healthy2','Healthy3','Healthy4','Healthy5')
head(healthy5samples)
Lets do the same with the myofacial 7 samples.
myofascial1 <- read.table('GSM6623976_M1.txt', header=T, sep="")
myofascial2 <- read.table('GSM6623977_M2.txt', header=T, sep="")
myofascial3 <- read.table('GSM6623978_M3.txt', header=T, sep="")
myofascial4 <- read.table('GSM6623979_M4.txt', header=T, sep="")
myofascial5 <- read.table('GSM6623980_M5.txt', header=T, sep="")
myofascial6 <- read.table('GSM6623981_M6.txt', header=T, sep="")
myofascial7 <- read.table('GSM6623982_M7.txt', header=T, sep="")
colnames(myofascial1)[2] <- 'myofascial1-FPKM'
colnames(myofascial2)[2] <- 'myofascial2-FPKM'
colnames(myofascial3)[2] <- 'myofascial3-FPKM'
colnames(myofascial4)[2] <- 'myofascial4-FPKM'
colnames(myofascial5)[2] <- 'myofascial5-FPKM'
colnames(myofascial6)[2] <- 'myofascial6-FPKM'
colnames(myofascial7)[2] <- 'myofascial7-FPKM'
myofascial7samples <- cbind(myofascial1[,-3], myofascial2[,-c(1,3)],myofascial3[,-c(1,3)], myofascial4[,-c(1,3)], myofascial5[,-c(1,3)], myofascial6[,-c(1,3)], myofascial7[,-c(1,3)])
colnames(myofascial7samples) <- c('Gene','myo1','myo2','myo3','myo4','myo5','myo6','myo7')
head(myofascial7samples)
Lets column bind the healthy 5 samples to the 7 myofascial pain samples with the Gene order they all have the same number of rows of.
dataHealthy5Myo7 <- cbind(healthy5samples,myofascial7samples[,-1])
head(dataHealthy5Myo7)
Summarize the data frame just made to find the mean of each sample’s normalized FPKM values within the 58,302 genes.
summary(dataHealthy5Myo7)
## Gene Healthy1 Healthy2 Healthy3
## Length:58302 Min. :0.000e+00 Min. :0.000e+00 Min. : 0.000
## Class :character 1st Qu.:0.000e+00 1st Qu.:0.000e+00 1st Qu.: 0.000
## Mode :character Median :0.000e+00 Median :0.000e+00 Median : 0.000
## Mean :5.472e+00 Mean :5.188e+00 Mean : 5.447
## 3rd Qu.:3.379e-01 3rd Qu.:6.688e-01 3rd Qu.: 0.464
## Max. :3.058e+04 Max. :2.376e+04 Max. :50031.764
## Healthy4 Healthy5 myo1
## Min. : 0.000 Min. : 0.000 Min. :0.000e+00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.:0.000e+00
## Median : 0.000 Median : 0.000 Median :0.000e+00
## Mean : 5.764 Mean : 5.556 Mean :4.861e+00
## 3rd Qu.: 0.426 3rd Qu.: 0.467 3rd Qu.:7.898e-01
## Max. :37775.449 Max. :39829.283 Max. :1.542e+04
## myo2 myo3 myo4
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 0.000 Median : 0.000
## Mean : 5.766 Mean : 5.738 Mean : 5.113
## 3rd Qu.: 0.443 3rd Qu.: 0.309 3rd Qu.: 0.376
## Max. :43180.511 Max. :40626.886 Max. :37108.454
## myo5 myo6 myo7
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 0.000 Median : 0.000
## Mean : 5.415 Mean : 5.175 Mean : 5.864
## 3rd Qu.: 0.341 3rd Qu.: 0.371 3rd Qu.: 0.377
## Max. :41750.951 Max. :37369.240 Max. :41902.286
dim(dataHealthy5Myo7)
## [1] 58302 13
write.csv(dataHealthy5Myo7, 'fragmentsPerKMill_myofascial_5healthy_7myo_58kgenes.csv',row.names=F)
Get this file we made here file
Lets get the sample mean per gene in healthy and myofascial pain samples to compare their fold change values.
healthyMeans <- dataHealthy5Myo7 %>% group_by(Gene) %>%
mutate(
healthy_Mean = mean(Healthy1:Healthy2,na.rm=T),
myo_Mean = mean(myo1:myo7)
)
colnames(healthyMeans)
## [1] "Gene" "Healthy1" "Healthy2" "Healthy3" "Healthy4"
## [6] "Healthy5" "myo1" "myo2" "myo3" "myo4"
## [11] "myo5" "myo6" "myo7" "healthy_Mean" "myo_Mean"
Now get the Median of each sample group.
healthyMeansMedians <- healthyMeans %>% group_by(Gene) %>% mutate(healthyMedian=median(Healthy1:Healthy5, na.rm=T), myoMedian = median(myo1:myo7))
colnames(healthyMeansMedians)
## [1] "Gene" "Healthy1" "Healthy2" "Healthy3"
## [5] "Healthy4" "Healthy5" "myo1" "myo2"
## [9] "myo3" "myo4" "myo5" "myo6"
## [13] "myo7" "healthy_Mean" "myo_Mean" "healthyMedian"
## [17] "myoMedian"
head(healthyMeansMedians)
Looks like the median is the same for mean in the three samples from the first 5 entries in the data.
*** Means of Healthy Vs Myofascial Pain Data Frame Top 40 Genes ***
Lets get the fold change of myofascial pain to healthy gene expression FPKM normalized values.
foldchangeHealthyMyo <- healthyMeansMedians %>% mutate(foldchangeHealthyVsMyo=(myo_Mean/(healthy_Mean+0.0000000000001)))
head(foldchangeHealthyMyo)
To avoid the NaN values of healthy 0 in denominator, I added a very very small value of one trillionth of 1. Now lets order the data by fold change from most to least and take the top 20 and bottom 20 genes and make a data frame to do some machine learning on with our 2 classes of healthy and myofascial pain.
I want to add in a Median fold change column as well.
foldchangeHealthyMyoMedians <- foldchangeHealthyMyo %>% mutate(foldchangeHealthyVsMyoMedians=(myoMedian/(healthyMedian+0.0000000000001)))
head(foldchangeHealthyMyoMedians)
FC_ordered <- foldchangeHealthyMyo[order(foldchangeHealthyMyo$foldchangeHealthyVsMyo, decreasing=T),]
write.csv(FC_ordered, 'meansFC_healthyVsMyoAllGenes.csv', row.names=F)
Get the file of all genes ordered greatest to least in fold change by means of healthy versus the myofascial pain sample means here.
head(FC_ordered)
tail(FC_ordered)
There may be a very vast number of genes with a zero value, so the bottom 20 won’t be revealing anything for the gene expression down regulation in myofascial pain.
sum(FC_ordered$foldchangeHealthyVsMyo==0)
## [1] 32916
There are 32,916 genes equal to zero in fold change. Since these values are normalized fragments per kilomillion, we will have to look at the bottom values above 0 that are lowest.
sum(FC_ordered$foldchangeHealthyVsMyo < 0)
## [1] 0
There are not any negative values. but more than 32,000 genes with 0 fold change, won’t give much information if ordering in decreasing order and last 32,000 genes have no information for fold change.
minFC <- subset(FC_ordered, FC_ordered$foldchangeHealthyVsMyo > 0)
write.csv(minFC, 'FC_ordered_means_zeroFCs_removed.csv', row.names=F)
tail(minFC)
We removed those 32,000 genes that had no information to add in fold change values for this RNA-seq data, and still have 25,386 genes left. We can now take the bottom 20 and top 20 to make a data frame to use in training a model to predict the best in our 2 classes of healthy or myofascial pain. You can get the means table of all genes with values in fold change by means of samples healthy versus myofascial pain here.
minMax20_FC <- minFC[c(1:20,25367:25386),]
dim(minMax20_FC)
## [1] 40 18
That gives us 40 genes to work with. Lets see those genes.These are Ensemble genes. They may or may not have alternate Genecard.org gene names but should have a summary on Genecards.org that we can read all about once we get our top genes.
minMax20_FC$Gene
## [1] "ENSG00000267727" "ENSG00000178372" "ENSG00000172867" "ENSG00000188508"
## [5] "ENSG00000178934" "ENSG00000167768" "ENSG00000114771" "ENSG00000095970"
## [9] "ENSG00000267368" "ENSG00000203782" "ENSG00000189001" "ENSG00000160883"
## [13] "ENSG00000271367" "ENSG00000284627" "ENSG00000204421" "ENSG00000248871"
## [17] "ENSG00000203618" "ENSG00000243466" "ENSG00000105427" "ENSG00000186458"
## [21] "ENSG00000281899" "ENSG00000235590" "ENSG00000125740" "ENSG00000180818"
## [25] "ENSG00000244193" "ENSG00000110848" "ENSG00000081041" "ENSG00000143816"
## [29] "ENSG00000136457" "ENSG00000134184" "ENSG00000163734" "ENSG00000242288"
## [33] "ENSG00000108342" "ENSG00000138135" "ENSG00000107518" "ENSG00000148677"
## [37] "ENSG00000205362" "ENSG00000007908" "ENSG00000255730" "ENSG00000136244"
write.csv(minMax20_FC, 'min20max20geneFoldchangeMyofascialPain.csv', row.names=F)
Get the file we just wrote here
colnames(minMax20_FC)
## [1] "Gene" "Healthy1" "Healthy2"
## [4] "Healthy3" "Healthy4" "Healthy5"
## [7] "myo1" "myo2" "myo3"
## [10] "myo4" "myo5" "myo6"
## [13] "myo7" "healthy_Mean" "myo_Mean"
## [16] "healthyMedian" "myoMedian" "foldchangeHealthyVsMyo"
Make the data frame but keep gene names in a character vector to make header in new transposed data frame to get samples as rows and the 40 genes as the features and add in a class feature for the sample class where we only have 12 samples of 5 healthy and 7 myofascial.
geneNames <- minMax20_FC$Gene
genes40_DF <- data.frame(t(minMax20_FC[,c(2:13)]))
colnames(genes40_DF) <- geneNames
genes40_DF$class <- c('healthy','healthy','healthy','healthy','healthy','myoPain','myoPain','myoPain','myoPain','myoPain','myoPain','myoPain')
head(genes40_DF)
Change the class from character to factor to allow confusionMatrix function to work later in this program.
genes40_DF$class <- as.factor(genes40_DF$class)
summary(genes40_DF)
## ENSG00000267727 ENSG00000178372 ENSG00000172867 ENSG00000188508
## Min. : 0.00000 Min. : 0.0000 Min. : 0.00000 Min. : 0.000
## 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.000
## Median : 0.00000 Median : 0.0000 Median : 0.00000 Median : 0.000
## Mean : 16.50957 Mean : 7.4553 Mean : 3.90308 Mean : 3.192
## 3rd Qu.: 0.09357 3rd Qu.: 0.1931 3rd Qu.: 0.01573 3rd Qu.: 0.000
## Max. :196.66379 Max. :79.0971 Max. :41.40356 Max. :30.425
## ENSG00000178934 ENSG00000167768 ENSG00000114771 ENSG00000095970
## Min. : 0.00000 Min. : 0.0000 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 0.00000 Median : 0.0000 Median : 0.000 Median : 0.0000
## Mean : 2.97886 Mean : 2.6479 Mean : 1.733 Mean : 1.5420
## 3rd Qu.: 0.04052 3rd Qu.: 0.3158 3rd Qu.: 0.000 3rd Qu.: 0.0117
## Max. :26.20979 Max. :24.8739 Max. :19.873 Max. :18.0198
## ENSG00000267368 ENSG00000203782 ENSG00000189001 ENSG00000160883
## Min. : 0.0000 Min. : 0.00 Min. :0.000 Min. :0.00000
## 1st Qu.: 0.2323 1st Qu.: 0.00 1st Qu.:0.000 1st Qu.:0.07218
## Median : 3.9112 Median : 0.00 Median :0.000 Median :0.15182
## Mean : 3.8415 Mean : 1.25 Mean :1.088 Mean :1.04723
## 3rd Qu.: 4.8437 3rd Qu.: 0.00 3rd Qu.:0.000 3rd Qu.:0.35011
## Max. :12.8966 Max. :12.68 Max. :9.905 Max. :9.61607
## ENSG00000271367 ENSG00000284627 ENSG00000204421 ENSG00000248871
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.3512 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.3838 Median :0.0000 Median :0.0000
## Mean :0.8291 Mean :0.9975 Mean :0.6937 Mean :0.9914
## 3rd Qu.:0.1984 3rd Qu.:0.6804 3rd Qu.:0.0000 3rd Qu.:0.4922
## Max. :8.9279 Max. :6.4229 Max. :7.2011 Max. :7.5098
## ENSG00000203618 ENSG00000243466 ENSG00000105427 ENSG00000186458
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.09635 Median :0.00000
## Mean :0.6235 Mean :0.6009 Mean :0.62817 Mean :0.55108
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.34734 3rd Qu.:0.01939
## Max. :7.4821 Max. :6.4375 Max. :5.37288 Max. :6.37223
## ENSG00000281899 ENSG00000235590 ENSG00000125740 ENSG00000180818
## Min. : 0.2017 Min. :0.01404 Min. :0.00000 Min. :0.1822
## 1st Qu.: 1.2082 1st Qu.:0.34122 1st Qu.:0.04294 1st Qu.:0.5636
## Median : 2.1032 Median :0.46236 Median :0.06920 Median :2.0618
## Mean : 3.2158 Mean :0.58643 Mean :0.49287 Mean :2.7698
## 3rd Qu.: 4.0849 3rd Qu.:0.83383 3rd Qu.:0.38365 3rd Qu.:4.1641
## Max. :10.3186 Max. :1.39007 Max. :4.14229 Max. :9.1767
## ENSG00000244193 ENSG00000110848 ENSG00000081041 ENSG00000143816
## Min. :0.1126 Min. :0.00000 Min. : 0.4343 Min. :0.03081
## 1st Qu.:0.1698 1st Qu.:0.00000 1st Qu.: 1.5015 1st Qu.:0.31764
## Median :0.2284 Median :0.04472 Median : 3.0891 Median :0.41963
## Mean :1.8863 Mean :0.30633 Mean : 16.2635 Mean :0.50944
## 3rd Qu.:4.3612 3rd Qu.:0.18548 3rd Qu.: 10.9835 3rd Qu.:0.56658
## Max. :6.6017 Max. :2.60310 Max. :140.2944 Max. :1.20174
## ENSG00000136457 ENSG00000134184 ENSG00000163734 ENSG00000242288
## Min. : 0.1767 Min. :0.2122 Min. :0.05608 Min. :0.01604
## 1st Qu.: 0.4568 1st Qu.:0.2489 1st Qu.:0.07749 1st Qu.:0.20070
## Median : 0.7175 Median :0.3456 Median :0.13468 Median :0.47441
## Mean : 3.8902 Mean :3.1536 Mean :0.66453 Mean :0.66045
## 3rd Qu.: 1.3671 3rd Qu.:7.1168 3rd Qu.:0.22441 3rd Qu.:0.94741
## Max. :23.7187 Max. :8.9442 Max. :6.20963 Max. :2.08076
## ENSG00000108342 ENSG00000138135 ENSG00000107518 ENSG00000148677
## Min. : 0.0000 Min. : 0.00000 Min. :0.00000 Min. : 0.8035
## 1st Qu.: 0.1687 1st Qu.: 0.06229 1st Qu.:0.09542 1st Qu.: 3.2329
## Median : 0.3364 Median : 0.21163 Median :0.16875 Median : 4.4409
## Mean : 2.8969 Mean : 2.60864 Mean :0.27688 Mean : 28.7645
## 3rd Qu.: 0.5461 3rd Qu.: 0.33091 3rd Qu.:0.44621 3rd Qu.: 11.0251
## Max. :31.1028 Max. :29.18270 Max. :0.93217 Max. :216.5660
## ENSG00000205362 ENSG00000007908 ENSG00000255730 ENSG00000136244
## Min. : 0.0000 Min. : 0.00000 Min. : 0.00000 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.01814 1st Qu.: 0.0838
## Median : 0.4034 Median : 0.04708 Median : 0.03505 Median : 0.1816
## Mean : 13.3179 Mean : 2.81096 Mean : 1.58185 Mean : 3.7752
## 3rd Qu.: 0.6260 3rd Qu.: 0.40651 3rd Qu.: 0.05590 3rd Qu.: 0.2399
## Max. :155.6212 Max. :31.90556 Max. :18.62400 Max. :43.7147
## class
## healthy:5
## myoPain:7
##
##
##
##
Lets just make sure we have the same number in each class sample and only 2 classes say if a typo or extra space was added in making class factor vector.
genes40_DF %>% group_by(class) %>% count(class)
There are only 2 classes and same number in each class that can be inspected visually to see all healthy at top and all myoPain at bottom.
*** Medians of Fold change samples Healthy vs Myofascial Pain Top 40 Genes ***
Lets now make the Medians fold change ordered and get top 40 of the bottom 20 and top 20 genes.
FC_ordered_Medians <- foldchangeHealthyMyoMedians[order(foldchangeHealthyMyoMedians$foldchangeHealthyVsMyoMedians, decreasing=T),]
head(FC_ordered_Medians)
tail(FC_ordered_Medians)
write.csv(FC_ordered_Medians, 'Medians_healthyVsMyo_Foldchanges.csv', row.names=F)
You can get the huge file ordered greatest to least by Median fold change of healthy vs myofascial pain samples here.
sum(FC_ordered_Medians$foldchangeHealthyVsMyoMedians==0)
## [1] 32916
There are 32,916 zeros in the fold change by median values of healthy vs. myofascial pain. Lets remove those and take the top and bottom 20 genes to get a table of 40 genes as our top genes to work with and run machine learning algorithms on.
FC_MediansNoZeros <- subset(FC_ordered_Medians[FC_ordered_Medians$foldchangeHealthyVsMyoMedians>0,])
write.csv(FC_MediansNoZeros,'FCs_Medians_NoZeros.csv', row.names=F)
dim(FC_MediansNoZeros)
## [1] 25386 19
We removed those genes with zero fold change values from the medians of samples healthy vs myofascial pain to get a data frame of 25,386 genes. You can get the table of non zero value fold change by Median genes here
Lets now get those top 20 and bottom 20 genes from the list of genes.
top20bottom20MedianGenes <- FC_MediansNoZeros[c(1:20,25367:25386),]
summary(top20bottom20MedianGenes)
## Gene Healthy1 Healthy2 Healthy3
## Length:40 Min. : 0.000 Min. : 0.00000 Min. : 0.0000
## Class :character 1st Qu.: 0.000 1st Qu.: 0.05743 1st Qu.: 0.0000
## Mode :character Median : 0.000 Median : 0.50037 Median : 0.1177
## Mean : 2.766 Mean : 8.30262 Mean : 2.7103
## 3rd Qu.: 0.693 3rd Qu.: 2.27550 3rd Qu.: 0.3678
## Max. :72.463 Max. :216.56605 Max. :64.4520
## Healthy4 Healthy5 myo1 myo2
## Min. : 0.00000 Min. : 0.0000 Min. :8.795e-03 Min. : 0.00000
## 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.:3.585e-02 1st Qu.: 0.04465
## Median : 0.08485 Median : 0.1202 Median :6.316e+00 Median : 0.33300
## Mean : 1.85268 Mean : 4.3458 Mean :1.608e+01 Mean : 1.54796
## 3rd Qu.: 0.49532 3rd Qu.: 0.4465 3rd Qu.:1.984e+01 3rd Qu.: 1.08150
## Max. :39.58015 Max. :69.9603 Max. :1.967e+02 Max. :10.75277
## myo3 myo4 myo5 myo6
## Min. : 0.0000 Min. : 0.0000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.0946 Median : 0.1029 Median : 0.1156 Median : 0.0949
## Mean : 1.4023 Mean : 2.2102 Mean : 1.2225 Mean : 2.2560
## 3rd Qu.: 0.3863 3rd Qu.: 0.5392 3rd Qu.: 0.5261 3rd Qu.: 0.7664
## Max. :20.7971 Max. :41.7730 Max. :21.5034 Max. :47.1921
## myo7 healthy_Mean myo_Mean healthyMedian
## Min. : 0.0000 Min. : 0.0000 Min. : 0.008795 Min. : 0.00000
## 1st Qu.: 0.0424 1st Qu.: 0.0000 1st Qu.: 0.035852 1st Qu.: 0.00000
## Median : 0.3253 Median : 0.4684 Median : 4.316198 Median : 0.08199
## Mean : 1.9863 Mean : 5.4656 Mean : 9.134233 Mean : 3.56562
## 3rd Qu.: 1.6686 3rd Qu.: 1.2626 3rd Qu.:10.467134 3rd Qu.: 0.77035
## Max. :20.6234 Max. :110.5132 Max. :99.163787 Max. :69.46271
## myoMedian foldchangeHealthyVsMyo foldchangeHealthyVsMyoMedians
## Min. : 0.008795 Min. :0.000e+00 Min. :0.000e+00
## 1st Qu.: 0.035852 1st Qu.:0.000e+00 1st Qu.:0.000e+00
## Median : 4.316198 Median :5.000e+00 Median :2.464e+13
## Mean : 9.134233 Mean :6.653e+13 Mean :8.954e+13
## 3rd Qu.:10.467134 3rd Qu.:5.938e+13 3rd Qu.:1.047e+14
## Max. :99.163787 Max. :9.916e+14 Max. :9.916e+14
dim(top20bottom20MedianGenes)
## [1] 40 19
colnames(top20bottom20MedianGenes)
## [1] "Gene" "Healthy1"
## [3] "Healthy2" "Healthy3"
## [5] "Healthy4" "Healthy5"
## [7] "myo1" "myo2"
## [9] "myo3" "myo4"
## [11] "myo5" "myo6"
## [13] "myo7" "healthy_Mean"
## [15] "myo_Mean" "healthyMedian"
## [17] "myoMedian" "foldchangeHealthyVsMyo"
## [19] "foldchangeHealthyVsMyoMedians"
Lets make the data frame of the median fold change values top and bottom 20 genes.
geneNames2 <- top20bottom20MedianGenes$Gene
genes40_DF2 <- data.frame(t(top20bottom20MedianGenes[,c(2:13)]))
colnames(genes40_DF2) <- geneNames2
genes40_DF2$class <- c('healthy','healthy','healthy','healthy','healthy','myoPain','myoPain','myoPain','myoPain','myoPain','myoPain','myoPain')
head(genes40_DF2)
Change the class from character to factor to allow confusionMatrix function to work later in this program.
genes40_DF2$class <- as.factor(genes40_DF2$class)
summary(genes40_DF2)
## ENSG00000267727 ENSG00000178372 ENSG00000172867 ENSG00000268292
## Min. : 0.00000 Min. : 0.0000 Min. : 0.00000 Min. : 0.00
## 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.:10.41
## Median : 0.00000 Median : 0.0000 Median : 0.00000 Median :17.72
## Mean : 16.50957 Mean : 7.4553 Mean : 3.90308 Mean :18.63
## 3rd Qu.: 0.09357 3rd Qu.: 0.1931 3rd Qu.: 0.01573 3rd Qu.:22.45
## Max. :196.66379 Max. :79.0971 Max. :41.40356 Max. :47.19
## ENSG00000188508 ENSG00000178934 ENSG00000221932 ENSG00000167768
## Min. : 0.000 Min. : 0.00000 Min. : 0.00000 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.00000 1st Qu.: 0.06286 1st Qu.: 0.0000
## Median : 0.000 Median : 0.00000 Median : 0.18006 Median : 0.0000
## Mean : 3.192 Mean : 2.97886 Mean : 3.37609 Mean : 2.6479
## 3rd Qu.: 0.000 3rd Qu.: 0.04052 3rd Qu.: 0.89296 3rd Qu.: 0.3158
## Max. :30.425 Max. :26.20979 Max. :33.05159 Max. :24.8739
## ENSG00000088882 ENSG00000114771 ENSG00000170509 ENSG00000095970
## Min. : 0.00000 Min. : 0.000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.00000 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.07849 Median : 0.000 Median : 0.1921 Median : 0.0000
## Mean : 2.38692 Mean : 1.733 Mean : 1.8987 Mean : 1.5420
## 3rd Qu.: 0.24971 3rd Qu.: 0.000 3rd Qu.: 0.2760 3rd Qu.: 0.0117
## Max. :26.57517 Max. :19.873 Max. :19.8319 Max. :18.0198
## ENSG00000150594 ENSG00000215472 ENSG00000203782 ENSG00000189001
## Min. : 0.00000 Min. : 0.000 Min. : 0.00 Min. :0.000
## 1st Qu.: 0.01805 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.:0.000
## Median : 0.08601 Median : 0.000 Median : 0.00 Median :0.000
## Mean : 1.57281 Mean : 2.637 Mean : 1.25 Mean :1.088
## 3rd Qu.: 0.21670 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.:0.000
## Max. :16.39539 Max. :16.261 Max. :12.68 Max. :9.905
## ENSG00000261011 ENSG00000254418 ENSG00000160883 ENSG00000271367
## Min. : 0.0000 Min. : 0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.07218 1st Qu.:0.0000
## Median : 0.1714 Median : 0.0000 Median :0.15182 Median :0.0000
## Mean : 1.3674 Mean : 1.1150 Mean :1.04723 Mean :0.8291
## 3rd Qu.: 0.5330 3rd Qu.: 0.1554 3rd Qu.:0.35011 3rd Qu.:0.1984
## Max. :11.3877 Max. :10.7022 Max. :9.61607 Max. :8.9279
## ENSG00000223343 ENSG00000128242 ENSG00000284686 ENSG00000281899
## Min. : 0.0000 Min. :0.009605 Min. :0.00000 Min. : 0.2017
## 1st Qu.: 0.0000 1st Qu.:0.047588 1st Qu.:0.01057 1st Qu.: 1.2082
## Median : 0.2482 Median :0.079693 Median :0.09984 Median : 2.1032
## Mean : 4.2921 Mean :0.072940 Mean :0.23689 Mean : 3.2158
## 3rd Qu.: 8.9518 3rd Qu.:0.087743 3rd Qu.:0.44291 3rd Qu.: 4.0849
## Max. :13.2922 Max. :0.163985 Max. :0.75737 Max. :10.3186
## ENSG00000244193 ENSG00000250978 ENSG00000158445 ENSG00000187634
## Min. :0.1126 Min. : 3.108 Min. :0.03338 Min. :0.01168
## 1st Qu.:0.1698 1st Qu.: 3.843 1st Qu.:0.32956 1st Qu.:0.03209
## Median :0.2284 Median : 9.829 Median :0.52626 Median :0.10846
## Mean :1.8863 Mean :26.484 Mean :0.57346 Mean :0.51485
## 3rd Qu.:4.3612 3rd Qu.:45.798 3rd Qu.:0.84399 3rd Qu.:1.10092
## Max. :6.6017 Max. :72.463 Max. :1.17963 Max. :1.43221
## ENSG00000166676 ENSG00000148677 ENSG00000008056 ENSG00000182324
## Min. :0.01007 Min. : 0.8035 Min. :0.01939 Min. :0.008795
## 1st Qu.:0.05938 1st Qu.: 3.2329 1st Qu.:0.18026 1st Qu.:0.060531
## Median :0.19850 Median : 4.4409 Median :0.28265 Median :0.127856
## Mean :0.15357 Mean : 28.7645 Mean :0.30978 Mean :0.172201
## 3rd Qu.:0.21901 3rd Qu.: 11.0251 3rd Qu.:0.39815 3rd Qu.:0.217277
## Max. :0.24314 Max. :216.5660 Max. :0.86260 Max. :0.611513
## ENSG00000134184 ENSG00000180818 ENSG00000224114 ENSG00000273000
## Min. :0.2122 Min. :0.1822 Min. : 0.0000 Min. :0.00000
## 1st Qu.:0.2489 1st Qu.:0.5636 1st Qu.: 0.2082 1st Qu.:0.00000
## Median :0.3456 Median :2.0618 Median : 0.7615 Median :0.06005
## Mean :3.1536 Mean :2.7698 Mean : 3.6193 Mean :0.21045
## 3rd Qu.:7.1168 3rd Qu.:4.1641 3rd Qu.: 6.3297 3rd Qu.:0.34091
## Max. :8.9442 Max. :9.1767 Max. :11.8816 Max. :1.15551
## ENSG00000235590 ENSG00000143816 ENSG00000242288 ENSG00000107518
## Min. :0.01404 Min. :0.03081 Min. :0.01604 Min. :0.00000
## 1st Qu.:0.34122 1st Qu.:0.31764 1st Qu.:0.20070 1st Qu.:0.09542
## Median :0.46236 Median :0.41963 Median :0.47441 Median :0.16875
## Mean :0.58643 Mean :0.50944 Mean :0.66045 Mean :0.27688
## 3rd Qu.:0.83383 3rd Qu.:0.56658 3rd Qu.:0.94741 3rd Qu.:0.44621
## Max. :1.39007 Max. :1.20174 Max. :2.08076 Max. :0.93217
## class
## healthy:5
## myoPain:7
##
##
##
##
Lets just make sure we have the same number in each class sample and only 2 classes say if a typo or extra space was added in making class factor vector.
genes40_DF2 %>% group_by(class) %>% count(class)
*** Compare the top 40 genes of fold change in Means Vs Medians of samples in Healthy Vs. Myofascial Pain ***
This is the 2nd part to the fibromyalgia markdown in R project and we added the medians to compare to the top genes by medians to weed out the outliers skewing results. But kept the means data in this file to compare with the medians results for top genes.
means40genes <- colnames(genes40_DF)[1:40]
medians40genes <- colnames(genes40_DF2)[1:40]
means40genesOrdered <- means40genes[order(means40genes, decreasing=T)]
means40genesOrdered
## [1] "ENSG00000284627" "ENSG00000281899" "ENSG00000271367" "ENSG00000267727"
## [5] "ENSG00000267368" "ENSG00000255730" "ENSG00000248871" "ENSG00000244193"
## [9] "ENSG00000243466" "ENSG00000242288" "ENSG00000235590" "ENSG00000205362"
## [13] "ENSG00000204421" "ENSG00000203782" "ENSG00000203618" "ENSG00000189001"
## [17] "ENSG00000188508" "ENSG00000186458" "ENSG00000180818" "ENSG00000178934"
## [21] "ENSG00000178372" "ENSG00000172867" "ENSG00000167768" "ENSG00000163734"
## [25] "ENSG00000160883" "ENSG00000148677" "ENSG00000143816" "ENSG00000138135"
## [29] "ENSG00000136457" "ENSG00000136244" "ENSG00000134184" "ENSG00000125740"
## [33] "ENSG00000114771" "ENSG00000110848" "ENSG00000108342" "ENSG00000107518"
## [37] "ENSG00000105427" "ENSG00000095970" "ENSG00000081041" "ENSG00000007908"
medians40genesOrdered <- medians40genes[order(medians40genes, decreasing=T)]
medians40genesOrdered
## [1] "ENSG00000284686" "ENSG00000281899" "ENSG00000273000" "ENSG00000271367"
## [5] "ENSG00000268292" "ENSG00000267727" "ENSG00000261011" "ENSG00000254418"
## [9] "ENSG00000250978" "ENSG00000244193" "ENSG00000242288" "ENSG00000235590"
## [13] "ENSG00000224114" "ENSG00000223343" "ENSG00000221932" "ENSG00000215472"
## [17] "ENSG00000203782" "ENSG00000189001" "ENSG00000188508" "ENSG00000187634"
## [21] "ENSG00000182324" "ENSG00000180818" "ENSG00000178934" "ENSG00000178372"
## [25] "ENSG00000172867" "ENSG00000170509" "ENSG00000167768" "ENSG00000166676"
## [29] "ENSG00000160883" "ENSG00000158445" "ENSG00000150594" "ENSG00000148677"
## [33] "ENSG00000143816" "ENSG00000134184" "ENSG00000128242" "ENSG00000114771"
## [37] "ENSG00000107518" "ENSG00000095970" "ENSG00000088882" "ENSG00000008056"
meansMediansSame <- subset(means40genes, means40genes==medians40genes)
meansMediansSame
## [1] "ENSG00000267727" "ENSG00000178372" "ENSG00000172867" "ENSG00000244193"
There are only 4 genes in the top 40 genes of the means of samples healthy vs. myofascial pain that are the same as the genes top 40 of medians of those samples. Lets get these genes in their own file data frame to get summaries of and continue to run machine learning on the means and then the medians of the top 40 genes that are actually the top 20 over expressed and top 20 under expressed genes in a fold change comparison of myofascial pain to healthy by mean of samples and then median of samples.
common4genes <- top20bottom20MedianGenes[top20bottom20MedianGenes$Gene %in% meansMediansSame,]
write.csv(common4genes, 'genes4CommonMediansMeansFCs.csv', row.names=F)
common4genes
You can get the file that we just wrote out here. I will add the alternate Genecards symbol and summaries with summaries of 3 main summaries paraphrased to generalize what the gene does or its associations in the human body.
We are going to do machine learning on the means data frame of top 40 genes with 3 runs for bootstrap with 50, 100, and 1000 samples. Then we will do the same computations on the data of median genes to see which classifies the best the realistic unskewed data (medians) or the means data that doesn’t assume proportions are incrementally the same and outliers are allowed.
*** Machine Learning on Means of Samples Fold Change top 40 genes ***
Now we can start using some modeling algorithms in the caret package we installed earlier. Lets use boot strapping in this model instead of cross validation as we have to simulate randomness and more samples from independently and identically distributing with replacement. We will use a classifier for ensemble as well as linear modeling.
control_random <- trainControl(method = "boot", number = 50)
metric = "Accuracy"
set.seed(12345)
intrain <- sample(1:12, .8*12)
training <- genes40_DF[intrain,]
testing <- genes40_DF[-intrain,]
training %>% group_by(class) %>% count(class)
testing %>% group_by(class) %>% count(class)
set.seed(567)
fitRF <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF <- predict(fitRF, testing)
RF_df <- data.frame(predictRF, type=testing$class)
RF_df
We can see above the model predicted all the myoPain class and only half the healthy class correctly with 2/3 of data correctly predicted.
importance <- varImp(fitRF)
importance
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ENSG00000138135 100.000
## ENSG00000108342 92.336
## ENSG00000180818 91.058
## ENSG00000267368 86.496
## ENSG00000081041 84.854
## ENSG00000136457 49.088
## ENSG00000235590 38.139
## ENSG00000163734 20.803
## ENSG00000007908 20.073
## ENSG00000143816 19.526
## ENSG00000148677 18.978
## ENSG00000136244 18.248
## ENSG00000134184 16.241
## ENSG00000110848 16.241
## ENSG00000255730 13.869
## ENSG00000205362 12.774
## ENSG00000281899 10.584
## ENSG00000125740 10.219
## ENSG00000244193 8.759
## ENSG00000107518 6.569
plot(importance)
lets compare the values in our top gene above that is ENSG00000081041.
We can see with our small sample data of 12 samples that there is one
sample that is an outlier and much higher than the rest of the
samples.
topGene <- genes40_DF[order(genes40_DF$ENSG00000081041,decreasing =T),]
topGene$ENSG00000081041
## [1] 140.2943774 15.3743775 13.3465208 10.1957661 4.1482004 3.6057617
## [7] 2.5723394 1.7667014 1.6710862 0.9928994 0.7591185 0.4343249
Most all samples are under 20 except one and its in the healthy sample set. Maybe we should use the median values as those are realistic to the true population rather than outliers that can skew the data.That could be another R markdown project to do later.
confusionMatrix(predictRF, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction healthy myoPain
## healthy 0 0
## myoPain 2 1
##
## Accuracy : 0.3333
## 95% CI : (0.0084, 0.9057)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.9630
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.3333
## Prevalence : 0.6667
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : healthy
##
Our model using bootstrap in random forest an ensemble classifier and regression algorithm scored pretty well in the 1st part of this project at 67% accuracy. There were limited samples here but bootstrapping helps to get a natural set of samples using the Central limit Theorem and other principles from econometrics. But here only 1 of the 3 samples was corretly identified with an accuracy of 33% which is not very good at all.
Lets quickly test how increasing to 100 bootstrap samples will change the metric results.
control_random <- trainControl(method = "boot", number = 100)
metric = "Accuracy"
set.seed(567)
fitRF2 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF2 <- predict(fitRF2, testing)
RF_df2 <- data.frame(predictRF2, type=testing$class)
RF_df2
importance2 <- varImp(fitRF2)
importance2
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ENSG00000108342 100.000
## ENSG00000081041 85.236
## ENSG00000138135 78.691
## ENSG00000180818 75.951
## ENSG00000267368 63.927
## ENSG00000136457 44.140
## ENSG00000235590 31.507
## ENSG00000143816 25.266
## ENSG00000148677 18.113
## ENSG00000134184 17.504
## ENSG00000163734 14.003
## ENSG00000255730 12.024
## ENSG00000136244 11.720
## ENSG00000110848 11.263
## ENSG00000125740 7.915
## ENSG00000244193 6.697
## ENSG00000205362 5.936
## ENSG00000007908 5.936
## ENSG00000248871 3.044
## ENSG00000107518 2.740
plot(importance2)
confusionMatrix(predictRF2, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction healthy myoPain
## healthy 1 0
## myoPain 1 1
##
## Accuracy : 0.6667
## 95% CI : (0.0943, 0.9916)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.7407
##
## Kappa : 0.4
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.5000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.5000
## Prevalence : 0.6667
## Detection Rate : 0.3333
## Detection Prevalence : 0.3333
## Balanced Accuracy : 0.7500
##
## 'Positive' Class : healthy
##
The results were the same using 100 bootstrapped samples vs 50 as the first part of this project’s separate earlier posting of 67%, and better here on 2nd run with 100 bootstrapped samples rather than 50 as last run done. Lets try 1000 bootstrap samples and see if better.
control_random <- trainControl(method = "boot", number = 1000)
metric = "Accuracy"
set.seed(567)
fitRF3 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF3 <- predict(fitRF3, testing)
RF_df3 <- data.frame(predictRF3, type=testing$class)
RF_df3
importance3 <- varImp(fitRF3)
importance3
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ENSG00000138135 100.000
## ENSG00000267368 80.899
## ENSG00000081041 76.124
## ENSG00000108342 61.096
## ENSG00000180818 59.831
## ENSG00000136457 32.163
## ENSG00000235590 30.899
## ENSG00000148677 21.208
## ENSG00000007908 15.730
## ENSG00000134184 14.045
## ENSG00000143816 13.062
## ENSG00000110848 11.798
## ENSG00000255730 9.270
## ENSG00000163734 8.567
## ENSG00000136244 7.725
## ENSG00000205362 7.022
## ENSG00000281899 4.073
## ENSG00000172867 3.933
## ENSG00000107518 2.949
## ENSG00000105427 2.669
plot(importance3)
confusionMatrix(predictRF3, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction healthy myoPain
## healthy 0 0
## myoPain 2 1
##
## Accuracy : 0.3333
## 95% CI : (0.0084, 0.9057)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.9630
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.3333
## Prevalence : 0.6667
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : healthy
##
The results with 1000 bootstrapped samples was poor on this 3rd run at 33% the same as the 1st run of 50 bootstrapped samples. Lets look again at the data frame of predicted results for the 50, 100, and then 1000 bootstrapped samples side by side to the test classification that was supposed to be identical to or predicted correctly.
The 50 bootstrapped samples below.
RF_df
The 100 bootstrapped samples below.
RF_df2
The 1000 bootstrapped samples below
RF_df3
In this version the 3rd sample is correctly identified as myoPain in all runs but only the 100 bootstrapped run correctly identified one of the healthy classes that was the first test observation.
Lets get those overall genes from the importance of variables output and write it to csv to add the genomic data to the top genes.
DF <- data.frame(importance[1])
DF
Order in descending order.
DF$Gene <- row.names(DF)
DataTop5 <- DF[,c(2,1)]
row.names(DataTop5) <- NULL
DataTop5genes <- DataTop5[order(DataTop5$Overall, decreasing=T),]
DataTop5genes
GenesTop5 <- DataTop5genes[1:5,]
write.csv(GenesTop5, "top5GenesMyalgia_part2.csv", row.names=F)
GenesTop5
Get the file we just wrote for part2 of means of samples top 5 genes here the alternate genecards names, summaries, and my own worded summary based on those summaries by genecards, NCBI, and UniProt will be added later.
The top5 genes of part 1 by means of samples is here and each one had an alternate Genecard name so that as well as the genecards.org, NCBI, and UniProt summaries as well as my own paraphrased summary of the gene based on those 3 summaries are added to this file.
Lets see if cross validation in the parameter tuning for random forest in the caret package of R has similar results in classification
control_random <- trainControl(method = "cv", number = 100)
metric = "Accuracy"
set.seed(567)
fitRF4 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF4 <- predict(fitRF4, testing)
RF_df4 <- data.frame(predictRF4, type=testing$class)
RF_df4
importance4 <- varImp(fitRF4)
importance4
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ENSG00000108342 100.00
## ENSG00000180818 97.13
## ENSG00000267368 93.22
## ENSG00000138135 84.87
## ENSG00000081041 70.93
## ENSG00000136457 55.24
## ENSG00000148677 54.27
## ENSG00000235590 50.92
## ENSG00000007908 43.90
## ENSG00000107518 43.32
## ENSG00000143816 39.55
## ENSG00000163734 38.46
## ENSG00000255730 36.22
## ENSG00000136244 32.42
## ENSG00000110848 30.19
## ENSG00000281899 26.41
## ENSG00000134184 26.07
## ENSG00000125740 24.65
## ENSG00000244193 24.41
## ENSG00000267727 24.08
plot(importance4)
confusionMatrix(predictRF4, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction healthy myoPain
## healthy 0 0
## myoPain 2 1
##
## Accuracy : 0.3333
## 95% CI : (0.0084, 0.9057)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.9630
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.3333
## Prevalence : 0.6667
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : healthy
##
Using cross validation with 100 folds, the same number of bootstrap samples in parameters of random forest classifier, returned poor results here at 33% by predicting the only healthy class the same as all the runs of the bootstrapped means of samples in earlier three runs. Good to compare. Will see what is output in published version to compare as well.
So, there were problems in knitr not running the same resulting outputs and classifying much better in the knitr html and pdf version and it will probably be the same for this run when published, but just know that the chunk run outputs have a different cache to empty and clear that could be related to the latex software using that runs its own memory store and it was not corrected because not solved and will likely produce outputs not the same as in markdown runs inside Rstudio.
*** Machine running on the medians of the top 40 genes in healthy vs myofascial pain samples using 50, 100, and 1000 bootstrapped samples of random forest classifier in R ***
Now we run the same algorithms with the medians data sets.
Bootstrapping 50 samples in parameters of random forest classifier with the medians’ top 40 genes.
control_random <- trainControl(method = "boot", number = 50)
metric = "Accuracy"
set.seed(12345)
intrain <- sample(1:12, .8*12)
training <- genes40_DF2[intrain,]
testing <- genes40_DF2[-intrain,]
training %>% group_by(class) %>% count(class)
testing %>% group_by(class) %>% count(class)
set.seed(567)
fitMedians1 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRFmedians <- predict(fitMedians1, testing)
RF_df_medians <- data.frame(predictRFmedians, type=testing$class)
RF_df_medians
We can see above the model predicted all the myopain and healthy classes which were 1 and 2 respectively with 100% accuracy using the medians of the samples.
importanceMedians <- varImp(fitMedians1)
importanceMedians
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ENSG00000180818 100.000
## ENSG00000250978 91.475
## ENSG00000268292 79.637
## ENSG00000235590 49.134
## ENSG00000166676 38.204
## ENSG00000148677 34.384
## ENSG00000128242 31.532
## ENSG00000134184 28.788
## ENSG00000143816 23.757
## ENSG00000284686 16.381
## ENSG00000244193 7.130
## ENSG00000008056 6.403
## ENSG00000281899 5.381
## ENSG00000187634 4.843
## ENSG00000221932 4.305
## ENSG00000273000 3.313
## ENSG00000224114 3.094
## ENSG00000172867 3.010
## ENSG00000160883 2.421
## ENSG00000107518 2.287
plot(importanceMedians)
lets compare the values in our top gene above that is
ENSG00000180818.
topGene <- genes40_DF2[order(genes40_DF2$ENSG00000180818,decreasing =T),]
topGene$ENSG00000180818
## [1] 9.1766551 5.5757884 5.3240499 3.7774643 2.9742342 2.9142322 1.2094433
## [8] 0.7935440 0.6286581 0.3682484 0.3130012 0.1822497
These median values above are realistic to the true population rather than outliers that can skew the data. That is a tangent that came out earlier in the first part and what we explore now in this 2nd part of our machine learning of top genes in healthy vs myofascial pain samples. It appears the values are between 0.18 and 9.17 for the top gene produced on medians of the samples in bootstrapped 50 sample random forest modeling parameter.
confusionMatrix(predictRFmedians, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction healthy myoPain
## healthy 2 0
## myoPain 0 1
##
## Accuracy : 1
## 95% CI : (0.2924, 1)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.2963
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6667
## Detection Rate : 0.6667
## Detection Prevalence : 0.6667
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : healthy
##
In Rstudio the results are 100% accuracy and same for sensitivity and specificity. Two healthy samples were correctly identified and the one myopain sample was correctly identified.
Lets see if there is consistency with 100 and 1000 bootstrapped samples on the medians top 40 genes of healthy vs myofascial pain samples.
Bootstrap of 100 samples in Random Forest classifier on the top 40 by medians of the fold change comparison of myofascial pain vs. healthy samples.
control_random <- trainControl(method = "boot", number = 100)
metric = "Accuracy"
set.seed(12345)
intrain <- sample(1:12, .8*12)
training <- genes40_DF2[intrain,]
testing <- genes40_DF2[-intrain,]
training %>% group_by(class) %>% count(class)
testing %>% group_by(class) %>% count(class)
set.seed(567)
fitMedians2 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRFmedians2 <- predict(fitMedians2, testing)
RF_df_medians2 <- data.frame(predictRFmedians2, type=testing$class)
RF_df_medians2
We can see above the model predicted only the healthy classes correctly and not the last class of myoPain.
importanceMedians2 <- varImp(fitMedians2)
importanceMedians2
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ENSG00000180818 100.000
## ENSG00000268292 88.257
## ENSG00000250978 81.335
## ENSG00000235590 44.005
## ENSG00000166676 37.577
## ENSG00000148677 28.677
## ENSG00000128242 26.452
## ENSG00000134184 17.923
## ENSG00000143816 16.934
## ENSG00000284686 12.608
## ENSG00000221932 8.405
## ENSG00000244193 7.664
## ENSG00000172867 4.944
## ENSG00000187634 4.574
## ENSG00000224114 3.585
## ENSG00000182324 3.585
## ENSG00000281899 3.461
## ENSG00000267727 2.472
## ENSG00000008056 2.472
## ENSG00000242288 1.978
plot(importanceMedians2)
lets compare the values in our top gene above that is ENSG00000180818
that is the same top gene in earlier run of 50 bootstrap samples.
topGene <- genes40_DF2[order(genes40_DF2$ENSG00000180818,decreasing =T),]
topGene$ENSG00000180818
## [1] 9.1766551 5.5757884 5.3240499 3.7774643 2.9742342 2.9142322 1.2094433
## [8] 0.7935440 0.6286581 0.3682484 0.3130012 0.1822497
These median values above are realistic to the true population rather than outliers that can skew the data.
confusionMatrix(predictRFmedians2, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction healthy myoPain
## healthy 2 1
## myoPain 0 0
##
## Accuracy : 0.6667
## 95% CI : (0.0943, 0.9916)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.7407
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.6667
## Neg Pred Value : NaN
## Prevalence : 0.6667
## Detection Rate : 0.6667
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : healthy
##
Seems to be inconsistent, not the same accuracy of excellence in using more bootstrapped samples of 100 compared to 50. Lets see if even more samples of 1000 bootstrapped samples will perform well.
Bootstrap of 1000 samples in Random Forest classifier on the top 40 by medians of the fold change comparison of myofascial pain vs. healthy samples.
control_random <- trainControl(method = "boot", number = 1000)
metric = "Accuracy"
set.seed(12345)
intrain <- sample(1:12, .8*12)
training <- genes40_DF2[intrain,]
testing <- genes40_DF2[-intrain,]
training %>% group_by(class) %>% count(class)
testing %>% group_by(class) %>% count(class)
set.seed(567)
fitMedians3 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRFmedians3 <- predict(fitMedians3, testing)
RF_df_medians3 <- data.frame(predictRFmedians3, type=testing$class)
RF_df_medians3
We can see above the model predicted all the myoPain incorrectly and healthy classes correctly, which were 1 and 2 respectively with 100% accuracy for healthy classes but 0% accuracy for myopain class using the medians of the samples and 1000 bootstrap samples.There is a reason for this, in using bootstrap of samples the majority of my econometric courses said having 40 samples is needed to get more realistic results in bootstrapping a large population out of a small population using the Central Limit Theorem and identically and independently distributing the data with replacement. This could also be a case of garbage in and garbage out as data scientists say. But many algorithms to use to see if a better result can be found in K-nearest neighbors, support vector machines in radial or concentric levels or linearly separable data, decision trees of rcart in caret for boosting models using learning rates, where bagging its bootstrap aggregating, k-means, bayesian classification models, linear discriminate analysis, and maybe a few more.
importanceMedians3 <- varImp(fitMedians3)
importanceMedians3
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ENSG00000180818 100.000
## ENSG00000268292 99.099
## ENSG00000250978 80.566
## ENSG00000235590 45.946
## ENSG00000148677 32.046
## ENSG00000134184 31.145
## ENSG00000128242 23.166
## ENSG00000143816 22.136
## ENSG00000166676 19.562
## ENSG00000284686 14.414
## ENSG00000221932 8.623
## ENSG00000244193 8.366
## ENSG00000281899 6.821
## ENSG00000224114 5.920
## ENSG00000187634 5.148
## ENSG00000160883 3.089
## ENSG00000107518 2.574
## ENSG00000267727 2.574
## ENSG00000223343 2.445
## ENSG00000170509 2.445
plot(importanceMedians3)
lets compare the values in our top gene above that is ENSG00000180818
that is the same top gene in earlier run of 50 and 100 bootstrap
samples.
topGene <- genes40_DF2[order(genes40_DF2$ENSG00000180818,decreasing =T),]
topGene$ENSG00000180818
## [1] 9.1766551 5.5757884 5.3240499 3.7774643 2.9742342 2.9142322 1.2094433
## [8] 0.7935440 0.6286581 0.3682484 0.3130012 0.1822497
These median values above are realistic to the true population rather than outliers that can skew the data.
confusionMatrix(predictRFmedians3, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction healthy myoPain
## healthy 2 1
## myoPain 0 0
##
## Accuracy : 0.6667
## 95% CI : (0.0943, 0.9916)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.7407
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.6667
## Neg Pred Value : NaN
## Prevalence : 0.6667
## Detection Rate : 0.6667
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : healthy
##
Seems that with 1000 bootstrap samples the top gene is the same but the last class was incorrectly identified as healthy when it was myofascial pain. The accuracy in Rstudio output with these parameters was 67% with 2 healthy classes correctly identified but one false positive of healthy class identified making the sensitivity 100% but the specificity 0% for not correctly identifying any of the myofascial classes. There was only 1 sample of myofascial and 2 of healthy so not the best population to form samples hence why bootstrap was done to simulate more randomness in a larger population with more samples.
Lets write out the top 5 features of our top model in the median top gene fold change samples of healthy vs myofascial pain.
importance50bags <- data.frame(importanceMedians[1])
genesImportant <- row.names(importance50bags)
importance50bags$gene <- genesImportant
importance50baggMedians <- importance50bags[order(importance50bags$Overall, decreasing=T),]
importance50baggMedians1 <- importance50baggMedians[,c(2,1)]
importance50baggMedians1
write.csv(importance50baggMedians1[1:5,], 'mediansImportanceFeatures50baggingTop5.csv',row.names=F)
You can find the above top 5 genes of importance in our best performing model of the median fold change values of healthy to myofascial pain here.
Now lets see how the 100 folds of cross validation parameter will do in the median samples compared to the bootstrapped samples.
control_random <- trainControl(method = "cv", number = 100)
metric = "Accuracy"
set.seed(12345)
intrain <- sample(1:12, .8*12)
training <- genes40_DF2[intrain,]
testing <- genes40_DF2[-intrain,]
training %>% group_by(class) %>% count(class)
testing %>% group_by(class) %>% count(class)
set.seed(567)
fitMedians4 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRFmedians4 <- predict(fitMedians4, testing)
RF_df_medians4 <- data.frame(predictRFmedians4, type=testing$class)
RF_df_medians4
Using 100 folds of cross validation parameter instead of bootstrapping in Rstudio gave an output of 100% accuracy with all healthy and myofascial pain classes correctly identified.
importanceMedians4 <- varImp(fitMedians4)
importanceMedians4
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ENSG00000250978 100.000
## ENSG00000268292 85.807
## ENSG00000180818 82.284
## ENSG00000235590 43.153
## ENSG00000166676 37.416
## ENSG00000128242 31.554
## ENSG00000148677 26.269
## ENSG00000134184 23.073
## ENSG00000143816 18.334
## ENSG00000224114 10.900
## ENSG00000284686 6.909
## ENSG00000242288 5.987
## ENSG00000244193 5.912
## ENSG00000221932 4.615
## ENSG00000223343 4.116
## ENSG00000281899 3.866
## ENSG00000160883 3.866
## ENSG00000008056 3.492
## ENSG00000107518 3.417
## ENSG00000182324 2.744
plot(importanceMedians4)
Here the top gene is not ending in 180818 but instead is ENSG00000250978 and lets look at the values in this gene per sample.
topGene <- genes40_DF2[order(genes40_DF2$ENSG00000250978,decreasing =T),]
topGene$ENSG00000250978
## [1] 72.462711 65.868877 64.451998 39.580146 36.070355 13.645273 6.011879
## [8] 5.645763 3.889121 3.704525 3.368245 3.108345
The range is 3.1 to 72.46 with the median 13.64 between a higher value of 36.07 and lower value of 6.01.
plot(topGene$ENSG00000250978)
The healthy samples by index are the first 5 samples and the latter 7
samples are the myofascial pain samples. This gene has higher gene
expression values for the healthy class and lower expression values for
the myofascial pain class. It is the top feature of importance in
classifying a healthy sample from a myofascial pain sample using caret’s
random forest set to parameter of 100 folds of cross validation with
Accuracy the metric.
confusionMatrix(predictRFmedians4, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction healthy myoPain
## healthy 2 0
## myoPain 0 1
##
## Accuracy : 1
## 95% CI : (0.2924, 1)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.2963
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6667
## Detection Rate : 0.6667
## Detection Prevalence : 0.6667
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : healthy
##
Seems like these genes could be useful in determining prediction classifiers for healthy and myofascial pain so lets save these top 5 genes to a file and get gene summaries for later.
samples of healthy vs myofascial pain.
importance100cv <- data.frame(importanceMedians4[1])
genesImportant_cv <- row.names(importance100cv)
importance100cv$gene <- genesImportant_cv
importance100cv_Medians <- importance100cv[order(importance100cv$Overall, decreasing=T),]
importance100cv_Medians1 <- importance100cv_Medians[,c(2,1)]
importance100cv_Medians1
write.csv(importance100cv_Medians1[1:5,], 'mediansImportanceFeatures100cvTop5.csv',row.names=F)
You can find the file here, that we just uploaded of top 5 genes with 100% accuracy in 100 folds of cross validation in predicting healthy or myofascial pain classes in our samples by top genes of fold change values of the medians of healthy vs medians of myofascial pain samples.
Thanks for viewing this document. Still some things to do as we found 100% accurate models in tuning parameters in 50 baggs of fold change means, 100 folds of cross validation and 100 baggs of the medians of our fold change values of healthy vs. myofascial pain in Rstudio. However, when this gets ran through knitr to publish and upload as html to rpubs it may have completely different results. Feel free to grab the google sheets data and explore for yourself with the same set.seed settings and parameters and files to see what you find.
Summaries have to be added to the genes of top 5 from best models as well to get a general idea of the features most important in separating healthy from myofascial pain samples of real people.
I have updated summaries on New Year’s Eve 2025. This part 2 of project was published with part 1 date but was done on 12-30-2025. Today the 1st of January in the year 2026 I am going to add to this file a review of the summaries of the top performing models’ top 5 genes.
Our top performing models were the means of the myofascial pain in ratio to means of healthy sample to get fold changes per gene per this ratio. We used 50 bootstrap aggregated samples in the model of random forest in caret. The 2nd model was the medians of the sample fold change per gene top 5 using 50 bootstap aggregates in the random forest model. And the third was the cross validation instead of bootstrapped samples with 100 folds.
Lets read in those models that the summaries have been added to.
means50baggsRF_top5 <- read.csv('top5GenesMyalgia_part2_summariesAdded_means50Baggs.csv', header=T, na.strings=c('',' ','na','NA'))
medians50baggsRF_top5 <- read.csv('mediansImportanceFeatures50baggingTop5_summariesAdded.csv', header=T, na.strings=c('',' ','na','NA'))
medians100foldsCV_RF_top5 <- read.csv('mediansImportanceFeatures100cvTop5_summariesAdded.csv', header=T, na.strings=c('',' ','na','NA'))
Lets add a model identification feature to each model so we can see side by side importance of the top 5 features by which model selected it.
means50baggsRF_top5$model <- as.factor('means50baggsRF')
medians50baggsRF_top5$model <- as.factor('medians50baggsRF')
medians100foldsCV_RF_top5$model <- as.factor('medians100foldsRF')
colnames3 <- colnames(means50baggsRF_top5)
colnames(medians50baggsRF_top5) <- colnames3
colnames(medians100foldsCV_RF_top5) <- colnames3
allDF3models <- rbind(means50baggsRF_top5,medians50baggsRF_top5,medians100foldsCV_RF_top5)
allDF3models
means50RF_baggs <- means50baggsRF_top5$Gene
medians50RF_baggs <- medians50baggsRF_top5$Gene
medians100RF_cv <- medians100foldsCV_RF_top5$Gene
allGenes <- c(means50RF_baggs, medians50RF_baggs, medians100RF_cv)
allGenes
## [1] "ENSG00000138135" "ENSG00000108342" "ENSG00000180818" "ENSG00000267368"
## [5] "ENSG00000081041" "ENSG00000180818" "ENSG00000250978" "ENSG00000268292"
## [9] "ENSG00000235590" "ENSG00000166676" "ENSG00000250978" "ENSG00000268292"
## [13] "ENSG00000180818" "ENSG00000235590" "ENSG00000166676"
Lets see these 15 ensemble genes by their alternate name if they have it, not all had a genecards name and so the ensemble or other name from Hugo Gene Nomenclature Committee or HGNC was used if available.
altGeneNames <- allDF3models$GeneCardsName
altGeneNames
## [1] "CH25H" "CSF3" "HOXC10" "UPK3BL1"
## [5] "CXCL2" "HOXC10" "LOC101928819" "ENSG00000268292"
## [9] "GNAS-AS1" "TVP23A" "LOC101928819" "ENSG00000268292"
## [13] "HOXC10" "GNAS-AS1" "TVP23A"
The above list of 15 genes is a list of all 15 top features of the 3 models top 5 importance in classifying a healthy or fibromyalgia class. There are repeats or duplicates. Lets see those duplicate repeats.
shared <- duplicated(altGeneNames)
totalShared <- altGeneNames[shared]
totalShared
## [1] "HOXC10" "LOC101928819" "ENSG00000268292" "HOXC10"
## [5] "GNAS-AS1" "TVP23A"
The genes that are seen in multiple models as top features are the HOXC10, LOC101928819, ENSG00000268292, GNAS-AS1, and TVP23A genes. The HOXC10 gene is duplicated in the list of totalShared.
Lets see this dataframe of the shared top genes and only make one HOXC10 in the list by removing the first instance of it in the list.
SharedGenesDF <- allDF3models[allDF3models$GeneCardsName %in% totalShared[2:6],]
SharedGenesDF
The common unique genes to all 3 models’ top 5 important features total five genes of: HOXC10, LOC101928819, ENSG00000268292, GNAS-AS1, and TVP23A.
These genes are transcription factor regulator gene, three long non-coding RNA genes that maintain DNA chromatin wrapped integrity and in transcription of DNA into RNA. Another gene is a membrane protein of the organelle in the cell cytoplasm for the golgi apparatus that helps package and transport products to other cell organelles of endosomes and lysosomes for product and toxin recycling or destroying and to the plasma membrane for transportation outside the cell for body to use. This makes sense since this is gene expression with RNA data that is inside the nucleus and needs trasportation of protein products outside the cell nucleus and cytoplasm. We didn’t really find any remarkable genes other than transcription regulators in the cell that make proteins. Seems like fibromyalgia sufferers have chronic pain from nuclear cell processes that are persisting beyond injury and are chronically inflamed systematically.
The following is supposed to isolate those genes that are not shared but it brings back all genes and their duplicates from original 15 genes.
totalNotShared <- altGeneNames[!(altGeneNames %in% shared)]
totalNotShared
## [1] "CH25H" "CSF3" "HOXC10" "UPK3BL1"
## [5] "CXCL2" "HOXC10" "LOC101928819" "ENSG00000268292"
## [9] "GNAS-AS1" "TVP23A" "LOC101928819" "ENSG00000268292"
## [13] "HOXC10" "GNAS-AS1" "TVP23A"
If the output is different quite possibly, let me print out the
results of above code. [1] “CH25H” “CSF3” “HOXC10”
[4] “UPK3BL1” “CXCL2” “HOXC10”
[7] “LOC101928819” “ENSG00000268292” “GNAS-AS1”
[10] “TVP23A” “LOC101928819” “ENSG00000268292” [13] “HOXC10” “GNAS-AS1”
“TVP23A”
You can see all genes returned. The genes that are not shared are CH25H, CSF3, CXCL2, and UPK3BL1.
I will manually see what models selected these genes and compare the results and gene summary.
notSharedDF <- allDF3models[allDF3models$GeneCardsName %in% c("CH25H", "CSF3","UPK3BL1","CXCL2"),]
notSharedDF
The first model selected all these genes of the means of samples myofascial pain over healthy means both fold change values per gene. No other model selected these genes. In this model, Accuracy was the metric, and the method was random forest in caret with a trControl of bootstrap aggregating with 50 randomized samples.
Overall these 4 genes not shared among all 3 models show CH25H increases fat metabolism and protection against enveloped viruses, CSF3 increases neutrophil granulocytes and slows hematopoeisis in the process, UPK3BL1 has a membrane protein that may be able to do what membrane proteins do is guard the entry and exit of cell molecules and products or resources via a gate or receptor mechanism, and CXCL2 is a cytokine seen at the site of injury with inflammation where activated neutrophils and macrophages produce it. That says the uncommon genes between these models as top importance features in classifying fibromyalgia vs healthy, that inflammation is present, infection is a possibility protected against with a pain enducing cytokine, WBCs are increased, and cell proliferation and differentiation is slowed in hematopoiesis of lymphocytes and RBCs while increasing WBCs of neutrophils and macrophages. The plasma membrane of cells have an increased contender with unstated cell function, while the fatty components of the plasma membrane a lipid or fat is being destroyed by increased fat metabolism that also destroys enveloped viruses. Its like the host’s body is shut down and not allowing anything close to it while it looks to destroy any new antigens or persisting ones.
All this data was taken from the fragment portion of genes in the nuclear RNA sequence data and the count data wasn’t used but available of the number of fragments per kilo million called FPKM. In the cell fragments of DNA can be abundant as it is being translated or destroyed, so there could be information lacking in the sequencing that allows certain transcription factors that allow translation of RNA into a protein to continue while having the excess RNA that is old in circulation recycled hence the golgi apparatus protein and use of lysosome and endosome functions in the common gene of TVP23A.
All genes tell a story in the body suffering myalgia pain that there is persistant inflammation from cytokine activity, slowed hematopoeisis, increased neutrophil and macrophage or monocyte activity, transcription factor long non-coding RNA activity that regulate copying of DNA to be transcribed into RNA ready for translating a protein to use by the body. And the HOXC10 gene that is a embryo development gene responsible for giving cells their anterior and posterior cell position and seen in nerve disorders when defective. The HOXC10 is common to all3 models and does encode a gene as well as is a transcription factor. Myalgia pain if greater than injury and chronic is known to have nerve deficits and a low threshold to pain same as fibromyalgia. This last part focused on the importance features to determine the genes that are important to understanding myalgia pain.
Lets look back at the original data of samples and fold change values for all these genes to see the up or down regulation of fold change values for these 15 top importance features genes.
top15DF <- allDF3models[allDF3models$GeneCardsName %in% altGeneNames,]
top15DF
colnames(top15DF)
## [1] "Gene" "Overall" "GeneCardsName"
## [4] "geneCardsSummary" "NCBI_summary" "UniProt_summary"
## [7] "overallGeneSummary" "chromosomeLocation" "model"
top15DF_allsamplesMeta <- merge(allDF3models, foldchangeHealthyMyoMedians, by.x='Gene', by.y='Gene')
top15DF_allsamplesMeta
colnames(top15DF_allsamplesMeta)
## [1] "Gene" "Overall"
## [3] "GeneCardsName" "geneCardsSummary"
## [5] "NCBI_summary" "UniProt_summary"
## [7] "overallGeneSummary" "chromosomeLocation"
## [9] "model" "Healthy1"
## [11] "Healthy2" "Healthy3"
## [13] "Healthy4" "Healthy5"
## [15] "myo1" "myo2"
## [17] "myo3" "myo4"
## [19] "myo5" "myo6"
## [21] "myo7" "healthy_Mean"
## [23] "myo_Mean" "healthyMedian"
## [25] "myoMedian" "foldchangeHealthyVsMyo"
## [27] "foldchangeHealthyVsMyoMedians"
Lets move the model feature next to Overall feature to show the model and rating of the gene in prediction of class.
top15DF_all <- top15DF_allsamplesMeta[,c(1,2,9,3,7,26,27,10:25,4:8)]
top15DF_all
Lets write this data frame to csv file and compare information.
write.csv(top15DF_all, 'top15DF_all.csv', row.names=F)
You can find that file here.
Just looking at the fold change or side by side comparisons of means and median values of each gene by class of myo pain or healthy, we know all these genes that are affected are down regulated in gene expression. So, we can take what we know of these inflammatory markes and say there is less of that going on in these sufferers of myofascial pain.
Less cytokine activity, more hematopoeisis, less toxin and fat metabolism, less plasma membrane protein of some sort that could be a gate or receptor to let in and out products and substrates to use by the cell or take cell products such as proteins made into the body. There is also less gene expression of the gene HOXC10 that regulates cell position from the anterior-posterior axis in embryo development that when defective shows neurological disorders in its host. Less of the transcriptio regulators of RNA being copied into pre-mRNA to make a protein and regulate DNA is occuring in the cell.
So it seems that the fibromyalgia sufferers don’t have too much inflammation but have cellular transcription deficits of protein and making too many lipids and fat, not digesting enough toxins, and are just very toxic and in need of detoxing to get their cellular functions working properly.
You can go to the beginning and read all about the study that created this gene expression data and shared on NCBI by visiting their GSE series and GPL platform. Those are found at GSE215154 series data
Part 1 is uploaded already to Rpubs and this has additions to that first published markdown file so that it is longer.
Since we got great results already using Random Forest classifier in bootstrap and cross validation parameter tuning we assumed these genes are those top genes influencing myofascial pain patterns. We investigated these genes produced and found many cellular processes were downregulated and a common denominator gene of HOXC10 was found under expressed in myofascial cases and a top feature in predicting a class as healthy or of having myofascial pain. HOXC10 when defective can have neurological deficits in the body. Many other top genes were transcription regulators, down regulation of the cellular processes that detox fats and toxins from the body, and down regulation of a plasma protein that may be a cell receptor or cell gate that allows entry and exit of products and substrates into the cell and products the body needs out of the cell. The series that created these samples of FPKM and counts of RNA molecules didn’t say if these are the RNA sequencing gene expression profiles before injection of MEF2C a known pain enhancer of musculoskeletal pain or after injection of DEX to inhibit pain after the samples were injected with MEF2C, or if these samples are before administering of any exogenous substance. So, we assume with this data science project that the samples are at some state during the experiment where they are separate classes or samples from individual hosts, not the same host at different points in time, and relate to each other in whatever process such as the healthy and myofascial pain samples are each at a point in time before administering of exogenous substances, at the time of administering MEF2C, or at the time of administering DEX done after administering MEF2C. The way the study was described, makes it possible that the myoPain class is actually the only one MEF2C was administered and also administered DEX to inhibit pain, and the healthy class is that which was injected with TNF-a a pain inflammatory biomarker, and then the healthy class was administered with DEX to inhibit pain.
Epstein-Barr is a virus with associations claimed as having with sufferers of fibromyalgia though not really enough to assert it into the gold standard testing and diagnostic pathologies of Merck Epstein-Barr is also associated loosely with multiple sclerosis and strongly with mononucleosis that is strongly associated with Epstein-Barr infection. Hodgkin Lymphoma is also associated loosely with an Epstein-Barr infection at one point in life.
This is a stepping stone research into finding out if those top genes are making appearances in samples gathered from online respectable resources to compare multiple sclerosis, fibromyalgia, mononucleosis, EBV infection, and Hodgkin disease.
Thanks so much and keep checking in to see this progress.