This markdown file for R is about myofascial pain and taken from the GSE215154 series data that found high throughput sequencing fastp data of 5 healthy samples compared to 7 samples of patients with triggerpoint myofascial pain. A search of the GSE215154 series data at NCBI gene studies will bring you to the series.
I manually downloaded each text file of the samples and the 7 text files for trigger point myofascial pain. This was the most recent study evaluating fibromyalgia patient genomic data I found a month or so ago and am now reviewing it in searching for top genes in myofascial pain or chronic pain patterns, mononucleosis, multiple sclerosis, Hodgkins disease, and Epstein-Barr infections that relate back to the common denominator at some point of life for other pathologies listed. This data science project focuses on chronic pain in those with myofascial pain patterns similar to fibromyalgia that currently does not have any diagnostic gene or associated genes or special lab testing to confirm disease. Currently only a questionnaire is available and monitoring over about 3 months while ruling out autoimmune disorders to diagnose fibromyalgia as a source of pain out of normal threshold values. This disorder was previously an ‘in your head’ disease that females got mostly and doctors would tell them to take OTC for pain and now get lifestyle in check to improve diet, sleep, stress, exercise, massage, adjustments, and meditation.
Lets read in the sample text files separately and combine them into one data frame. The sample information says that the aquisition of the data used a count of the fragments of each gene, then normalized it, and also added in the FPKM values as a final step. These are tab delimited files. When reading in the file per sample you see the gene as well as the count (of fragments then normalized) and FPKM value per gene. Internet search says the normalization is the FPKM values for the counts of fragments in RNA-seq data of transcriptional gene data using a fragment length. The gene length is multiplied by 10 billion or 10^9= 10,000,000,000, then divided by the transcripts of gene X gene length. More Information on FPKM is fragments per kilo million and TPM is transcripts per million and CPM is counts per million.
Lets load in our packages first to use later.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
healthy1 <- read.table('GSM6623971_C1.txt', header=T, sep="")
summary(healthy1)
## gene_id C1_FPKM C1_count
## Length:58302 Min. :0.000e+00 Min. : 0.0
## Class :character 1st Qu.:0.000e+00 1st Qu.: 0.0
## Mode :character Median :0.000e+00 Median : 0.0
## Mean :5.472e+00 Mean : 197.1
## 3rd Qu.:3.379e-01 3rd Qu.: 10.0
## Max. :3.058e+04 Max. :754337.0
head(healthy1)
Read in the other 4 healthy samples as tables
healthy2 <- read.table('GSM6623972_C2.txt', header=T, sep="")
healthy3 <- read.table('GSM6623973_C3.txt', header=T, sep="")
healthy4 <- read.table('GSM6623974_C4.txt', header=T, sep="")
healthy5 <- read.table('GSM6623975_C5.txt', header=T, sep="")
Lets column bind them as they all have the same number of rows and columns, but lets remove one of the variables, the counts, and keep FPKM or fragments per kilomillion of each gene. And remove the other gene columns other than the first table.
colnames(healthy1)[2] <- 'healthy1-FPKM'
colnames(healthy2)[2] <- 'healthy2-FPKM'
colnames(healthy3)[2] <- 'healthy3-FPKM'
colnames(healthy4)[2] <- 'healthy4-FPKM'
colnames(healthy5)[2] <- 'healthy5-FPKM'
healthy5samples <- cbind(healthy1[,-3], healthy2[,-c(1,3)],healthy3[,-c(1,3)], healthy4[,-c(1,3)], healthy5[,-c(1,3)])
colnames(healthy5samples) <- c('Gene','Healthy1','Healthy2','Healthy3','Healthy4','Healthy5')
head(healthy5samples)
Lets do the same with the myofacial 7 samples.
myofascial1 <- read.table('GSM6623976_M1.txt', header=T, sep="")
myofascial2 <- read.table('GSM6623977_M2.txt', header=T, sep="")
myofascial3 <- read.table('GSM6623978_M3.txt', header=T, sep="")
myofascial4 <- read.table('GSM6623979_M4.txt', header=T, sep="")
myofascial5 <- read.table('GSM6623980_M5.txt', header=T, sep="")
myofascial6 <- read.table('GSM6623981_M6.txt', header=T, sep="")
myofascial7 <- read.table('GSM6623982_M7.txt', header=T, sep="")
colnames(myofascial1)[2] <- 'myofascial1-FPKM'
colnames(myofascial2)[2] <- 'myofascial2-FPKM'
colnames(myofascial3)[2] <- 'myofascial3-FPKM'
colnames(myofascial4)[2] <- 'myofascial4-FPKM'
colnames(myofascial5)[2] <- 'myofascial5-FPKM'
colnames(myofascial6)[2] <- 'myofascial6-FPKM'
colnames(myofascial7)[2] <- 'myofascial7-FPKM'
myofascial7samples <- cbind(myofascial1[,-3], myofascial2[,-c(1,3)],myofascial3[,-c(1,3)], myofascial4[,-c(1,3)], myofascial5[,-c(1,3)], myofascial6[,-c(1,3)], myofascial7[,-c(1,3)])
colnames(myofascial7samples) <- c('Gene','myo1','myo2','myo3','myo4','myo5','myo6','myo7')
head(myofascial7samples)
Lets column bind the healthy 5 samples to the 7 myofascial pain samples with the Gene order they all have the same number of rows of.
dataHealthy5Myo7 <- cbind(healthy5samples,myofascial7samples[,-1])
head(dataHealthy5Myo7)
Summarize the data frame just made to find the mean of each sample’s normalized FPKM values within the 58,302 genes.
summary(dataHealthy5Myo7)
## Gene Healthy1 Healthy2 Healthy3
## Length:58302 Min. :0.000e+00 Min. :0.000e+00 Min. : 0.000
## Class :character 1st Qu.:0.000e+00 1st Qu.:0.000e+00 1st Qu.: 0.000
## Mode :character Median :0.000e+00 Median :0.000e+00 Median : 0.000
## Mean :5.472e+00 Mean :5.188e+00 Mean : 5.447
## 3rd Qu.:3.379e-01 3rd Qu.:6.688e-01 3rd Qu.: 0.464
## Max. :3.058e+04 Max. :2.376e+04 Max. :50031.764
## Healthy4 Healthy5 myo1
## Min. : 0.000 Min. : 0.000 Min. :0.000e+00
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.:0.000e+00
## Median : 0.000 Median : 0.000 Median :0.000e+00
## Mean : 5.764 Mean : 5.556 Mean :4.861e+00
## 3rd Qu.: 0.426 3rd Qu.: 0.467 3rd Qu.:7.898e-01
## Max. :37775.449 Max. :39829.283 Max. :1.542e+04
## myo2 myo3 myo4
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 0.000 Median : 0.000
## Mean : 5.766 Mean : 5.738 Mean : 5.113
## 3rd Qu.: 0.443 3rd Qu.: 0.309 3rd Qu.: 0.376
## Max. :43180.511 Max. :40626.886 Max. :37108.454
## myo5 myo6 myo7
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 0.000 Median : 0.000
## Mean : 5.415 Mean : 5.175 Mean : 5.864
## 3rd Qu.: 0.341 3rd Qu.: 0.371 3rd Qu.: 0.377
## Max. :41750.951 Max. :37369.240 Max. :41902.286
dim(dataHealthy5Myo7)
## [1] 58302 13
write.csv(dataHealthy5Myo7, 'fragmentsPerKMill_myofascial_5healthy_7myo_58kgenes.csv',row.names=F)
Get this file we made here file
Lets get the sample mean per gene in healthy and myofascial pain samples to compare their fold change values.
healthyMeans <- dataHealthy5Myo7 %>% group_by(Gene) %>%
mutate(
healthy_Mean = mean(Healthy1:Healthy2,na.rm=T),
myo_Mean = mean(myo1:myo7)
)
colnames(healthyMeans)
## [1] "Gene" "Healthy1" "Healthy2" "Healthy3" "Healthy4"
## [6] "Healthy5" "myo1" "myo2" "myo3" "myo4"
## [11] "myo5" "myo6" "myo7" "healthy_Mean" "myo_Mean"
Now get the Median of each sample group.
healthyMeansMedians <- healthyMeans %>% group_by(Gene) %>% mutate(healthyMedian=median(Healthy1:Healthy5, na.rm=T), myoMedian = median(myo1:myo7))
colnames(healthyMeansMedians)
## [1] "Gene" "Healthy1" "Healthy2" "Healthy3"
## [5] "Healthy4" "Healthy5" "myo1" "myo2"
## [9] "myo3" "myo4" "myo5" "myo6"
## [13] "myo7" "healthy_Mean" "myo_Mean" "healthyMedian"
## [17] "myoMedian"
head(healthyMeansMedians)
Looks like the median is the same for mean in the two samples from the first 5 entries in the data. Lets get the fold change of myofascial pain to healthy gene expression FPKM normalized values.
foldchangeHealthyMyo <- healthyMeansMedians %>% mutate(foldchangeHealthyVsMyo=(myo_Mean/(healthy_Mean+0.0000000000001)))
head(foldchangeHealthyMyo)
To avoid the NaN values of healthy 0 in denominator, I added a very very small value of one trillionth of 1. Now lets order the data by fold change from most to least and take the top 20 and bottom 20 genes and make a data frame to do some machine learning on with our 2 classes of healthy and myofascial pain.
FC_ordered <- foldchangeHealthyMyo[order(foldchangeHealthyMyo$foldchangeHealthyVsMyo, decreasing=T),]
head(FC_ordered)
tail(FC_ordered)
There may be a very vast number of genes with a zero value, so the bottom 20 won’t be revealing anything for the gene expression down regulation in myofascial pain.
sum(FC_ordered$foldchangeHealthyVsMyo==0)
## [1] 32916
There are 32,916 genes equal to zero in fold change. Since these values are normalized fragments per kilomillion, we will have to look at the bottom values above 0 that are lowest.
sum(FC_ordered$foldchangeHealthyVsMyo < 0)
## [1] 0
There are not any negative values. but more than 32,000 genes with 0 fold change, won’t give much information if ordering in decreasing order and last 32,000 genes have no information for fold change.
minFC <- subset(FC_ordered, FC_ordered$foldchangeHealthyVsMyo > 0)
tail(minFC)
We removed those 32,000 genes that had no information to add in fold change values for this RNA-seq data, and still have 25,386 genes left. We can now take the bottom 20 and top 20 to make a data frame to use in training a model to predict the best in our 2 classes of healthy or myofascial pain.
minMax20_FC <- minFC[c(1:20,25367:25386),]
dim(minMax20_FC)
## [1] 40 18
That gives us 40 genes to work with. Lets see those genes.These are Ensemble genes. They may or may not have alternate Genecard.org gene names but should have a summary on Genecards.org that we can read all about once we get our top genes.
minMax20_FC$Gene
## [1] "ENSG00000267727" "ENSG00000178372" "ENSG00000172867" "ENSG00000188508"
## [5] "ENSG00000178934" "ENSG00000167768" "ENSG00000114771" "ENSG00000095970"
## [9] "ENSG00000267368" "ENSG00000203782" "ENSG00000189001" "ENSG00000160883"
## [13] "ENSG00000271367" "ENSG00000284627" "ENSG00000204421" "ENSG00000248871"
## [17] "ENSG00000203618" "ENSG00000243466" "ENSG00000105427" "ENSG00000186458"
## [21] "ENSG00000281899" "ENSG00000235590" "ENSG00000125740" "ENSG00000180818"
## [25] "ENSG00000244193" "ENSG00000110848" "ENSG00000081041" "ENSG00000143816"
## [29] "ENSG00000136457" "ENSG00000134184" "ENSG00000163734" "ENSG00000242288"
## [33] "ENSG00000108342" "ENSG00000138135" "ENSG00000107518" "ENSG00000148677"
## [37] "ENSG00000205362" "ENSG00000007908" "ENSG00000255730" "ENSG00000136244"
write.csv(minMax20_FC, 'min20max20geneFoldchangeMyofascialPain.csv', row.names=F)
Get the file we just wrote here
colnames(minMax20_FC)
## [1] "Gene" "Healthy1" "Healthy2"
## [4] "Healthy3" "Healthy4" "Healthy5"
## [7] "myo1" "myo2" "myo3"
## [10] "myo4" "myo5" "myo6"
## [13] "myo7" "healthy_Mean" "myo_Mean"
## [16] "healthyMedian" "myoMedian" "foldchangeHealthyVsMyo"
Make the data frame but keep gene names in a character vector to make header in new transposed data frame to get samples as rows and the 40 genes as the features and add in a class feature for the sample class where we only have 12 samples of 5 healthy and 7 myofascial.
geneNames <- minMax20_FC$Gene
genes40_DF <- data.frame(t(minMax20_FC[,c(2:13)]))
colnames(genes40_DF) <- geneNames
genes40_DF$class <- c('healthy','healthy','healthy','healthy','healthy','myoPain','myoPain','myoPain','myoPain','myoPain','myoPain','myoPain')
head(genes40_DF)
Change the class from character to factor to allow confusionMatrix function to work later in this program.
genes40_DF$class <- as.factor(genes40_DF$class)
summary(genes40_DF)
## ENSG00000267727 ENSG00000178372 ENSG00000172867 ENSG00000188508
## Min. : 0.00000 Min. : 0.0000 Min. : 0.00000 Min. : 0.000
## 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.000
## Median : 0.00000 Median : 0.0000 Median : 0.00000 Median : 0.000
## Mean : 16.50957 Mean : 7.4553 Mean : 3.90308 Mean : 3.192
## 3rd Qu.: 0.09357 3rd Qu.: 0.1931 3rd Qu.: 0.01573 3rd Qu.: 0.000
## Max. :196.66379 Max. :79.0971 Max. :41.40356 Max. :30.425
## ENSG00000178934 ENSG00000167768 ENSG00000114771 ENSG00000095970
## Min. : 0.00000 Min. : 0.0000 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.00000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 0.00000 Median : 0.0000 Median : 0.000 Median : 0.0000
## Mean : 2.97886 Mean : 2.6479 Mean : 1.733 Mean : 1.5420
## 3rd Qu.: 0.04052 3rd Qu.: 0.3158 3rd Qu.: 0.000 3rd Qu.: 0.0117
## Max. :26.20979 Max. :24.8739 Max. :19.873 Max. :18.0198
## ENSG00000267368 ENSG00000203782 ENSG00000189001 ENSG00000160883
## Min. : 0.0000 Min. : 0.00 Min. :0.000 Min. :0.00000
## 1st Qu.: 0.2323 1st Qu.: 0.00 1st Qu.:0.000 1st Qu.:0.07218
## Median : 3.9112 Median : 0.00 Median :0.000 Median :0.15182
## Mean : 3.8415 Mean : 1.25 Mean :1.088 Mean :1.04723
## 3rd Qu.: 4.8437 3rd Qu.: 0.00 3rd Qu.:0.000 3rd Qu.:0.35011
## Max. :12.8966 Max. :12.68 Max. :9.905 Max. :9.61607
## ENSG00000271367 ENSG00000284627 ENSG00000204421 ENSG00000248871
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.3512 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.3838 Median :0.0000 Median :0.0000
## Mean :0.8291 Mean :0.9975 Mean :0.6937 Mean :0.9914
## 3rd Qu.:0.1984 3rd Qu.:0.6804 3rd Qu.:0.0000 3rd Qu.:0.4922
## Max. :8.9279 Max. :6.4229 Max. :7.2011 Max. :7.5098
## ENSG00000203618 ENSG00000243466 ENSG00000105427 ENSG00000186458
## Min. :0.0000 Min. :0.0000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.09635 Median :0.00000
## Mean :0.6235 Mean :0.6009 Mean :0.62817 Mean :0.55108
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.34734 3rd Qu.:0.01939
## Max. :7.4821 Max. :6.4375 Max. :5.37288 Max. :6.37223
## ENSG00000281899 ENSG00000235590 ENSG00000125740 ENSG00000180818
## Min. : 0.2017 Min. :0.01404 Min. :0.00000 Min. :0.1822
## 1st Qu.: 1.2082 1st Qu.:0.34122 1st Qu.:0.04294 1st Qu.:0.5636
## Median : 2.1032 Median :0.46236 Median :0.06920 Median :2.0618
## Mean : 3.2158 Mean :0.58643 Mean :0.49287 Mean :2.7698
## 3rd Qu.: 4.0849 3rd Qu.:0.83383 3rd Qu.:0.38365 3rd Qu.:4.1641
## Max. :10.3186 Max. :1.39007 Max. :4.14229 Max. :9.1767
## ENSG00000244193 ENSG00000110848 ENSG00000081041 ENSG00000143816
## Min. :0.1126 Min. :0.00000 Min. : 0.4343 Min. :0.03081
## 1st Qu.:0.1698 1st Qu.:0.00000 1st Qu.: 1.5015 1st Qu.:0.31764
## Median :0.2284 Median :0.04472 Median : 3.0891 Median :0.41963
## Mean :1.8863 Mean :0.30633 Mean : 16.2635 Mean :0.50944
## 3rd Qu.:4.3612 3rd Qu.:0.18548 3rd Qu.: 10.9835 3rd Qu.:0.56658
## Max. :6.6017 Max. :2.60310 Max. :140.2944 Max. :1.20174
## ENSG00000136457 ENSG00000134184 ENSG00000163734 ENSG00000242288
## Min. : 0.1767 Min. :0.2122 Min. :0.05608 Min. :0.01604
## 1st Qu.: 0.4568 1st Qu.:0.2489 1st Qu.:0.07749 1st Qu.:0.20070
## Median : 0.7175 Median :0.3456 Median :0.13468 Median :0.47441
## Mean : 3.8902 Mean :3.1536 Mean :0.66453 Mean :0.66045
## 3rd Qu.: 1.3671 3rd Qu.:7.1168 3rd Qu.:0.22441 3rd Qu.:0.94741
## Max. :23.7187 Max. :8.9442 Max. :6.20963 Max. :2.08076
## ENSG00000108342 ENSG00000138135 ENSG00000107518 ENSG00000148677
## Min. : 0.0000 Min. : 0.00000 Min. :0.00000 Min. : 0.8035
## 1st Qu.: 0.1687 1st Qu.: 0.06229 1st Qu.:0.09542 1st Qu.: 3.2329
## Median : 0.3364 Median : 0.21163 Median :0.16875 Median : 4.4409
## Mean : 2.8969 Mean : 2.60864 Mean :0.27688 Mean : 28.7645
## 3rd Qu.: 0.5461 3rd Qu.: 0.33091 3rd Qu.:0.44621 3rd Qu.: 11.0251
## Max. :31.1028 Max. :29.18270 Max. :0.93217 Max. :216.5660
## ENSG00000205362 ENSG00000007908 ENSG00000255730 ENSG00000136244
## Min. : 0.0000 Min. : 0.00000 Min. : 0.00000 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.00000 1st Qu.: 0.01814 1st Qu.: 0.0838
## Median : 0.4034 Median : 0.04708 Median : 0.03505 Median : 0.1816
## Mean : 13.3179 Mean : 2.81096 Mean : 1.58185 Mean : 3.7752
## 3rd Qu.: 0.6260 3rd Qu.: 0.40651 3rd Qu.: 0.05590 3rd Qu.: 0.2399
## Max. :155.6212 Max. :31.90556 Max. :18.62400 Max. :43.7147
## class
## healthy:5
## myoPain:7
##
##
##
##
Lets just make sure we have the same number in each class sample and only 2 classes say if a typo or extra space was added in making class factor vector.
genes40_DF %>% group_by(class) %>% count(class)
There are only 2 classes and same number in each class that can be inspected visually to see all healthy at top and all myoPain at bottom. Now we can start using some modeling algorithms in the caret package we installed earlier. Lets use boot strapping in this model instead of cross validation as we have to simulate randomness and more samples from independently and identically distributing with replacement. We will use a classifier for ensemble as well as linear modeling.
control_random <- trainControl(method = "boot", number = 50)
metric = "Accuracy"
intrain <- sample(1:12, .8*12)
training <- genes40_DF[intrain,]
testing <- genes40_DF[-intrain,]
training %>% group_by(class) %>% count(class)
testing %>% group_by(class) %>% count(class)
set.seed(567)
fitRF <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF <- predict(fitRF, testing)
RF_df <- data.frame(predictRF, type=testing$class)
RF_df
We can see above the model predicted all the myoPain class and only half the healthy class correctly with 2/3 of data correctly predicted.
importance <- varImp(fitRF)
importance
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ENSG00000081041 100.00
## ENSG00000163734 85.69
## ENSG00000148677 81.13
## ENSG00000235590 77.80
## ENSG00000138135 67.40
## ENSG00000108342 67.19
## ENSG00000107518 64.12
## ENSG00000205362 62.33
## ENSG00000143816 59.11
## ENSG00000180818 55.96
## ENSG00000255730 50.72
## ENSG00000125740 46.38
## ENSG00000284627 44.40
## ENSG00000136457 40.03
## ENSG00000242288 38.65
## ENSG00000134184 37.20
## ENSG00000281899 36.44
## ENSG00000105427 36.16
## ENSG00000178372 35.89
## ENSG00000267368 35.38
plot(importance)
lets compare the values in our top gene above that is ENSG00000081041.
We can see with our small sample data of 12 samples that there is one
sample that is an outlier and much higher than the rest of the
samples.
topGene <- genes40_DF[order(genes40_DF$ENSG00000081041,decreasing =T),]
topGene$ENSG00000081041
## [1] 140.2943774 15.3743775 13.3465208 10.1957661 4.1482004 3.6057617
## [7] 2.5723394 1.7667014 1.6710862 0.9928994 0.7591185 0.4343249
Most all samples are under 20 except one and its in the healthy sample set. Maybe we should use the median values as those are realistic to the true population rather than outliers that can skew the data.That could be another R markdown project to do later.
confusionMatrix(predictRF, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction healthy myoPain
## healthy 1 0
## myoPain 0 2
##
## Accuracy : 1
## 95% CI : (0.2924, 1)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.2963
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.3333
## Detection Rate : 0.3333
## Detection Prevalence : 0.3333
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : healthy
##
Our model using bootstrap in random forest an ensemble classifier and regression algorithm scored pretty well at 67% accuracy. There were limited samples here but bootstrapping helps to get a natural set of samples using the Central limit Theorem and other principles from econometrics.
Lets quickly test how increasing to 100 bootstrap samples will change the metric results.
control_random <- trainControl(method = "boot", number = 100)
metric = "Accuracy"
set.seed(567)
fitRF2 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF2 <- predict(fitRF2, testing)
RF_df2 <- data.frame(predictRF2, type=testing$class)
RF_df2
importance2 <- varImp(fitRF2)
importance2
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ENSG00000081041 100.00
## ENSG00000235590 99.47
## ENSG00000163734 88.73
## ENSG00000148677 77.81
## ENSG00000138135 65.94
## ENSG00000136244 63.83
## ENSG00000180818 61.24
## ENSG00000143816 59.06
## ENSG00000107518 55.11
## ENSG00000125740 54.06
## ENSG00000205362 53.45
## ENSG00000108342 52.13
## ENSG00000255730 47.54
## ENSG00000134184 47.15
## ENSG00000267368 45.92
## ENSG00000007908 40.99
## ENSG00000284627 40.33
## ENSG00000281899 38.47
## ENSG00000136457 37.83
## ENSG00000271367 36.46
plot(importance2)
confusionMatrix(predictRF2, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction healthy myoPain
## healthy 1 0
## myoPain 0 2
##
## Accuracy : 1
## 95% CI : (0.2924, 1)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.2963
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.3333
## Detection Rate : 0.3333
## Detection Prevalence : 0.3333
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : healthy
##
The results were the same using 100 bootstrapped samples vs 50. Lets try 1000 bootstrap samples and see if better.
control_random <- trainControl(method = "boot", number = 1000)
metric = "Accuracy"
set.seed(567)
fitRF3 <- train(class~., method='rf', data=training, metric=metric, trControl=control_random, na.action=na.pass)
predictRF3 <- predict(fitRF3, testing)
RF_df3 <- data.frame(predictRF3, type=testing$class)
RF_df3
importance3 <- varImp(fitRF3)
importance3
## rf variable importance
##
## only 20 most important variables shown (out of 40)
##
## Overall
## ENSG00000081041 100.00
## ENSG00000235590 74.90
## ENSG00000163734 73.40
## ENSG00000136244 67.52
## ENSG00000205362 59.31
## ENSG00000136457 52.55
## ENSG00000148677 52.10
## ENSG00000180818 50.69
## ENSG00000281899 50.43
## ENSG00000138135 45.77
## ENSG00000108342 44.41
## ENSG00000125740 44.36
## ENSG00000255730 43.99
## ENSG00000107518 43.85
## ENSG00000267368 39.26
## ENSG00000143816 38.08
## ENSG00000284627 36.08
## ENSG00000134184 35.98
## ENSG00000110848 35.12
## ENSG00000178372 34.17
plot(importance3)
confusionMatrix(predictRF3, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction healthy myoPain
## healthy 1 0
## myoPain 0 2
##
## Accuracy : 1
## 95% CI : (0.2924, 1)
## No Information Rate : 0.6667
## P-Value [Acc > NIR] : 0.2963
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.3333
## Detection Rate : 0.3333
## Detection Prevalence : 0.3333
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : healthy
##
The same results in accuracy, sensitivity, and specificity with 1000 bootstrapped samples but is the sample being correctly identified changing with each run from 50, 100, and 1000 resamples? Lets see the data frames we have of predicted versus the healthy class that stays the same.
RF_df
RF_df2
RF_df3
Looks like the predicted class correctly identified the same sample in each run and misclassified the same sample in each run. The random forest runs had some warnings produced and more with the 1000 bootstrapped samples.
Lets get those overall genes from the importance of variables output and write it to csv to add the genomic data to the top genes.
DF <- data.frame(importance[1])
DF
Order in descending order.
DF$Gene <- row.names(DF)
DataTop5 <- DF[,c(2,1)]
row.names(DataTop5) <- NULL
DataTop5genes <- DataTop5[order(DataTop5$Overall, decreasing=T),]
DataTop5genes
GenesTop5 <- DataTop5genes[1:5,]
write.csv(GenesTop5, "top5GenesMyalgia.csv", row.names=F)
GenesTop5
Get the file we just wrote here and we will add the genecards.org summaries to those genes with alternate names if available to understand these top genes that change the most in myalgia chronic pain.
Thank you for viewing this file as we understand more about data science and genes’ RNA-seq data that identifies chronic pain in myalgia sufferers.
We have more to do, like get summary information to at least the top importance genes in this RF bootstrapped model, try the fold change based on median instead to weed out the outliers skewing data in our top importance gene, and try out some other models and algorithms and parameter tuning. We can use a bunch of features in the caret package and get better at analyzing and exploring our data. Plots could be useful with the package ggplot2 that is a good package to use to visualize the data as we work with it.
Note to all, that the results in the general user interface (gui) were much worse than how html knitr output the code chunks with 100% accuracy and perfect summary statistics. But in the rstudio gui, the percentages with the seed set to 567 all produced same results on 50,100, and 1000 bootstrap samples with 67% accuracy, 50% sensitivity, and 100% specificity, but when running the code in html knitr, the results showed a completely different output of 100%. There are some bugs to work out and that may be only in html and not in pdf, but will try now for pdf. The knitr wouldn’t output the file in pdf form due to the confusion matrix factors where ’the data cannot have more levels than the reference.’So, that was changed. Internet search showed that the reason the accuracy was 100% on html and pdf knitr and not in the Gui is because the r chunks have to be changed from cache=TRUE to cache=FALSE. The default is cache=TRUE. I am not sure why it showed 100% because none of my cached confusionMatrix results showed 100%. That actually didn’t work either for html but did work for pdf. The genes are also different on the html version, where here it is one ending in 81041 as most important, in html knitr version it is a different gene entirely and not one in this version’s top 5, it ends in 180818. This is not good. Not sure why that is yet. But the results cannot be reproducible if a bug in the programming interface is preventing reproducibility.
Thanks again!