Here is some Lyme disease data put together in data frame to test top 33 genes against the target class of four classes: healthy, acute infection, one month after infection, and six months after infection. Data retrieved from NCBI platform study GPL13667-15572 and using the series matrix gene expression data in that study from GSE145974_series_matrix all from the Lyme Disease study done more than five years ago. Put together this is a very large datafile to manipulate into the working data frame for this R markdown document. Get that file by clicking this file
R packages are some of the basics for machine learning to predict one of the four classes like dplyr, tidyr, caret, and kernlab. This runs through the caret package machine learning algorithms of linear discriminate analysis or LDA, support vector machines or SVM for radial and linear separately, decision trees in the rcart algorithm, random forest algorithm, and KNN.
The preset tuning uses 10 cross fold validations to take 9 samples and test on 10th sample in the training. The training sets all the same 80% data of 86 samples and testing on the testing set of 20% of the remaining data. With the metric also set to ‘Accuracy.’
Other tuning methods could be used to fine tune and get better accuracy. Summary model statistics for sensitivity per class and specificity as well as the variance for each model and basic plots to visualize.
Background on this data frame: the data has almost 20,000 genes and 86 samples that a fold change with basic division of the 3 classes of acute, 1 month infection, and 6 months of infection are divided by the healthy comparisons by mean of those samples in each class. This data is not balanced as there are far too few chronic case samples compared to the acute, new, and chronic infections.
The fold change of the genes were used to see how gene expression changes in the cell and body per infection changes compared to healthy maintenance levels of the gene expression values. There are networks of genes that work in combination, and there are genes that will be upregulated or downregulated and those top changes as the ordered list of top 10 and bottom 10 when ordered in highest to lowest gene expression values per class are selected. Common genes made the set 33 genes instead of 20 per class of 80 genes. These genes had hundreds to hundreds of thousands of times fold change in up regulation of expression or down regulation of expression. These are clues to finding genes that are biomarkers of disease, and in this case Lyme disease. At the end we will get the top genes from the best accuracy model and read those gene summaries for an idea of the changes the Lyme disease host go through when infected.
click ‘dataset’ to get this dataset
genes33 <- read.csv('dataframeML-class-33genes.csv', header=T, na.strings=c('',' ','na','NA'))
head(genes33)
## HPGD RNF168 CTXN3 FRS3 KHDRBS3 POU4F2
## 1 0.0000453905 -0.00027800 -0.00000207 -0.0001400 -0.000219464 -0.000424625
## 2 -0.0960063920 -0.08893681 -0.04171753 0.5222826 0.320739270 0.046180010
## 3 -0.1628251360 -0.10046721 0.24628321 0.0001400 0.138545035 0.909273735
## 4 0.3786586210 -0.23633862 -0.13472350 0.1984134 -0.019687891 -0.464481365
## 5 0.2273819170 -0.01087332 -0.22562329 -0.3893075 -0.051819205 -0.246123550
## 6 -0.0666730130 0.18056202 0.21416879 -0.6040711 0.096314075 0.152537220
## RGPD3 CABP1 CENPF DLG3 ENO1 F2
## 1 -0.00024600 0.00003390 0.000020504 -0.000759415 -0.000192163 0.000134228
## 2 0.26228500 0.20398974 -0.061241150 -0.133482646 -1.344003333 -0.076257705
## 3 -0.04079795 0.53017545 -0.331103565 -0.208074767 -0.628050333 0.407919501
## 4 -0.37593746 0.03769469 0.225760225 -0.164650345 0.118298210 -0.298144263
## 5 0.19060660 -0.22982526 -0.158033967 0.047922324 -0.327089150 -0.219089430
## 6 0.26550126 0.14261413 -0.449312565 0.024404717 -0.417879100 0.365434963
## GATC IGFALS OR52A4 PDZRN3 POU3F2 PSMF1
## 1 -0.000256000 -0.000343000 -0.00027200 5.150233e-05 -0.00003100 -0.00008070
## 2 -0.127287860 0.113092660 0.06674075 9.941888e-02 0.23925924 0.07158557
## 3 -0.152866360 0.423058270 0.27201056 3.188891e-01 0.12777781 -0.45017878
## 4 -0.359013560 0.101755140 -0.26390958 -1.342481e-01 -0.07923627 0.05658539
## 5 -0.038307190 0.496858360 0.01595378 -1.581577e-01 -0.10329986 -0.05105789
## 6 -0.002932549 0.007001638 0.08342671 8.387852e-02 -0.05624676 0.04468243
## ISG20 KCNJ16 ESYT1 MAP2K7 TMEM194A HECW1
## 1 -0.00000525 0.00022185 -0.0005580 -0.000249864 -0.000213862 -0.000307797
## 2 -0.81921244 -0.14524805 -0.2626066 0.140063432 0.000878652 -0.085825205
## 3 -1.26617620 0.11858976 0.1145935 0.605252648 -0.014608858 0.706030240
## 4 0.22721529 -0.29451060 0.2984104 0.041792538 0.044892470 -0.165934685
## 5 0.11260080 -0.22437679 0.2822208 -0.020431372 -0.248416507 -0.276568890
## 6 -0.07338428 0.63788391 0.5622082 0.423514219 -0.073202532 0.296823980
## LOC400657 SLC1A1 CLEC2L CYP7B1 FAM162A NUDT18
## 1 -0.00065100 0.000299095 0.00003390 0.00007770 0.0000765325 0.001091003
## 2 -0.42919350 -0.032329679 -0.05779099 -0.13945055 -0.6245107500 -0.235054730
## 3 -0.12039113 0.131755352 0.23713994 0.48965120 -0.7001379900 0.881111150
## 4 -0.45719910 -0.076371432 -0.29611158 0.11624050 0.0864962350 -0.086553570
## 5 -0.13633752 -0.239856365 -0.34715940 -0.07544518 0.8462015400 -0.254269360
## 6 -0.06637859 0.177237513 0.17190409 0.17962694 -0.2148986000 1.138130700
## OTOS PEX26 PRR24 class
## 1 0.0002070 8.380425e-05 -0.0005050 healthy
## 2 -0.2211900 -2.901953e-02 -0.5969741 healthy
## 3 0.8309946 1.364853e-01 -0.5586877 healthy
## 4 -0.1290300 -1.274686e-02 -0.1085486 healthy
## 5 -0.1959085 1.454494e-01 -0.3364162 healthy
## 6 0.2405429 2.032977e-02 0.3638177 healthy
This data is numeric with a need to change the class target feature into a factor as it needs to be a factor and not a character class.
summary(genes33)
## HPGD RNF168 CTXN3 FRS3
## Min. :-0.31984 Min. :-1.04822 Min. :-0.253198 Min. :-0.63466
## 1st Qu.:-0.08600 1st Qu.:-0.21080 1st Qu.:-0.081222 1st Qu.:-0.16258
## Median : 0.01072 Median : 0.00000 Median : 0.005715 Median : 0.00000
## Mean : 0.05300 Mean : 0.02671 Mean : 0.042498 Mean : 0.03239
## 3rd Qu.: 0.16581 3rd Qu.: 0.25133 3rd Qu.: 0.127744 3rd Qu.: 0.23532
## Max. : 0.72735 Max. : 1.15288 Max. : 0.790707 Max. : 0.89720
## KHDRBS3 POU4F2 RGPD3 CABP1
## Min. :-0.44025 Min. :-0.46448 Min. :-0.575496 Min. :-0.66681
## 1st Qu.:-0.10231 1st Qu.:-0.11446 1st Qu.:-0.206396 1st Qu.:-0.23453
## Median :-0.00458 Median : 0.02521 Median : 0.000000 Median : 0.00000
## Mean : 0.03926 Mean : 0.04024 Mean : 0.004775 Mean : 0.01761
## 3rd Qu.: 0.15134 3rd Qu.: 0.13089 3rd Qu.: 0.127204 3rd Qu.: 0.20924
## Max. : 0.74125 Max. : 0.90927 Max. : 1.071542 Max. : 0.79515
## CENPF DLG3 ENO1 F2
## Min. :-0.85451 Min. :-0.613058 Min. :-1.470905 Min. :-0.40990
## 1st Qu.:-0.19664 1st Qu.:-0.184464 1st Qu.:-0.324407 1st Qu.:-0.17769
## Median :-0.02914 Median :-0.006313 Median :-0.006482 Median : 0.00176
## Mean : 0.03726 Mean : 0.021518 Mean :-0.049029 Mean : 0.03227
## 3rd Qu.: 0.24548 3rd Qu.: 0.209471 3rd Qu.: 0.247317 3rd Qu.: 0.18877
## Max. : 1.55048 Max. : 0.842132 Max. : 0.889147 Max. : 0.76103
## GATC IGFALS OR52A4 PDZRN3
## Min. :-1.21408 Min. :-0.87045 Min. :-0.668849 Min. :-0.27905
## 1st Qu.:-0.20289 1st Qu.:-0.26604 1st Qu.:-0.174138 1st Qu.:-0.09242
## Median : 0.00000 Median : 0.00000 Median : 0.000000 Median : 0.01071
## Mean : 0.04151 Mean : 0.00202 Mean : 0.005875 Mean : 0.04342
## 3rd Qu.: 0.22343 3rd Qu.: 0.21561 3rd Qu.: 0.128767 3rd Qu.: 0.14553
## Max. : 1.22686 Max. : 0.87783 Max. : 1.217344 Max. : 0.78095
## POU3F2 PSMF1 ISG20
## Min. :-0.42853 Min. :-0.637856 Min. :-2.281e+00
## 1st Qu.:-0.11388 1st Qu.:-0.183982 1st Qu.:-3.640e-01
## Median : 0.00000 Median : 0.031119 Median :-2.400e-07
## Mean : 0.05579 Mean :-0.003446 Mean :-1.294e-02
## 3rd Qu.: 0.14947 3rd Qu.: 0.167941 3rd Qu.: 3.336e-01
## Max. : 1.55298 Max. : 0.599328 Max. : 1.347e+00
## KCNJ16 ESYT1 MAP2K7 TMEM194A
## Min. :-0.334332 Min. :-1.26733 Min. :-0.89929 Min. :-0.620697
## 1st Qu.:-0.133381 1st Qu.:-0.19163 1st Qu.:-0.23288 1st Qu.:-0.106241
## Median : 0.004698 Median : 0.00000 Median :-0.01951 Median : 0.002695
## Mean : 0.072905 Mean : 0.05526 Mean :-0.01275 Mean : 0.044047
## 3rd Qu.: 0.231967 3rd Qu.: 0.34674 3rd Qu.: 0.17043 3rd Qu.: 0.218735
## Max. : 1.146404 Max. : 2.03143 Max. : 0.92057 Max. : 0.649779
## HECW1 LOC400657 SLC1A1 CLEC2L
## Min. :-0.43745 Min. :-0.75434 Min. :-0.37749 Min. :-0.79850
## 1st Qu.:-0.13235 1st Qu.:-0.19917 1st Qu.:-0.07584 1st Qu.:-0.27268
## Median : 0.01114 Median : 0.00000 Median : 0.02260 Median : 0.00000
## Mean : 0.04916 Mean :-0.01098 Mean : 0.03665 Mean : 0.01269
## 3rd Qu.: 0.18338 3rd Qu.: 0.19267 3rd Qu.: 0.13522 3rd Qu.: 0.20342
## Max. : 1.12452 Max. : 1.39459 Max. : 0.69026 Max. : 1.08305
## CYP7B1 FAM162A NUDT18 OTOS
## Min. :-0.70375 Min. :-0.758900 Min. :-8.531e-01 Min. :-0.6481
## 1st Qu.:-0.16720 1st Qu.:-0.246998 1st Qu.:-2.411e-01 1st Qu.:-0.1497
## Median : 0.00000 Median : 0.003583 Median : 1.200e-07 Median : 0.0000
## Mean : 0.02685 Mean : 0.011434 Mean : 3.641e-02 Mean : 0.0624
## 3rd Qu.: 0.17932 3rd Qu.: 0.250876 3rd Qu.: 2.974e-01 3rd Qu.: 0.2135
## Max. : 1.73670 Max. : 0.989418 Max. : 1.138e+00 Max. : 1.1077
## PEX26 PRR24 class
## Min. :-0.553936 Min. :-0.99676 Length:86
## 1st Qu.:-0.103130 1st Qu.:-0.26198 Class :character
## Median :-0.013050 Median : 0.00000 Mode :character
## Mean : 0.002015 Mean :-0.01048
## 3rd Qu.: 0.162993 3rd Qu.: 0.23194
## Max. : 0.487500 Max. : 1.24504
genes33$class <- as.factor(genes33$class)
class(genes33$class)
## [1] "factor"
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
intraining <- createDataPartition(genes33$class, p=.80, list=F)
training <- genes33[intraining,]
testing <- genes33[-intraining,]
We can see the training set values in each class and the testing set values that we are testing our model’s accuracy in prediction of class based on these genes and the model’s specific algorithm but all having same metric and cross validation settings of Accuracy and 10 respectively.
training %>% group_by(class) %>% count(class)
## # A tibble: 4 × 2
## # Groups: class [4]
## class n
## <fct> <int>
## 1 1 month infected 22
## 2 6 months infected 8
## 3 acute infection 23
## 4 healthy 17
testing %>% group_by(class) %>% count(class)
## # A tibble: 4 × 2
## # Groups: class [4]
## class n
## <fct> <int>
## 1 1 month infected 5
## 2 6 months infected 2
## 3 acute infection 5
## 4 healthy 4
control = trainControl(method='cv', number=10)
metric="Accuracy"
Setting the seed before each run will make sure the same computation is as it is done in this lab run.
set.seed(4567)
fitlda <- train(class~., data=training, method='lda', metric=metric, trControl=control)
predictLDA <- predict(fitlda, testing)
data <- data.frame(predictLDA, type=testing$class)
data
## predictLDA type
## 1 acute infection healthy
## 2 1 month infected healthy
## 3 healthy healthy
## 4 healthy healthy
## 5 acute infection acute infection
## 6 6 months infected acute infection
## 7 acute infection acute infection
## 8 acute infection acute infection
## 9 1 month infected acute infection
## 10 acute infection 1 month infected
## 11 1 month infected 1 month infected
## 12 acute infection 1 month infected
## 13 acute infection 1 month infected
## 14 acute infection 1 month infected
## 15 6 months infected 6 months infected
## 16 6 months infected 6 months infected
The data frame above shows how well the model predicted each class and what the actual class was in the testing set that the model was supposed to predict with high accuracy. We can run the other models and see which one scored best in accuracy.
confusionMatrix(predictLDA,testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 month infected 6 months infected acute infection healthy
## 1 month infected 1 0 1 1
## 6 months infected 0 2 1 0
## acute infection 4 0 3 1
## healthy 0 0 0 2
##
## Overall Statistics
##
## Accuracy : 0.5
## 95% CI : (0.2465, 0.7535)
## No Information Rate : 0.3125
## P-Value [Acc > NIR] : 0.0918
##
## Kappa : 0.3155
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 month infected Class: 6 months infected
## Sensitivity 0.2000 1.0000
## Specificity 0.8182 0.9286
## Pos Pred Value 0.3333 0.6667
## Neg Pred Value 0.6923 1.0000
## Prevalence 0.3125 0.1250
## Detection Rate 0.0625 0.1250
## Detection Prevalence 0.1875 0.1875
## Balanced Accuracy 0.5091 0.9643
## Class: acute infection Class: healthy
## Sensitivity 0.6000 0.5000
## Specificity 0.5455 1.0000
## Pos Pred Value 0.3750 1.0000
## Neg Pred Value 0.7500 0.8571
## Prevalence 0.3125 0.2500
## Detection Rate 0.1875 0.1250
## Detection Prevalence 0.5000 0.1250
## Balanced Accuracy 0.5727 0.7500
set.seed(4567)
fitcart <- train(class~., data=training, method='rpart', metric=metric, trControl=control)
predictCart <- predict(fitcart,testing)
data <- data.frame(predictCart, type= testing$class)
data
## predictCart type
## 1 acute infection healthy
## 2 1 month infected healthy
## 3 1 month infected healthy
## 4 healthy healthy
## 5 healthy acute infection
## 6 healthy acute infection
## 7 acute infection acute infection
## 8 acute infection acute infection
## 9 acute infection acute infection
## 10 healthy 1 month infected
## 11 1 month infected 1 month infected
## 12 acute infection 1 month infected
## 13 healthy 1 month infected
## 14 1 month infected 1 month infected
## 15 1 month infected 6 months infected
## 16 healthy 6 months infected
confusionMatrix(predictCart,testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 month infected 6 months infected acute infection healthy
## 1 month infected 2 1 0 2
## 6 months infected 0 0 0 0
## acute infection 1 0 3 1
## healthy 2 1 2 1
##
## Overall Statistics
##
## Accuracy : 0.375
## 95% CI : (0.152, 0.6457)
## No Information Rate : 0.3125
## P-Value [Acc > NIR] : 0.382
##
## Kappa : 0.1209
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 month infected Class: 6 months infected
## Sensitivity 0.4000 0.000
## Specificity 0.7273 1.000
## Pos Pred Value 0.4000 NaN
## Neg Pred Value 0.7273 0.875
## Prevalence 0.3125 0.125
## Detection Rate 0.1250 0.000
## Detection Prevalence 0.3125 0.000
## Balanced Accuracy 0.5636 0.500
## Class: acute infection Class: healthy
## Sensitivity 0.6000 0.2500
## Specificity 0.8182 0.5833
## Pos Pred Value 0.6000 0.1667
## Neg Pred Value 0.8182 0.7000
## Prevalence 0.3125 0.2500
## Detection Rate 0.1875 0.0625
## Detection Prevalence 0.3125 0.3750
## Balanced Accuracy 0.7091 0.4167
set.seed(4567)
fitknn <- train(class~., data=training, method='knn', metric=metric, trControl=control)
predictKNN <- predict(fitknn,testing)
data <- data.frame(predictKNN, type=testing$class)
data
## predictKNN type
## 1 healthy healthy
## 2 1 month infected healthy
## 3 healthy healthy
## 4 healthy healthy
## 5 1 month infected acute infection
## 6 healthy acute infection
## 7 acute infection acute infection
## 8 1 month infected acute infection
## 9 1 month infected acute infection
## 10 healthy 1 month infected
## 11 acute infection 1 month infected
## 12 acute infection 1 month infected
## 13 1 month infected 1 month infected
## 14 healthy 1 month infected
## 15 1 month infected 6 months infected
## 16 6 months infected 6 months infected
confusionMatrix(predictKNN, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 month infected 6 months infected acute infection healthy
## 1 month infected 1 1 3 1
## 6 months infected 0 1 0 0
## acute infection 2 0 1 0
## healthy 2 0 1 3
##
## Overall Statistics
##
## Accuracy : 0.375
## 95% CI : (0.152, 0.6457)
## No Information Rate : 0.3125
## P-Value [Acc > NIR] : 0.382
##
## Kappa : 0.1351
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 month infected Class: 6 months infected
## Sensitivity 0.2000 0.5000
## Specificity 0.5455 1.0000
## Pos Pred Value 0.1667 1.0000
## Neg Pred Value 0.6000 0.9333
## Prevalence 0.3125 0.1250
## Detection Rate 0.0625 0.0625
## Detection Prevalence 0.3750 0.0625
## Balanced Accuracy 0.3727 0.7500
## Class: acute infection Class: healthy
## Sensitivity 0.2000 0.7500
## Specificity 0.8182 0.7500
## Pos Pred Value 0.3333 0.5000
## Neg Pred Value 0.6923 0.9000
## Prevalence 0.3125 0.2500
## Detection Rate 0.0625 0.1875
## Detection Prevalence 0.1875 0.3750
## Balanced Accuracy 0.5091 0.7500
set.seed(4567)
fitRF <- train(class~., data=training, method='rf', metric=metric, trControl=control)
predictRF <- predict(fitRF,testing)
data <- data.frame(predictRF, type=testing$class)
data
## predictRF type
## 1 acute infection healthy
## 2 1 month infected healthy
## 3 1 month infected healthy
## 4 healthy healthy
## 5 acute infection acute infection
## 6 acute infection acute infection
## 7 acute infection acute infection
## 8 acute infection acute infection
## 9 acute infection acute infection
## 10 healthy 1 month infected
## 11 1 month infected 1 month infected
## 12 acute infection 1 month infected
## 13 acute infection 1 month infected
## 14 1 month infected 1 month infected
## 15 1 month infected 6 months infected
## 16 6 months infected 6 months infected
confusionMatrix(predictRF,testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 month infected 6 months infected acute infection healthy
## 1 month infected 2 1 0 2
## 6 months infected 0 1 0 0
## acute infection 2 0 5 1
## healthy 1 0 0 1
##
## Overall Statistics
##
## Accuracy : 0.5625
## 95% CI : (0.2988, 0.8025)
## No Information Rate : 0.3125
## P-Value [Acc > NIR] : 0.03338
##
## Kappa : 0.3812
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 month infected Class: 6 months infected
## Sensitivity 0.4000 0.5000
## Specificity 0.7273 1.0000
## Pos Pred Value 0.4000 1.0000
## Neg Pred Value 0.7273 0.9333
## Prevalence 0.3125 0.1250
## Detection Rate 0.1250 0.0625
## Detection Prevalence 0.3125 0.0625
## Balanced Accuracy 0.5636 0.7500
## Class: acute infection Class: healthy
## Sensitivity 1.0000 0.2500
## Specificity 0.7273 0.9167
## Pos Pred Value 0.6250 0.5000
## Neg Pred Value 1.0000 0.7857
## Prevalence 0.3125 0.2500
## Detection Rate 0.3125 0.0625
## Detection Prevalence 0.5000 0.1250
## Balanced Accuracy 0.8636 0.5833
set.seed(4567)
fitsvmRadial <- train(class~., data=training, method='svmRadial', metric=metric, trControl=control)
predictSVMradial <- predict(fitsvmRadial, testing)
data <- data.frame(predictSVMradial, type=testing$class)
data
## predictSVMradial type
## 1 acute infection healthy
## 2 1 month infected healthy
## 3 acute infection healthy
## 4 1 month infected healthy
## 5 acute infection acute infection
## 6 acute infection acute infection
## 7 acute infection acute infection
## 8 acute infection acute infection
## 9 acute infection acute infection
## 10 acute infection 1 month infected
## 11 1 month infected 1 month infected
## 12 acute infection 1 month infected
## 13 acute infection 1 month infected
## 14 acute infection 1 month infected
## 15 1 month infected 6 months infected
## 16 healthy 6 months infected
confusionMatrix(predictSVMradial, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 month infected 6 months infected acute infection healthy
## 1 month infected 1 1 0 2
## 6 months infected 0 0 0 0
## acute infection 4 0 5 2
## healthy 0 1 0 0
##
## Overall Statistics
##
## Accuracy : 0.375
## 95% CI : (0.152, 0.6457)
## No Information Rate : 0.3125
## P-Value [Acc > NIR] : 0.382
##
## Kappa : 0.096
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 month infected Class: 6 months infected
## Sensitivity 0.2000 0.000
## Specificity 0.7273 1.000
## Pos Pred Value 0.2500 NaN
## Neg Pred Value 0.6667 0.875
## Prevalence 0.3125 0.125
## Detection Rate 0.0625 0.000
## Detection Prevalence 0.2500 0.000
## Balanced Accuracy 0.4636 0.500
## Class: acute infection Class: healthy
## Sensitivity 1.0000 0.0000
## Specificity 0.4545 0.9167
## Pos Pred Value 0.4545 0.0000
## Neg Pred Value 1.0000 0.7333
## Prevalence 0.3125 0.2500
## Detection Rate 0.3125 0.0000
## Detection Prevalence 0.6875 0.0625
## Balanced Accuracy 0.7273 0.4583
fitsvmLinear <- train(class~., data=training, method='svmLinear', metric=metric, trControl=control)
predictSVMlinear <- predict(fitsvmLinear,testing)
data <- data.frame(predictSVMlinear, type=testing$class)
data
## predictSVMlinear type
## 1 acute infection healthy
## 2 acute infection healthy
## 3 healthy healthy
## 4 healthy healthy
## 5 6 months infected acute infection
## 6 healthy acute infection
## 7 1 month infected acute infection
## 8 acute infection acute infection
## 9 1 month infected acute infection
## 10 healthy 1 month infected
## 11 acute infection 1 month infected
## 12 acute infection 1 month infected
## 13 acute infection 1 month infected
## 14 acute infection 1 month infected
## 15 healthy 6 months infected
## 16 6 months infected 6 months infected
confusionMatrix(predictSVMlinear,testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 month infected 6 months infected acute infection healthy
## 1 month infected 0 0 2 0
## 6 months infected 0 1 1 0
## acute infection 4 0 1 2
## healthy 1 1 1 2
##
## Overall Statistics
##
## Accuracy : 0.25
## 95% CI : (0.0727, 0.5238)
## No Information Rate : 0.3125
## P-Value [Acc > NIR] : 0.7866
##
## Kappa : -0.0267
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 month infected Class: 6 months infected
## Sensitivity 0.0000 0.5000
## Specificity 0.8182 0.9286
## Pos Pred Value 0.0000 0.5000
## Neg Pred Value 0.6429 0.9286
## Prevalence 0.3125 0.1250
## Detection Rate 0.0000 0.0625
## Detection Prevalence 0.1250 0.1250
## Balanced Accuracy 0.4091 0.7143
## Class: acute infection Class: healthy
## Sensitivity 0.2000 0.5000
## Specificity 0.4545 0.7500
## Pos Pred Value 0.1429 0.4000
## Neg Pred Value 0.5556 0.8182
## Prevalence 0.3125 0.2500
## Detection Rate 0.0625 0.1250
## Detection Prevalence 0.4375 0.3125
## Balanced Accuracy 0.3273 0.6250
results <- resamples(list(lda=fitlda, rpart=fitcart, knn=fitknn, rf=fitRF, svmR=fitsvmRadial, svmL=fitsvmLinear))
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: lda, rpart, knn, rf, svmR, svmL
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.1428571 0.2857143 0.3303571 0.3630952 0.4151786 0.6666667 0
## rpart 0.0000000 0.4464286 0.5000000 0.5166667 0.6428571 0.8571429 0
## knn 0.1250000 0.2857143 0.4285714 0.4202381 0.5535714 0.6666667 0
## rf 0.2857143 0.4464286 0.5357143 0.5452381 0.5714286 0.8571429 0
## svmR 0.1428571 0.3080357 0.4017857 0.4369048 0.5714286 0.6666667 0
## svmL 0.1250000 0.3333333 0.3541667 0.4142857 0.5535714 0.7142857 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda -0.13513514 0.006944444 0.05934343 0.1154922 0.2201087 0.5000000 0
## rpart -0.32432432 0.253289474 0.29661836 0.3223737 0.4791667 0.8000000 0
## knn -0.19148936 0.040570175 0.21111111 0.1976820 0.3625000 0.5000000 0
## rf -0.02941176 0.204575163 0.35217391 0.3517747 0.4166667 0.7941176 0
## svmR -0.23529412 0.022727273 0.13368984 0.1939661 0.4125000 0.5000000 0
## svmL -0.19148936 0.041904762 0.07692308 0.1798827 0.3525145 0.6216216 0
dotplot(results)
importanceRcart <- varImp(fitcart)
importanceRcart
## rpart variable importance
##
## only 20 most important variables shown (out of 33)
##
## Overall
## PEX26 100.00
## OTOS 76.92
## RNF168 75.37
## CENPF 57.82
## ESYT1 34.65
## ENO1 34.55
## DLG3 27.20
## FAM162A 0.00
## CTXN3 0.00
## F2 0.00
## PSMF1 0.00
## TMEM194A 0.00
## CLEC2L 0.00
## POU3F2 0.00
## HECW1 0.00
## SLC1A1 0.00
## KHDRBS3 0.00
## PDZRN3 0.00
## ISG20 0.00
## KCNJ16 0.00
plot(importanceRcart)
importanceLDA <- varImp(fitlda)
importanceLDA
## ROC curve variable importance
##
## variables are sorted by maximum importance across the classes
## only 20 most important variables shown (out of 33)
##
## X1.month.infected X6.months.infected acute.infection healthy
## RNF168 96.02 96.015 100.00 25.996
## ISG20 88.74 88.741 91.65 38.765
## ESYT1 85.83 85.832 85.83 33.080
## CENPF 62.43 14.548 80.52 62.429
## DLG3 65.46 65.465 65.46 51.773
## PEX26 43.64 64.780 43.64 64.780
## OTOS 29.54 63.411 17.46 63.411
## ENO1 33.46 44.242 62.43 44.242
## PSMF1 45.10 45.098 62.43 31.235
## KHDRBS3 45.10 45.098 51.30 9.328
## CABP1 50.92 50.917 50.92 45.611
## HECW1 21.82 50.404 21.82 50.404
## GATC 36.37 36.369 49.91 14.120
## NUDT18 18.91 30.550 47.12 30.550
## F2 45.10 45.098 45.10 20.281
## CTXN3 26.19 26.186 44.34 15.370
## POU4F2 41.68 18.912 18.91 41.682
## CLEC2L 27.64 40.819 27.64 40.819
## FRS3 33.46 33.460 38.77 5.756
## IGFALS 37.13 1.797 20.68 37.128
plot(importanceLDA)
importanceKNN <- varImp(fitknn)
importanceKNN
## ROC curve variable importance
##
## variables are sorted by maximum importance across the classes
## only 20 most important variables shown (out of 33)
##
## X1.month.infected X6.months.infected acute.infection healthy
## RNF168 96.02 96.015 100.00 25.996
## ISG20 88.74 88.741 91.65 38.765
## ESYT1 85.83 85.832 85.83 33.080
## CENPF 62.43 14.548 80.52 62.429
## DLG3 65.46 65.465 65.46 51.773
## PEX26 43.64 64.780 43.64 64.780
## OTOS 29.54 63.411 17.46 63.411
## ENO1 33.46 44.242 62.43 44.242
## PSMF1 45.10 45.098 62.43 31.235
## KHDRBS3 45.10 45.098 51.30 9.328
## CABP1 50.92 50.917 50.92 45.611
## HECW1 21.82 50.404 21.82 50.404
## GATC 36.37 36.369 49.91 14.120
## NUDT18 18.91 30.550 47.12 30.550
## F2 45.10 45.098 45.10 20.281
## CTXN3 26.19 26.186 44.34 15.370
## POU4F2 41.68 18.912 18.91 41.682
## CLEC2L 27.64 40.819 27.64 40.819
## FRS3 33.46 33.460 38.77 5.756
## IGFALS 37.13 1.797 20.68 37.128
importanceRF <- varImp(fitRF)
importanceRF
## rf variable importance
##
## only 20 most important variables shown (out of 33)
##
## Overall
## CENPF 100.00
## PEX26 66.77
## RNF168 53.37
## OTOS 42.75
## DLG3 37.44
## ESYT1 36.89
## ENO1 29.15
## ISG20 27.57
## POU4F2 22.90
## CYP7B1 20.97
## SLC1A1 20.69
## CABP1 14.96
## IGFALS 14.76
## KCNJ16 13.46
## FAM162A 13.41
## HECW1 13.25
## PSMF1 11.79
## POU3F2 11.42
## NUDT18 11.28
## GATC 8.96
importanceSVM_radial <- varImp(fitsvmRadial)
importanceSVM_radial
## ROC curve variable importance
##
## variables are sorted by maximum importance across the classes
## only 20 most important variables shown (out of 33)
##
## X1.month.infected X6.months.infected acute.infection healthy
## RNF168 96.02 96.015 100.00 25.996
## ISG20 88.74 88.741 91.65 38.765
## ESYT1 85.83 85.832 85.83 33.080
## CENPF 62.43 14.548 80.52 62.429
## DLG3 65.46 65.465 65.46 51.773
## PEX26 43.64 64.780 43.64 64.780
## OTOS 29.54 63.411 17.46 63.411
## ENO1 33.46 44.242 62.43 44.242
## PSMF1 45.10 45.098 62.43 31.235
## KHDRBS3 45.10 45.098 51.30 9.328
## CABP1 50.92 50.917 50.92 45.611
## HECW1 21.82 50.404 21.82 50.404
## GATC 36.37 36.369 49.91 14.120
## NUDT18 18.91 30.550 47.12 30.550
## F2 45.10 45.098 45.10 20.281
## CTXN3 26.19 26.186 44.34 15.370
## POU4F2 41.68 18.912 18.91 41.682
## CLEC2L 27.64 40.819 27.64 40.819
## FRS3 33.46 33.460 38.77 5.756
## IGFALS 37.13 1.797 20.68 37.128
plot(importanceSVM_radial)
importanceSVM_linear <- varImp(fitsvmLinear)
importanceSVM_linear
## ROC curve variable importance
##
## variables are sorted by maximum importance across the classes
## only 20 most important variables shown (out of 33)
##
## X1.month.infected X6.months.infected acute.infection healthy
## RNF168 96.02 96.015 100.00 25.996
## ISG20 88.74 88.741 91.65 38.765
## ESYT1 85.83 85.832 85.83 33.080
## CENPF 62.43 14.548 80.52 62.429
## DLG3 65.46 65.465 65.46 51.773
## PEX26 43.64 64.780 43.64 64.780
## OTOS 29.54 63.411 17.46 63.411
## PSMF1 45.10 45.098 62.43 31.235
## ENO1 33.46 44.242 62.43 44.242
## KHDRBS3 45.10 45.098 51.30 9.328
## CABP1 50.92 50.917 50.92 45.611
## HECW1 21.82 50.404 21.82 50.404
## GATC 36.37 36.369 49.91 14.120
## NUDT18 18.91 30.550 47.12 30.550
## F2 45.10 45.098 45.10 20.281
## CTXN3 26.19 26.186 44.34 15.370
## POU4F2 41.68 18.912 18.91 41.682
## CLEC2L 27.64 40.819 27.64 40.819
## FRS3 33.46 33.460 38.77 5.756
## IGFALS 37.13 1.797 20.68 37.128
In this file going back, the rcart algorithm scored the best accuracy of 42% using our set.seed value of 4567, 10 folds of cross validation, and Accuracy for the metric. The top 5 genes for the rcart are:
importanceRcart
## rpart variable importance
##
## only 20 most important variables shown (out of 33)
##
## Overall
## PEX26 100.00
## OTOS 76.92
## RNF168 75.37
## CENPF 57.82
## ESYT1 34.65
## ENO1 34.55
## DLG3 27.20
## SLC1A1 0.00
## HECW1 0.00
## LOC400657 0.00
## CTXN3 0.00
## GATC 0.00
## POU4F2 0.00
## IGFALS 0.00
## HPGD 0.00
## FAM162A 0.00
## CYP7B1 0.00
## MAP2K7 0.00
## KCNJ16 0.00
## RGPD3 0.00
CENFP, ENO1, ESYT1, RNF168, and CYP7B1 are the top 5 genes for classifying our target classes from the model with the best accuracy. Lets read in the gene summaries and select those genes from that data. This was manually imputed from Genecards.org. Get the file made by me by clicking gene summaries
summary33 <- read.csv('33geneschrsummaries.csv', header=T, na.strings=c('',' ', 'na','NA'))
topGenes5 <- c("CENPF","ENO1","ESYT1","RNF168","CYP7B1")
best5 <- summary33[summary33$Gene %in% topGenes5,]
best5$Gene
## [1] "CENPF" "CYP7B1" "ENO1" "ESYT1" "RNF168"
best5
## Gene Chromosomal.Location
## 2 CENPF chr1q32-q41
## 5 CYP7B1 chr8q21.3
## 7 ENO1 chr1p36.3-p36.2
## 8 ESYT1 chr12q13.2
## 31 RNF168 chr3q29
## GeneCardsGeneSummary
## 2 CENPF (Centromere Protein F) is a Protein Coding gene. Diseases associated with CENPF include Stromme Syndrome and Corpus Callosum, Agenesis Of. Among its related pathways are EML4 and NUDC in mitotic spindle formation and Separation of Sister Chromatids. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and transcription factor binding.
## 5 GeneCards Symbol: CYP7B1 2 \nCytochrome P450 Family 7 Subfamily B Member 1 2 3 5\nCYP7B1 (Cytochrome P450 Family 7 Subfamily B Member 1) is a Protein Coding gene. Diseases associated with CYP7B1 include Spastic Paraplegia 5A, Autosomal Recessive and Bile Acid Synthesis Defect, Congenital, 3. Among its related pathways are Metapathway biotransformation Phase I and II and Bile acid and bile salt metabolism. Gene Ontology (GO) annotations related to this gene include iron ion binding and oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen. An important paralog of this gene is CYP7A1.
## 7 GeneCards Symbol: ENO1 2 \nEnolase 1 2 3 4 5\n\nENO1 (Enolase 1) is a Protein Coding gene. Diseases associated with ENO1 include Glycogen Storage Disease Xiii and Lung Cancer Susceptibility 3. Among its related pathways are glycolysis (BioCyc) and Gluconeogenesis. Gene Ontology (GO) annotations related to this gene include RNA binding and transcription corepressor activity. An important paralog of this gene is ENO2.\n
## 8 Aliases for ESYT1 Gene\nGeneCards Symbol: ESYT1 2 \nExtended Synaptotagmin 1 2 3 5\nMBC2 2 3 4 5\nKIAA0747 2 4 5\nFAM62A 3 4 5 \nESYT1 (Extended Synaptotagmin 1) is a Protein Coding gene. Diseases associated with ESYT1 include Developmental And Epileptic Encephalopathy 26 and Stormorken Syndrome. Among its related pathways are Signaling by Rho GTPases and RAC2 GTPase cycle. Gene Ontology (GO) annotations related to this gene include lipid binding and phospholipid binding. An important paralog of this gene is ESYT2.
## 31 GeneCards Symbol: RNF168 2 \nRing Finger Protein 168 RNF168 (Ring Finger Protein 168) is a Protein Coding gene. Diseases associated with RNF168 include Riddle Syndrome and Digeorge Syndrome. Among its related pathways are HDR through Homologous Recombination (HRR) or Single Strand Annealing (SSA) and DNA Double Strand Break Response. Gene Ontology (GO) annotations related to this gene include chromatin binding and ubiquitin-protein transferase activity. An important paralog of this gene is RNF169.
## NCBI_GeneSummary
## 2 This gene encodes a protein that associates with the centromere-kinetochore complex. The protein is a component of the nuclear matrix during the G2 phase of interphase. In late G2 the protein associates with the kinetochore and maintains this association through early anaphase. It localizes to the spindle midzone and the intracellular bridge in late anaphase and telophase, respectively, and is thought to be subsequently degraded. The localization of this protein suggests that it may play a role in chromosome segregation during mitotis. It is thought to form either a homodimer or heterodimer. Autoantibodies against this protein have been found in patients with cancer or graft versus host disease. [provided by RefSeq, Jul 2008]
## 5 This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins are monooxygenases which catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids. This endoplasmic reticulum membrane protein catalyzes the first reaction in the cholesterol catabolic pathway of extrahepatic tissues, which converts cholesterol to bile acids. This enzyme likely plays a minor role in total bile acid synthesis, but may also be involved in the development of atherosclerosis, neurosteroid metabolism and sex hormone synthesis. Mutations in this gene have been associated with hereditary spastic paraplegia (SPG5 or HSP), an autosomal recessive disorder. [provided by RefSeq, Apr 2016]
## 7 This gene encodes alpha-enolase, one of three enolase isoenzymes found in mammals. Each isoenzyme is a homodimer composed of 2 alpha, 2 gamma, or 2 beta subunits, and functions as a glycolytic enzyme. Alpha-enolase in addition, functions as a structural lens protein (tau-crystallin) in the monomeric form. Alternative splicing of this gene results in a shorter isoform that has been shown to bind to the c-myc promoter and function as a tumor suppressor. Several pseudogenes have been identified, including one on the long arm of chromosome 1. Alpha-enolase has also been identified as an autoantigen in Hashimoto encephalopathy. [provided by RefSeq, Jan 2011]
## 8 Enables identical protein binding activity and phospholipid transfer activity. Involved in intermembrane lipid transfer. Located in endoplasmic reticulum membrane. [provided by Alliance of Genome Resources, Jul 2025]
## 31 This gene encodes an E3 ubiquitin ligase protein that contains a RING finger, a motif present in a variety of functionally distinct proteins and known to be involved in protein-DNA and protein-protein interactions. The protein is involved in DNA double-strand break (DSB) repair. Mutations in this gene result in Riddle syndrome. [provided by RefSeq, Sep 2011]
## UniProtKB_Swiss.Prot_GeneSummary
## 2 Required for kinetochore function and chromosome segregation in mitosis. Required for kinetochore localization of dynein, LIS1, NDE1 and NDEL1. Regulates recycling of the plasma membrane by acting as a link between recycling vesicles and the microtubule network though its association with STX4 and SNAP25. Acts as a potential inhibitor of pocket protein-mediated cellular processes during development by regulating the activity of RB proteins during cell division and proliferation. May play a regulatory or permissive role in the normal embryonic cardiomyocyte cell cycle and in promoting continued mitosis in transformed, abnormally dividing neonatal cardiomyocytes. Interaction with RB directs embryonic stem cells toward a cardiac lineage. Involved in the regulation of DNA synthesis and hence cell cycle progression, via its C-terminus. Has a potential role regulating skeletal myogenesis and in cell differentiation in embryogenesis. Involved in dendritic cell regulation of T-cell immunity against chlamydia. ( CENPF_HUMAN,P49454 )
## 5 A cytochrome P450 monooxygenase involved in the metabolism of endogenous oxysterols and steroid hormones, including neurosteroids (PubMed:10588945, 24491228). Mechanistically, uses molecular oxygen inserting one oxygen atom into a substrate, and reducing the second into a water molecule, with two electrons provided by NADPH via cytochrome P450 reductase (CPR; NADPH-ferrihemoprotein reductase) (PubMed:10588945, 24491228). Catalyzes the hydroxylation of carbon hydrogen bonds of steroids with a preference for 7-alpha position (PubMed:10588945, 24491228). Usually metabolizes steroids carrying a hydroxy group at position 3, functioning as a 3-hydroxy steroid 7-alpha hydroxylase (PubMed:24491228). Hydroxylates oxysterols, including 25-hydroxycholesterol and (25R)-cholest-5-ene-3beta,26-diol toward 7-alpha hydroxy derivatives, which may be transported to the liver and converted to bile acids (PubMed:10588945, 9802883). Via its product 7-alpha,25-dihydroxycholesterol, a ligand for the chemotactic G protein-coupled receptor GPR183/EBI2, regulates B cell migration in germinal centers of lymphoid organs, thus guiding efficient maturation of plasma B cells and overall antigen-specific humoral immune response (By similarity). 7-alpha hydroxylates neurosteroids, including 3beta-hydroxyandrost-5-en-17-one (dehydroepiandrosterone) and pregnenolone, both involved in hippocampus-associated memory and learning (PubMed:24491228). Metabolizes androstanoids toward 6- or 7-alpha hydroxy derivatives (PubMed:24491228). ( CP7B1_HUMAN,O75881 )
## 7 Glycolytic enzyme the catalyzes the conversion of 2-phosphoglycerate to phosphoenolpyruvate (PubMed:1369209, 29775581). In addition to glycolysis, involved in various processes such as growth control, hypoxia tolerance and allergic responses (PubMed:10802057, 12666133, 2005901, 29775581). May also function in the intravascular and pericellular fibrinolytic system due to its ability to serve as a receptor and activator of plasminogen on the cell surface of several cell-types such as leukocytes and neurons (PubMed:12666133). Stimulates immunoglobulin production (PubMed:1369209). ( ENOA_HUMAN,P06733 )\n\n[Isoform MBP-1]: Binds to the myc promoter and acts as a transcriptional repressor. May be a tumor suppressor. ( ENOA_HUMAN,P06733 )
## 8 Binds calcium (via the C2 domains) and translocates to sites of contact between the endoplasmic reticulum and the cell membrane in response to increased cytosolic calcium levels (PubMed:23791178, 24183667). Helps tether the endoplasmic reticulum to the cell membrane and promotes the formation of appositions between the endoplasmic reticulum and the cell membrane (PubMed:24183667). Acts as an inhibitor of ADGRD1 G-protein-coupled receptor activity in absence of cytosolic calcium (PubMed:38758649). Binds glycerophospholipids in a barrel-like domain and may play a role in cellular lipid transport (By similarity). ( ESYT1_HUMAN,Q9BSJ8 )
## 31 E3 ubiquitin-protein ligase required for accumulation of repair proteins to sites of DNA damage. Acts with UBE2N/UBC13 to amplify the RNF8-dependent histone ubiquitination. Recruited to sites of DNA damage at double-strand breaks (DSBs) by binding to ubiquitinated histone H2A and H2AX and amplifies the RNF8-dependent H2A ubiquitination, promoting the formation of 'Lys-63'-linked ubiquitin conjugates. This leads to concentrate ubiquitinated histones H2A and H2AX at DNA lesions to the threshold required for recruitment of TP53BP1 and BRCA1. Also recruited at DNA interstrand cross-links (ICLs) sites and promotes accumulation of 'Lys-63'-linked ubiquitination of histones H2A and H2AX, leading to recruitment of FAAP20/C1orf86 and Fanconi anemia (FA) complex, followed by interstrand cross-link repair. H2A ubiquitination also mediates the ATM-dependent transcriptional silencing at regions flanking DSBs in cis, a mechanism to avoid collision between transcription and repair intermediates. Also involved in class switch recombination in immune system, via its role in regulation of DSBs repair. Following DNA damage, promotes the ubiquitination and degradation of JMJD2A/KDM4A in collaboration with RNF8, leading to unmask H4K20me2 mark and promote the recruitment of TP53BP1 at DNA damage sites. Not able to initiate 'Lys-63'-linked ubiquitination in vitro; possibly due to partial occlusion of the UBE2N/UBC13-binding region. Catalyzes monoubiquitination of 'Lys-13' and 'Lys-15' of nucleosomal histone H2A (H2AK13Ub and H2AK15Ub, respectively). ( RN168_HUMAN,Q8IYW5 )
## overallGeneSummary
## 2 involved in cell replication of mitosis early phases and when destroyed by autonomic responses the host has cancer, gene located on long arm q of chromosome 1.
## 5 encodes a protein that is involved in network of processes with cholesterol and drug digestion as well as creating bile acids from cholesterol to digest food macromolecules like fats and proteins. Found in plasma membrane of endoplasmic reticulum of liver. Gene located on long arm q of chromosome 8.
## 7 involved in breaking down fat stores for energy in glycolysis and tumor supressing and defective or involved in thyroid disease of genetic Hashimoto's syndrome. This gene is located on short arm p of chromosome 1.
## 8 relates to lipid binding and phospholipid binding and lipid membrane transport in endoplasmic reticulum membrane and cytosol membrane, takes up excess calcium in cytosol of cell and transports it to endoplasmic reticulum membrane, inhibits activity of a coupled receptor called the ADGRD1 G-protein-coupled receptor in absence of calcium in cytosol. AI search returned that ADGRD1 G is a protein found upregulated in trascription when in states of hypoxia and related to glioblastoma growth when upregulated, also it is responsible for many physiological functions in the body and muscle growth. This gene is located on long arm q of chromosome 12.
## 31 This gene is a DNA repairing gene that seeks damaged DNA and binds the chromatin at the histones of wrapped DNA to mend the damaged DNA and replace nucleotide sequences or if not repairable then to communicate to TP53 and BRC1 proteins of cell death needed for that cellular DNA. Gene located on long arm q of chromosome 3.
The top 5 genes have summaries that say they are involved in the cellular processes for replication, bile formation from cholesterol and drug digestion, tumor suppressing and glycolysis, cell membrane lipid binding, and repairing DNA damage. Seems like the genes selected are being most expressed or under expressed when infected with Lyme disease at various stages up to six months post infection. Are these genes being more up regulated or down regulated? Lets see from the data on overall change from acute infection vs chronic infection in fold changes to healthy sample means. The file for fold change values in sample comparisons can be found by clicking here for data
foldchanges <- read.csv('top33genes100varsChrSumms.csv', header=T, na.strings=c('',' ','na','NA'))
colnames(foldchanges)
## [1] "Gene"
## [2] "healthyControl_1"
## [3] "healthyControl_2"
## [4] "healthyControl_3"
## [5] "healthyControl_4"
## [6] "healthyControl_5"
## [7] "healthyControl_6"
## [8] "healthyControl_7"
## [9] "healthyControl_8"
## [10] "healthyControl_9"
## [11] "healthyControl_10"
## [12] "healthyControl_11"
## [13] "healthyControl_12"
## [14] "healthyControl_13"
## [15] "healthyControl_14"
## [16] "healthyControl_15"
## [17] "healthyControl_16"
## [18] "healthyControl_17"
## [19] "healthyControl_18"
## [20] "healthyControl_19"
## [21] "healthyControl_20"
## [22] "healthyControl_21"
## [23] "acuteLymeDisease_1"
## [24] "acuteLymeDisease_2"
## [25] "acuteLymeDisease_3"
## [26] "acuteLymeDisease_4"
## [27] "acuteLymeDisease_5"
## [28] "acuteLymeDisease_6"
## [29] "acuteLymeDisease_7"
## [30] "acuteLymeDisease_8"
## [31] "acuteLymeDisease_9"
## [32] "acuteLymeDisease_10"
## [33] "acuteLymeDisease_11"
## [34] "acuteLymeDisease_12"
## [35] "acuteLymeDisease_13"
## [36] "acuteLymeDisease_14"
## [37] "acuteLymeDisease_15"
## [38] "acuteLymeDisease_16"
## [39] "acuteLymeDisease_17"
## [40] "acuteLymeDisease_18"
## [41] "acuteLymeDisease_19"
## [42] "acuteLymeDisease_20"
## [43] "acuteLymeDisease_21"
## [44] "acuteLymeDisease_22"
## [45] "acuteLymeDisease_23"
## [46] "acuteLymeDisease_24"
## [47] "acuteLymeDisease_25"
## [48] "acuteLymeDisease_26"
## [49] "acuteLymeDisease_27"
## [50] "acuteLymeDisease_28"
## [51] "Antibodies_1month_1"
## [52] "Antibodies_1month_2"
## [53] "Antibodies_1month_3"
## [54] "Antibodies_1month_4"
## [55] "Antibodies_1month_5"
## [56] "Antibodies_1month_6"
## [57] "Antibodies_1month_7"
## [58] "Antibodies_1month_8"
## [59] "Antibodies_1month_9"
## [60] "Antibodies_1month_10"
## [61] "Antibodies_1month_11"
## [62] "Antibodies_1month_12"
## [63] "Antibodies_1month_13"
## [64] "Antibodies_1month_14"
## [65] "Antibodies_1month_15"
## [66] "Antibodies_1month_16"
## [67] "Antibodies_1month_17"
## [68] "Antibodies_1month_18"
## [69] "Antibodies_1month_19"
## [70] "Antibodies_1month_20"
## [71] "Antibodies_1month_21"
## [72] "Antibodies_1month_22"
## [73] "Antibodies_1month_23"
## [74] "Antibodies_1month_24"
## [75] "Antibodies_1month_25"
## [76] "Antibodies_1month_26"
## [77] "Antibodies_1month_27"
## [78] "Antibodies_6months_1"
## [79] "Antibodies_6months_2"
## [80] "Antibodies_6months_3"
## [81] "Antibodies_6months_4"
## [82] "Antibodies_6months_5"
## [83] "Antibodies_6months_6"
## [84] "Antibodies_6months_7"
## [85] "Antibodies_6months_8"
## [86] "Antibodies_6months_9"
## [87] "Antibodies_6months_10"
## [88] "healthy_Mean"
## [89] "acuteLymeDisease_Mean"
## [90] "antibodies_1month_Mean"
## [91] "antibodies_6month_Mean"
## [92] "acuteHealthy_foldChange"
## [93] "antibodies_1month_healthy_foldChange"
## [94] "antibodies_6month_healthy_foldchange"
## [95] "acute_vs_6month_foldchange"
## [96] "Chromosomal.Location"
## [97] "GeneCardsGeneSummary"
## [98] "NCBI_GeneSummary"
## [99] "UniProtKB_Swiss.Prot_GeneSummary"
## [100] "overallGeneSummary"
changes <- foldchanges[foldchanges$Gene %in% topGenes5,]
changes6monthsVsHealthy <- changes[,c(1,94)]
changes6monthsVsHealthy
## Gene antibodies_6month_healthy_foldchange
## 2 CENPF -6896.970
## 5 CYP7B1 1067.072
## 7 ENO1 1646.317
## 8 ESYT1 1952.210
## 31 RNF168 2964.558
It looks like all were upregulated except one gene, CENPF, that was the top important feature in classification of the top algorithm’s basic settings for rcart in the caret package. There was a foldchange increase of 1-2 thousand times healthy mean values for the genes CYP7B1, ENO1, ESYT1, and RNF168. The only gene downregulated was for mitotic activity in cell replication down regulated to almost 7,000 times the regular healthy mean values. Fat lipid metabolism and digestion to make bile and digest more fats in the liver as well as repairing DNA in the nucleus of the cell are the other top four genes that are upregulated after infection with Lyme disease.
This model scored 42% Accuracy in prediction of 4 classes or states of infection in Lyme disease. It can be made better by increasing accuracy by tuning the model, and so can the other models. As when using a different seed in a previous run the random forest, another ensemble model, scored the best at 62% and the others were closer to 50%. It depends on the randomness generated with the set.seed function as well as parameters tuned, like more cross validation folds, or a different metric other than Accuracy. Looking at the class metric statistis for sensitivity and specificity, lets pull back up the confusion matrix for the rcart model.
rcartDF <- data.frame(predictCart, type=testing$class)
rcartDF
## predictCart type
## 1 acute infection healthy
## 2 1 month infected healthy
## 3 1 month infected healthy
## 4 healthy healthy
## 5 healthy acute infection
## 6 healthy acute infection
## 7 acute infection acute infection
## 8 acute infection acute infection
## 9 acute infection acute infection
## 10 healthy 1 month infected
## 11 1 month infected 1 month infected
## 12 acute infection 1 month infected
## 13 healthy 1 month infected
## 14 1 month infected 1 month infected
## 15 1 month infected 6 months infected
## 16 healthy 6 months infected
We see from data above that none of the samples of 6 months infection are actually in the testing set. Which is strange because the testing set count of each class said there were some six month infected samples in it. Lets double check.
testing %>% group_by(class) %>% count(class)
## # A tibble: 4 × 2
## # Groups: class [4]
## class n
## <fct> <int>
## 1 1 month infected 5
## 2 6 months infected 2
## 3 acute infection 5
## 4 healthy 4
Well, thats good to know it was my error in not scrolling to 2nd page in window. So there are 2 samples of the 6 months infected class. There are a few samples in each class from our imbalanced data.
rcartDF %>% group_by(predictCart) %>% count(predictCart)
## # A tibble: 3 × 2
## # Groups: predictCart [3]
## predictCart n
## <fct> <int>
## 1 1 month infected 5
## 2 acute infection 5
## 3 healthy 6
The rcart algorithm, however, only predicted two classes, and left out the healthy and 6 month infected samples in favor of going with the majority class of 1 month infected. Lets just review how many samples are in each class from original data of 86 samples.
genes33 %>% group_by(class) %>% count(class)
## # A tibble: 4 × 2
## # Groups: class [4]
## class n
## <fct> <int>
## 1 1 month infected 27
## 2 6 months infected 10
## 3 acute infection 28
## 4 healthy 21
Actually, there are more acute infection samples by 28 compared to the 27 samples of 1 month infected. There are only 10 samples of the six months infected and 21 of the healthy samples in our data.
confusionMatrix(predictCart, testing$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 month infected 6 months infected acute infection healthy
## 1 month infected 2 1 0 2
## 6 months infected 0 0 0 0
## acute infection 1 0 3 1
## healthy 2 1 2 1
##
## Overall Statistics
##
## Accuracy : 0.375
## 95% CI : (0.152, 0.6457)
## No Information Rate : 0.3125
## P-Value [Acc > NIR] : 0.382
##
## Kappa : 0.1209
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 month infected Class: 6 months infected
## Sensitivity 0.4000 0.000
## Specificity 0.7273 1.000
## Pos Pred Value 0.4000 NaN
## Neg Pred Value 0.7273 0.875
## Prevalence 0.3125 0.125
## Detection Rate 0.1250 0.000
## Detection Prevalence 0.3125 0.000
## Balanced Accuracy 0.5636 0.500
## Class: acute infection Class: healthy
## Sensitivity 0.6000 0.2500
## Specificity 0.8182 0.5833
## Pos Pred Value 0.6000 0.1667
## Neg Pred Value 0.8182 0.7000
## Prevalence 0.3125 0.2500
## Detection Rate 0.1875 0.0625
## Detection Prevalence 0.3125 0.3750
## Balanced Accuracy 0.7091 0.4167
By looking at the summary statistics, the specificity to rule in a class based on these genes (in rcart) was 100% for healthy and also for 6 months infection as those were not ruled in at all for any of the 2 samples of 6 months infected or the 4 samples of healthy status. While the sensitivity or snout to rule out in 1 month infected is 60% and 80% for acute infection. This wasn’t the best model overall as it just uses the odds in numbers in the data on the surface, but there are decisions and boundaries that are made in analyzing the data that the model builds upon to improve and get the best prediction.
For now, we can just say, as we do see these same genes for the most part in the other important variable features of the other models selected. So in first 6 months of Lyme disease assume cell replication slows while the DNA repair is enhanced and metabolic activity to have more lipids in the body. Which is what is used in the cell membrane of our cells in all systems of our bodies. Fat transport to make hormones, steroids, and cell membranes, bile production to digest fats and lipids. We are slowing down cell replication while speeding up DNA repair and also sending more byproducts into the body to build cell outer structure, use endocrine and reproductive hormones as well as steroids. People on chemotherapy are given steroids like prednisone. And chemotherapy slows cell replication. It looks like overall treatment of cancer is the same mechanism the body does in self repair in healing as shown in top genes being expressed in healthy, acute infection, one month post infection, and six months post infection or chronic infection states.
Thanks so much, more can be done to tune this model schematic, but this helps expand on revelations of this rpub document and research using R.