33 genes 4 classes LymeDisease with 6 Models

Here is some Lyme disease data put together in data frame to test top 33 genes against the target class of four classes: healthy, acute infection, one month after infection, and six months after infection. Data retrieved from NCBI platform study GPL13667-15572 and using the series matrix gene expression data in that study from GSE145974_series_matrix all from the Lyme Disease study done more than five years ago. Put together this is a very large datafile to manipulate into the working data frame for this R markdown document. Get that file by clicking this file

R packages are some of the basics for machine learning to predict one of the four classes like dplyr, tidyr, caret, and kernlab. This runs through the caret package machine learning algorithms of linear discriminate analysis or LDA, support vector machines or SVM for radial and linear separately, decision trees in the rcart algorithm, random forest algorithm, and KNN.

The preset tuning uses 10 cross fold validations to take 9 samples and test on 10th sample in the training. The training sets all the same 80% data of 86 samples and testing on the testing set of 20% of the remaining data. With the metric also set to ‘Accuracy.’

Other tuning methods could be used to fine tune and get better accuracy. Summary model statistics for sensitivity per class and specificity as well as the variance for each model and basic plots to visualize.

Background on this data frame: the data has almost 20,000 genes and 86 samples that a fold change with basic division of the 3 classes of acute, 1 month infection, and 6 months of infection are divided by the healthy comparisons by mean of those samples in each class. This data is not balanced as there are far too few chronic case samples compared to the acute, new, and chronic infections.

The fold change of the genes were used to see how gene expression changes in the cell and body per infection changes compared to healthy maintenance levels of the gene expression values. There are networks of genes that work in combination, and there are genes that will be upregulated or downregulated and those top changes as the ordered list of top 10 and bottom 10 when ordered in highest to lowest gene expression values per class are selected. Common genes made the set 33 genes instead of 20 per class of 80 genes. These genes had hundreds to hundreds of thousands of times fold change in up regulation of expression or down regulation of expression. These are clues to finding genes that are biomarkers of disease, and in this case Lyme disease. At the end we will get the top genes from the best accuracy model and read those gene summaries for an idea of the changes the Lyme disease host go through when infected.

click ‘dataset’ to get this dataset

genes33 <- read.csv('dataframeML-class-33genes.csv', header=T, na.strings=c('',' ','na','NA'))
head(genes33)

##            HPGD      RNF168       CTXN3       FRS3      KHDRBS3       POU4F2
## 1  0.0000453905 -0.00027800 -0.00000207 -0.0001400 -0.000219464 -0.000424625
## 2 -0.0960063920 -0.08893681 -0.04171753  0.5222826  0.320739270  0.046180010
## 3 -0.1628251360 -0.10046721  0.24628321  0.0001400  0.138545035  0.909273735
## 4  0.3786586210 -0.23633862 -0.13472350  0.1984134 -0.019687891 -0.464481365
## 5  0.2273819170 -0.01087332 -0.22562329 -0.3893075 -0.051819205 -0.246123550
## 6 -0.0666730130  0.18056202  0.21416879 -0.6040711  0.096314075  0.152537220
##         RGPD3       CABP1        CENPF         DLG3         ENO1           F2
## 1 -0.00024600  0.00003390  0.000020504 -0.000759415 -0.000192163  0.000134228
## 2  0.26228500  0.20398974 -0.061241150 -0.133482646 -1.344003333 -0.076257705
## 3 -0.04079795  0.53017545 -0.331103565 -0.208074767 -0.628050333  0.407919501
## 4 -0.37593746  0.03769469  0.225760225 -0.164650345  0.118298210 -0.298144263
## 5  0.19060660 -0.22982526 -0.158033967  0.047922324 -0.327089150 -0.219089430
## 6  0.26550126  0.14261413 -0.449312565  0.024404717 -0.417879100  0.365434963
##           GATC       IGFALS      OR52A4        PDZRN3      POU3F2       PSMF1
## 1 -0.000256000 -0.000343000 -0.00027200  5.150233e-05 -0.00003100 -0.00008070
## 2 -0.127287860  0.113092660  0.06674075  9.941888e-02  0.23925924  0.07158557
## 3 -0.152866360  0.423058270  0.27201056  3.188891e-01  0.12777781 -0.45017878
## 4 -0.359013560  0.101755140 -0.26390958 -1.342481e-01 -0.07923627  0.05658539
## 5 -0.038307190  0.496858360  0.01595378 -1.581577e-01 -0.10329986 -0.05105789
## 6 -0.002932549  0.007001638  0.08342671  8.387852e-02 -0.05624676  0.04468243
##         ISG20      KCNJ16      ESYT1       MAP2K7     TMEM194A        HECW1
## 1 -0.00000525  0.00022185 -0.0005580 -0.000249864 -0.000213862 -0.000307797
## 2 -0.81921244 -0.14524805 -0.2626066  0.140063432  0.000878652 -0.085825205
## 3 -1.26617620  0.11858976  0.1145935  0.605252648 -0.014608858  0.706030240
## 4  0.22721529 -0.29451060  0.2984104  0.041792538  0.044892470 -0.165934685
## 5  0.11260080 -0.22437679  0.2822208 -0.020431372 -0.248416507 -0.276568890
## 6 -0.07338428  0.63788391  0.5622082  0.423514219 -0.073202532  0.296823980
##     LOC400657       SLC1A1      CLEC2L      CYP7B1       FAM162A       NUDT18
## 1 -0.00065100  0.000299095  0.00003390  0.00007770  0.0000765325  0.001091003
## 2 -0.42919350 -0.032329679 -0.05779099 -0.13945055 -0.6245107500 -0.235054730
## 3 -0.12039113  0.131755352  0.23713994  0.48965120 -0.7001379900  0.881111150
## 4 -0.45719910 -0.076371432 -0.29611158  0.11624050  0.0864962350 -0.086553570
## 5 -0.13633752 -0.239856365 -0.34715940 -0.07544518  0.8462015400 -0.254269360
## 6 -0.06637859  0.177237513  0.17190409  0.17962694 -0.2148986000  1.138130700
##         OTOS         PEX26      PRR24   class
## 1  0.0002070  8.380425e-05 -0.0005050 healthy
## 2 -0.2211900 -2.901953e-02 -0.5969741 healthy
## 3  0.8309946  1.364853e-01 -0.5586877 healthy
## 4 -0.1290300 -1.274686e-02 -0.1085486 healthy
## 5 -0.1959085  1.454494e-01 -0.3364162 healthy
## 6  0.2405429  2.032977e-02  0.3638177 healthy

This data is numeric with a need to change the class target feature into a factor as it needs to be a factor and not a character class.

summary(genes33)

##       HPGD              RNF168             CTXN3                FRS3         
##  Min.   :-0.31984   Min.   :-1.04822   Min.   :-0.253198   Min.   :-0.63466  
##  1st Qu.:-0.08600   1st Qu.:-0.21080   1st Qu.:-0.081222   1st Qu.:-0.16258  
##  Median : 0.01072   Median : 0.00000   Median : 0.005715   Median : 0.00000  
##  Mean   : 0.05300   Mean   : 0.02671   Mean   : 0.042498   Mean   : 0.03239  
##  3rd Qu.: 0.16581   3rd Qu.: 0.25133   3rd Qu.: 0.127744   3rd Qu.: 0.23532  
##  Max.   : 0.72735   Max.   : 1.15288   Max.   : 0.790707   Max.   : 0.89720  
##     KHDRBS3             POU4F2             RGPD3               CABP1         
##  Min.   :-0.44025   Min.   :-0.46448   Min.   :-0.575496   Min.   :-0.66681  
##  1st Qu.:-0.10231   1st Qu.:-0.11446   1st Qu.:-0.206396   1st Qu.:-0.23453  
##  Median :-0.00458   Median : 0.02521   Median : 0.000000   Median : 0.00000  
##  Mean   : 0.03926   Mean   : 0.04024   Mean   : 0.004775   Mean   : 0.01761  
##  3rd Qu.: 0.15134   3rd Qu.: 0.13089   3rd Qu.: 0.127204   3rd Qu.: 0.20924  
##  Max.   : 0.74125   Max.   : 0.90927   Max.   : 1.071542   Max.   : 0.79515  
##      CENPF               DLG3                ENO1                 F2          
##  Min.   :-0.85451   Min.   :-0.613058   Min.   :-1.470905   Min.   :-0.40990  
##  1st Qu.:-0.19664   1st Qu.:-0.184464   1st Qu.:-0.324407   1st Qu.:-0.17769  
##  Median :-0.02914   Median :-0.006313   Median :-0.006482   Median : 0.00176  
##  Mean   : 0.03726   Mean   : 0.021518   Mean   :-0.049029   Mean   : 0.03227  
##  3rd Qu.: 0.24548   3rd Qu.: 0.209471   3rd Qu.: 0.247317   3rd Qu.: 0.18877  
##  Max.   : 1.55048   Max.   : 0.842132   Max.   : 0.889147   Max.   : 0.76103  
##       GATC              IGFALS             OR52A4              PDZRN3        
##  Min.   :-1.21408   Min.   :-0.87045   Min.   :-0.668849   Min.   :-0.27905  
##  1st Qu.:-0.20289   1st Qu.:-0.26604   1st Qu.:-0.174138   1st Qu.:-0.09242  
##  Median : 0.00000   Median : 0.00000   Median : 0.000000   Median : 0.01071  
##  Mean   : 0.04151   Mean   : 0.00202   Mean   : 0.005875   Mean   : 0.04342  
##  3rd Qu.: 0.22343   3rd Qu.: 0.21561   3rd Qu.: 0.128767   3rd Qu.: 0.14553  
##  Max.   : 1.22686   Max.   : 0.87783   Max.   : 1.217344   Max.   : 0.78095  
##      POU3F2             PSMF1               ISG20           
##  Min.   :-0.42853   Min.   :-0.637856   Min.   :-2.281e+00  
##  1st Qu.:-0.11388   1st Qu.:-0.183982   1st Qu.:-3.640e-01  
##  Median : 0.00000   Median : 0.031119   Median :-2.400e-07  
##  Mean   : 0.05579   Mean   :-0.003446   Mean   :-1.294e-02  
##  3rd Qu.: 0.14947   3rd Qu.: 0.167941   3rd Qu.: 3.336e-01  
##  Max.   : 1.55298   Max.   : 0.599328   Max.   : 1.347e+00  
##      KCNJ16              ESYT1              MAP2K7            TMEM194A        
##  Min.   :-0.334332   Min.   :-1.26733   Min.   :-0.89929   Min.   :-0.620697  
##  1st Qu.:-0.133381   1st Qu.:-0.19163   1st Qu.:-0.23288   1st Qu.:-0.106241  
##  Median : 0.004698   Median : 0.00000   Median :-0.01951   Median : 0.002695  
##  Mean   : 0.072905   Mean   : 0.05526   Mean   :-0.01275   Mean   : 0.044047  
##  3rd Qu.: 0.231967   3rd Qu.: 0.34674   3rd Qu.: 0.17043   3rd Qu.: 0.218735  
##  Max.   : 1.146404   Max.   : 2.03143   Max.   : 0.92057   Max.   : 0.649779  
##      HECW1            LOC400657            SLC1A1             CLEC2L        
##  Min.   :-0.43745   Min.   :-0.75434   Min.   :-0.37749   Min.   :-0.79850  
##  1st Qu.:-0.13235   1st Qu.:-0.19917   1st Qu.:-0.07584   1st Qu.:-0.27268  
##  Median : 0.01114   Median : 0.00000   Median : 0.02260   Median : 0.00000  
##  Mean   : 0.04916   Mean   :-0.01098   Mean   : 0.03665   Mean   : 0.01269  
##  3rd Qu.: 0.18338   3rd Qu.: 0.19267   3rd Qu.: 0.13522   3rd Qu.: 0.20342  
##  Max.   : 1.12452   Max.   : 1.39459   Max.   : 0.69026   Max.   : 1.08305  
##      CYP7B1            FAM162A              NUDT18                OTOS        
##  Min.   :-0.70375   Min.   :-0.758900   Min.   :-8.531e-01   Min.   :-0.6481  
##  1st Qu.:-0.16720   1st Qu.:-0.246998   1st Qu.:-2.411e-01   1st Qu.:-0.1497  
##  Median : 0.00000   Median : 0.003583   Median : 1.200e-07   Median : 0.0000  
##  Mean   : 0.02685   Mean   : 0.011434   Mean   : 3.641e-02   Mean   : 0.0624  
##  3rd Qu.: 0.17932   3rd Qu.: 0.250876   3rd Qu.: 2.974e-01   3rd Qu.: 0.2135  
##  Max.   : 1.73670   Max.   : 0.989418   Max.   : 1.138e+00   Max.   : 1.1077  
##      PEX26               PRR24             class          
##  Min.   :-0.553936   Min.   :-0.99676   Length:86         
##  1st Qu.:-0.103130   1st Qu.:-0.26198   Class :character  
##  Median :-0.013050   Median : 0.00000   Mode  :character  
##  Mean   : 0.002015   Mean   :-0.01048                     
##  3rd Qu.: 0.162993   3rd Qu.: 0.23194                     
##  Max.   : 0.487500   Max.   : 1.24504

genes33$class <- as.factor(genes33$class)
class(genes33$class)

## [1] "factor"

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(kernlab)

## 
## Attaching package: 'kernlab'

## The following object is masked from 'package:ggplot2':
## 
##     alpha

intraining <- createDataPartition(genes33$class, p=.80, list=F)
training <- genes33[intraining,]
testing <- genes33[-intraining,]

We can see the training set values in each class and the testing set values that we are testing our model’s accuracy in prediction of class based on these genes and the model’s specific algorithm but all having same metric and cross validation settings of Accuracy and 10 respectively.

training %>% group_by(class) %>% count(class)

## # A tibble: 4 × 2
## # Groups:   class [4]
##   class                 n
##   <fct>             <int>
## 1 1 month infected     22
## 2 6 months infected     8
## 3 acute infection      23
## 4 healthy              17

testing %>% group_by(class) %>% count(class)

## # A tibble: 4 × 2
## # Groups:   class [4]
##   class                 n
##   <fct>             <int>
## 1 1 month infected      5
## 2 6 months infected     2
## 3 acute infection       5
## 4 healthy               4

control = trainControl(method='cv', number=10)
metric="Accuracy"

Setting the seed before each run will make sure the same computation is as it is done in this lab run.

set.seed(4567)
fitlda <- train(class~., data=training, method='lda', metric=metric, trControl=control)
predictLDA <- predict(fitlda, testing)

data <- data.frame(predictLDA, type=testing$class)
data

##           predictLDA              type
## 1    acute infection           healthy
## 2   1 month infected           healthy
## 3            healthy           healthy
## 4            healthy           healthy
## 5    acute infection   acute infection
## 6  6 months infected   acute infection
## 7    acute infection   acute infection
## 8    acute infection   acute infection
## 9   1 month infected   acute infection
## 10   acute infection  1 month infected
## 11  1 month infected  1 month infected
## 12   acute infection  1 month infected
## 13   acute infection  1 month infected
## 14   acute infection  1 month infected
## 15 6 months infected 6 months infected
## 16 6 months infected 6 months infected

The data frame above shows how well the model predicted each class and what the actual class was in the testing set that the model was supposed to predict with high accuracy. We can run the other models and see which one scored best in accuracy.

confusionMatrix(predictLDA,testing$class)

## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          1 month infected 6 months infected acute infection healthy
##   1 month infected                 1                 0               1       1
##   6 months infected                0                 2               1       0
##   acute infection                  4                 0               3       1
##   healthy                          0                 0               0       2
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.2465, 0.7535)
##     No Information Rate : 0.3125          
##     P-Value [Acc > NIR] : 0.0918          
##                                           
##                   Kappa : 0.3155          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 month infected Class: 6 months infected
## Sensitivity                           0.2000                   1.0000
## Specificity                           0.8182                   0.9286
## Pos Pred Value                        0.3333                   0.6667
## Neg Pred Value                        0.6923                   1.0000
## Prevalence                            0.3125                   0.1250
## Detection Rate                        0.0625                   0.1250
## Detection Prevalence                  0.1875                   0.1875
## Balanced Accuracy                     0.5091                   0.9643
##                      Class: acute infection Class: healthy
## Sensitivity                          0.6000         0.5000
## Specificity                          0.5455         1.0000
## Pos Pred Value                       0.3750         1.0000
## Neg Pred Value                       0.7500         0.8571
## Prevalence                           0.3125         0.2500
## Detection Rate                       0.1875         0.1250
## Detection Prevalence                 0.5000         0.1250
## Balanced Accuracy                    0.5727         0.7500

set.seed(4567)
fitcart <- train(class~., data=training, method='rpart', metric=metric, trControl=control)
predictCart <- predict(fitcart,testing)
data <- data.frame(predictCart, type= testing$class)
data

##         predictCart              type
## 1   acute infection           healthy
## 2  1 month infected           healthy
## 3  1 month infected           healthy
## 4           healthy           healthy
## 5           healthy   acute infection
## 6           healthy   acute infection
## 7   acute infection   acute infection
## 8   acute infection   acute infection
## 9   acute infection   acute infection
## 10          healthy  1 month infected
## 11 1 month infected  1 month infected
## 12  acute infection  1 month infected
## 13          healthy  1 month infected
## 14 1 month infected  1 month infected
## 15 1 month infected 6 months infected
## 16          healthy 6 months infected

confusionMatrix(predictCart,testing$class)

## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          1 month infected 6 months infected acute infection healthy
##   1 month infected                 2                 1               0       2
##   6 months infected                0                 0               0       0
##   acute infection                  1                 0               3       1
##   healthy                          2                 1               2       1
## 
## Overall Statistics
##                                          
##                Accuracy : 0.375          
##                  95% CI : (0.152, 0.6457)
##     No Information Rate : 0.3125         
##     P-Value [Acc > NIR] : 0.382          
##                                          
##                   Kappa : 0.1209         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 1 month infected Class: 6 months infected
## Sensitivity                           0.4000                    0.000
## Specificity                           0.7273                    1.000
## Pos Pred Value                        0.4000                      NaN
## Neg Pred Value                        0.7273                    0.875
## Prevalence                            0.3125                    0.125
## Detection Rate                        0.1250                    0.000
## Detection Prevalence                  0.3125                    0.000
## Balanced Accuracy                     0.5636                    0.500
##                      Class: acute infection Class: healthy
## Sensitivity                          0.6000         0.2500
## Specificity                          0.8182         0.5833
## Pos Pred Value                       0.6000         0.1667
## Neg Pred Value                       0.8182         0.7000
## Prevalence                           0.3125         0.2500
## Detection Rate                       0.1875         0.0625
## Detection Prevalence                 0.3125         0.3750
## Balanced Accuracy                    0.7091         0.4167

set.seed(4567)
fitknn <- train(class~., data=training, method='knn', metric=metric, trControl=control)

predictKNN <- predict(fitknn,testing)
data <- data.frame(predictKNN, type=testing$class)
data

##           predictKNN              type
## 1            healthy           healthy
## 2   1 month infected           healthy
## 3            healthy           healthy
## 4            healthy           healthy
## 5   1 month infected   acute infection
## 6            healthy   acute infection
## 7    acute infection   acute infection
## 8   1 month infected   acute infection
## 9   1 month infected   acute infection
## 10           healthy  1 month infected
## 11   acute infection  1 month infected
## 12   acute infection  1 month infected
## 13  1 month infected  1 month infected
## 14           healthy  1 month infected
## 15  1 month infected 6 months infected
## 16 6 months infected 6 months infected

confusionMatrix(predictKNN, testing$class)

## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          1 month infected 6 months infected acute infection healthy
##   1 month infected                 1                 1               3       1
##   6 months infected                0                 1               0       0
##   acute infection                  2                 0               1       0
##   healthy                          2                 0               1       3
## 
## Overall Statistics
##                                          
##                Accuracy : 0.375          
##                  95% CI : (0.152, 0.6457)
##     No Information Rate : 0.3125         
##     P-Value [Acc > NIR] : 0.382          
##                                          
##                   Kappa : 0.1351         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 1 month infected Class: 6 months infected
## Sensitivity                           0.2000                   0.5000
## Specificity                           0.5455                   1.0000
## Pos Pred Value                        0.1667                   1.0000
## Neg Pred Value                        0.6000                   0.9333
## Prevalence                            0.3125                   0.1250
## Detection Rate                        0.0625                   0.0625
## Detection Prevalence                  0.3750                   0.0625
## Balanced Accuracy                     0.3727                   0.7500
##                      Class: acute infection Class: healthy
## Sensitivity                          0.2000         0.7500
## Specificity                          0.8182         0.7500
## Pos Pred Value                       0.3333         0.5000
## Neg Pred Value                       0.6923         0.9000
## Prevalence                           0.3125         0.2500
## Detection Rate                       0.0625         0.1875
## Detection Prevalence                 0.1875         0.3750
## Balanced Accuracy                    0.5091         0.7500

set.seed(4567)
fitRF <- train(class~., data=training, method='rf', metric=metric, trControl=control)
predictRF <- predict(fitRF,testing)
data <- data.frame(predictRF, type=testing$class)
data

##            predictRF              type
## 1    acute infection           healthy
## 2   1 month infected           healthy
## 3   1 month infected           healthy
## 4            healthy           healthy
## 5    acute infection   acute infection
## 6    acute infection   acute infection
## 7    acute infection   acute infection
## 8    acute infection   acute infection
## 9    acute infection   acute infection
## 10           healthy  1 month infected
## 11  1 month infected  1 month infected
## 12   acute infection  1 month infected
## 13   acute infection  1 month infected
## 14  1 month infected  1 month infected
## 15  1 month infected 6 months infected
## 16 6 months infected 6 months infected

confusionMatrix(predictRF,testing$class)

## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          1 month infected 6 months infected acute infection healthy
##   1 month infected                 2                 1               0       2
##   6 months infected                0                 1               0       0
##   acute infection                  2                 0               5       1
##   healthy                          1                 0               0       1
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5625          
##                  95% CI : (0.2988, 0.8025)
##     No Information Rate : 0.3125          
##     P-Value [Acc > NIR] : 0.03338         
##                                           
##                   Kappa : 0.3812          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 month infected Class: 6 months infected
## Sensitivity                           0.4000                   0.5000
## Specificity                           0.7273                   1.0000
## Pos Pred Value                        0.4000                   1.0000
## Neg Pred Value                        0.7273                   0.9333
## Prevalence                            0.3125                   0.1250
## Detection Rate                        0.1250                   0.0625
## Detection Prevalence                  0.3125                   0.0625
## Balanced Accuracy                     0.5636                   0.7500
##                      Class: acute infection Class: healthy
## Sensitivity                          1.0000         0.2500
## Specificity                          0.7273         0.9167
## Pos Pred Value                       0.6250         0.5000
## Neg Pred Value                       1.0000         0.7857
## Prevalence                           0.3125         0.2500
## Detection Rate                       0.3125         0.0625
## Detection Prevalence                 0.5000         0.1250
## Balanced Accuracy                    0.8636         0.5833

set.seed(4567)
fitsvmRadial <- train(class~., data=training, method='svmRadial', metric=metric, trControl=control)

predictSVMradial <- predict(fitsvmRadial, testing)
data <- data.frame(predictSVMradial, type=testing$class)
data

##    predictSVMradial              type
## 1   acute infection           healthy
## 2  1 month infected           healthy
## 3   acute infection           healthy
## 4  1 month infected           healthy
## 5   acute infection   acute infection
## 6   acute infection   acute infection
## 7   acute infection   acute infection
## 8   acute infection   acute infection
## 9   acute infection   acute infection
## 10  acute infection  1 month infected
## 11 1 month infected  1 month infected
## 12  acute infection  1 month infected
## 13  acute infection  1 month infected
## 14  acute infection  1 month infected
## 15 1 month infected 6 months infected
## 16          healthy 6 months infected

confusionMatrix(predictSVMradial, testing$class)

## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          1 month infected 6 months infected acute infection healthy
##   1 month infected                 1                 1               0       2
##   6 months infected                0                 0               0       0
##   acute infection                  4                 0               5       2
##   healthy                          0                 1               0       0
## 
## Overall Statistics
##                                          
##                Accuracy : 0.375          
##                  95% CI : (0.152, 0.6457)
##     No Information Rate : 0.3125         
##     P-Value [Acc > NIR] : 0.382          
##                                          
##                   Kappa : 0.096          
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 1 month infected Class: 6 months infected
## Sensitivity                           0.2000                    0.000
## Specificity                           0.7273                    1.000
## Pos Pred Value                        0.2500                      NaN
## Neg Pred Value                        0.6667                    0.875
## Prevalence                            0.3125                    0.125
## Detection Rate                        0.0625                    0.000
## Detection Prevalence                  0.2500                    0.000
## Balanced Accuracy                     0.4636                    0.500
##                      Class: acute infection Class: healthy
## Sensitivity                          1.0000         0.0000
## Specificity                          0.4545         0.9167
## Pos Pred Value                       0.4545         0.0000
## Neg Pred Value                       1.0000         0.7333
## Prevalence                           0.3125         0.2500
## Detection Rate                       0.3125         0.0000
## Detection Prevalence                 0.6875         0.0625
## Balanced Accuracy                    0.7273         0.4583

fitsvmLinear <- train(class~., data=training, method='svmLinear', metric=metric, trControl=control)
predictSVMlinear <- predict(fitsvmLinear,testing)
data <- data.frame(predictSVMlinear, type=testing$class)
data

##     predictSVMlinear              type
## 1    acute infection           healthy
## 2    acute infection           healthy
## 3            healthy           healthy
## 4            healthy           healthy
## 5  6 months infected   acute infection
## 6            healthy   acute infection
## 7   1 month infected   acute infection
## 8    acute infection   acute infection
## 9   1 month infected   acute infection
## 10           healthy  1 month infected
## 11   acute infection  1 month infected
## 12   acute infection  1 month infected
## 13   acute infection  1 month infected
## 14   acute infection  1 month infected
## 15           healthy 6 months infected
## 16 6 months infected 6 months infected

confusionMatrix(predictSVMlinear,testing$class)

## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          1 month infected 6 months infected acute infection healthy
##   1 month infected                 0                 0               2       0
##   6 months infected                0                 1               1       0
##   acute infection                  4                 0               1       2
##   healthy                          1                 1               1       2
## 
## Overall Statistics
##                                           
##                Accuracy : 0.25            
##                  95% CI : (0.0727, 0.5238)
##     No Information Rate : 0.3125          
##     P-Value [Acc > NIR] : 0.7866          
##                                           
##                   Kappa : -0.0267         
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 month infected Class: 6 months infected
## Sensitivity                           0.0000                   0.5000
## Specificity                           0.8182                   0.9286
## Pos Pred Value                        0.0000                   0.5000
## Neg Pred Value                        0.6429                   0.9286
## Prevalence                            0.3125                   0.1250
## Detection Rate                        0.0000                   0.0625
## Detection Prevalence                  0.1250                   0.1250
## Balanced Accuracy                     0.4091                   0.7143
##                      Class: acute infection Class: healthy
## Sensitivity                          0.2000         0.5000
## Specificity                          0.4545         0.7500
## Pos Pred Value                       0.1429         0.4000
## Neg Pred Value                       0.5556         0.8182
## Prevalence                           0.3125         0.2500
## Detection Rate                       0.0625         0.1250
## Detection Prevalence                 0.4375         0.3125
## Balanced Accuracy                    0.3273         0.6250

results <- resamples(list(lda=fitlda, rpart=fitcart, knn=fitknn, rf=fitRF, svmR=fitsvmRadial, svmL=fitsvmLinear))
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: lda, rpart, knn, rf, svmR, svmL 
## Number of resamples: 10 
## 
## Accuracy 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## lda   0.1428571 0.2857143 0.3303571 0.3630952 0.4151786 0.6666667    0
## rpart 0.0000000 0.4464286 0.5000000 0.5166667 0.6428571 0.8571429    0
## knn   0.1250000 0.2857143 0.4285714 0.4202381 0.5535714 0.6666667    0
## rf    0.2857143 0.4464286 0.5357143 0.5452381 0.5714286 0.8571429    0
## svmR  0.1428571 0.3080357 0.4017857 0.4369048 0.5714286 0.6666667    0
## svmL  0.1250000 0.3333333 0.3541667 0.4142857 0.5535714 0.7142857    0
## 
## Kappa 
##              Min.     1st Qu.     Median      Mean   3rd Qu.      Max. NA's
## lda   -0.13513514 0.006944444 0.05934343 0.1154922 0.2201087 0.5000000    0
## rpart -0.32432432 0.253289474 0.29661836 0.3223737 0.4791667 0.8000000    0
## knn   -0.19148936 0.040570175 0.21111111 0.1976820 0.3625000 0.5000000    0
## rf    -0.02941176 0.204575163 0.35217391 0.3517747 0.4166667 0.7941176    0
## svmR  -0.23529412 0.022727273 0.13368984 0.1939661 0.4125000 0.5000000    0
## svmL  -0.19148936 0.041904762 0.07692308 0.1798827 0.3525145 0.6216216    0

dotplot(results)

importanceRcart <- varImp(fitcart)

importanceRcart

## rpart variable importance
## 
##   only 20 most important variables shown (out of 33)
## 
##          Overall
## PEX26     100.00
## OTOS       76.92
## RNF168     75.37
## CENPF      57.82
## ESYT1      34.65
## ENO1       34.55
## DLG3       27.20
## FAM162A     0.00
## CTXN3       0.00
## F2          0.00
## PSMF1       0.00
## TMEM194A    0.00
## CLEC2L      0.00
## POU3F2      0.00
## HECW1       0.00
## SLC1A1      0.00
## KHDRBS3     0.00
## PDZRN3      0.00
## ISG20       0.00
## KCNJ16      0.00

plot(importanceRcart)

importanceLDA <- varImp(fitlda)
importanceLDA

## ROC curve variable importance
## 
##   variables are sorted by maximum importance across the classes
##   only 20 most important variables shown (out of 33)
## 
##         X1.month.infected X6.months.infected acute.infection healthy
## RNF168              96.02             96.015          100.00  25.996
## ISG20               88.74             88.741           91.65  38.765
## ESYT1               85.83             85.832           85.83  33.080
## CENPF               62.43             14.548           80.52  62.429
## DLG3                65.46             65.465           65.46  51.773
## PEX26               43.64             64.780           43.64  64.780
## OTOS                29.54             63.411           17.46  63.411
## ENO1                33.46             44.242           62.43  44.242
## PSMF1               45.10             45.098           62.43  31.235
## KHDRBS3             45.10             45.098           51.30   9.328
## CABP1               50.92             50.917           50.92  45.611
## HECW1               21.82             50.404           21.82  50.404
## GATC                36.37             36.369           49.91  14.120
## NUDT18              18.91             30.550           47.12  30.550
## F2                  45.10             45.098           45.10  20.281
## CTXN3               26.19             26.186           44.34  15.370
## POU4F2              41.68             18.912           18.91  41.682
## CLEC2L              27.64             40.819           27.64  40.819
## FRS3                33.46             33.460           38.77   5.756
## IGFALS              37.13              1.797           20.68  37.128

plot(importanceLDA)

importanceKNN <- varImp(fitknn)
importanceKNN

## ROC curve variable importance
## 
##   variables are sorted by maximum importance across the classes
##   only 20 most important variables shown (out of 33)
## 
##         X1.month.infected X6.months.infected acute.infection healthy
## RNF168              96.02             96.015          100.00  25.996
## ISG20               88.74             88.741           91.65  38.765
## ESYT1               85.83             85.832           85.83  33.080
## CENPF               62.43             14.548           80.52  62.429
## DLG3                65.46             65.465           65.46  51.773
## PEX26               43.64             64.780           43.64  64.780
## OTOS                29.54             63.411           17.46  63.411
## ENO1                33.46             44.242           62.43  44.242
## PSMF1               45.10             45.098           62.43  31.235
## KHDRBS3             45.10             45.098           51.30   9.328
## CABP1               50.92             50.917           50.92  45.611
## HECW1               21.82             50.404           21.82  50.404
## GATC                36.37             36.369           49.91  14.120
## NUDT18              18.91             30.550           47.12  30.550
## F2                  45.10             45.098           45.10  20.281
## CTXN3               26.19             26.186           44.34  15.370
## POU4F2              41.68             18.912           18.91  41.682
## CLEC2L              27.64             40.819           27.64  40.819
## FRS3                33.46             33.460           38.77   5.756
## IGFALS              37.13              1.797           20.68  37.128

importanceRF <- varImp(fitRF)
importanceRF

## rf variable importance
## 
##   only 20 most important variables shown (out of 33)
## 
##         Overall
## CENPF    100.00
## PEX26     66.77
## RNF168    53.37
## OTOS      42.75
## DLG3      37.44
## ESYT1     36.89
## ENO1      29.15
## ISG20     27.57
## POU4F2    22.90
## CYP7B1    20.97
## SLC1A1    20.69
## CABP1     14.96
## IGFALS    14.76
## KCNJ16    13.46
## FAM162A   13.41
## HECW1     13.25
## PSMF1     11.79
## POU3F2    11.42
## NUDT18    11.28
## GATC       8.96

importanceSVM_radial <- varImp(fitsvmRadial)
importanceSVM_radial

## ROC curve variable importance
## 
##   variables are sorted by maximum importance across the classes
##   only 20 most important variables shown (out of 33)
## 
##         X1.month.infected X6.months.infected acute.infection healthy
## RNF168              96.02             96.015          100.00  25.996
## ISG20               88.74             88.741           91.65  38.765
## ESYT1               85.83             85.832           85.83  33.080
## CENPF               62.43             14.548           80.52  62.429
## DLG3                65.46             65.465           65.46  51.773
## PEX26               43.64             64.780           43.64  64.780
## OTOS                29.54             63.411           17.46  63.411
## ENO1                33.46             44.242           62.43  44.242
## PSMF1               45.10             45.098           62.43  31.235
## KHDRBS3             45.10             45.098           51.30   9.328
## CABP1               50.92             50.917           50.92  45.611
## HECW1               21.82             50.404           21.82  50.404
## GATC                36.37             36.369           49.91  14.120
## NUDT18              18.91             30.550           47.12  30.550
## F2                  45.10             45.098           45.10  20.281
## CTXN3               26.19             26.186           44.34  15.370
## POU4F2              41.68             18.912           18.91  41.682
## CLEC2L              27.64             40.819           27.64  40.819
## FRS3                33.46             33.460           38.77   5.756
## IGFALS              37.13              1.797           20.68  37.128

plot(importanceSVM_radial)

importanceSVM_linear <- varImp(fitsvmLinear)
importanceSVM_linear

## ROC curve variable importance
## 
##   variables are sorted by maximum importance across the classes
##   only 20 most important variables shown (out of 33)
## 
##         X1.month.infected X6.months.infected acute.infection healthy
## RNF168              96.02             96.015          100.00  25.996
## ISG20               88.74             88.741           91.65  38.765
## ESYT1               85.83             85.832           85.83  33.080
## CENPF               62.43             14.548           80.52  62.429
## DLG3                65.46             65.465           65.46  51.773
## PEX26               43.64             64.780           43.64  64.780
## OTOS                29.54             63.411           17.46  63.411
## PSMF1               45.10             45.098           62.43  31.235
## ENO1                33.46             44.242           62.43  44.242
## KHDRBS3             45.10             45.098           51.30   9.328
## CABP1               50.92             50.917           50.92  45.611
## HECW1               21.82             50.404           21.82  50.404
## GATC                36.37             36.369           49.91  14.120
## NUDT18              18.91             30.550           47.12  30.550
## F2                  45.10             45.098           45.10  20.281
## CTXN3               26.19             26.186           44.34  15.370
## POU4F2              41.68             18.912           18.91  41.682
## CLEC2L              27.64             40.819           27.64  40.819
## FRS3                33.46             33.460           38.77   5.756
## IGFALS              37.13              1.797           20.68  37.128

In this file going back, the rcart algorithm scored the best accuracy of 42% using our set.seed value of 4567, 10 folds of cross validation, and Accuracy for the metric. The top 5 genes for the rcart are:

importanceRcart

## rpart variable importance
## 
##   only 20 most important variables shown (out of 33)
## 
##           Overall
## PEX26      100.00
## OTOS        76.92
## RNF168      75.37
## CENPF       57.82
## ESYT1       34.65
## ENO1        34.55
## DLG3        27.20
## SLC1A1       0.00
## HECW1        0.00
## LOC400657    0.00
## CTXN3        0.00
## GATC         0.00
## POU4F2       0.00
## IGFALS       0.00
## HPGD         0.00
## FAM162A      0.00
## CYP7B1       0.00
## MAP2K7       0.00
## KCNJ16       0.00
## RGPD3        0.00

CENFP, ENO1, ESYT1, RNF168, and CYP7B1 are the top 5 genes for classifying our target classes from the model with the best accuracy. Lets read in the gene summaries and select those genes from that data. This was manually imputed from Genecards.org. Get the file made by me by clicking gene summaries

summary33 <- read.csv('33geneschrsummaries.csv', header=T, na.strings=c('',' ', 'na','NA'))

topGenes5 <- c("CENPF","ENO1","ESYT1","RNF168","CYP7B1")

best5 <- summary33[summary33$Gene %in% topGenes5,]
best5$Gene

## [1] "CENPF"  "CYP7B1" "ENO1"   "ESYT1"  "RNF168"

best5

##      Gene Chromosomal.Location
## 2   CENPF          chr1q32-q41
## 5  CYP7B1            chr8q21.3
## 7    ENO1      chr1p36.3-p36.2
## 8   ESYT1           chr12q13.2
## 31 RNF168              chr3q29
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             GeneCardsGeneSummary
## 2                                                                                                                                                                                                                                                                CENPF (Centromere Protein F) is a Protein Coding gene. Diseases associated with CENPF include Stromme Syndrome and Corpus Callosum, Agenesis Of. Among its related pathways are EML4 and NUDC in mitotic spindle formation and Separation of Sister Chromatids. Gene Ontology (GO) annotations related to this gene include protein homodimerization activity and transcription factor binding.
## 5  GeneCards Symbol: CYP7B1 2 \nCytochrome P450 Family 7 Subfamily B Member 1 2 3 5\nCYP7B1 (Cytochrome P450 Family 7 Subfamily B Member 1) is a Protein Coding gene. Diseases associated with CYP7B1 include Spastic Paraplegia 5A, Autosomal Recessive and Bile Acid Synthesis Defect, Congenital, 3. Among its related pathways are Metapathway biotransformation Phase I and II and Bile acid and bile salt metabolism. Gene Ontology (GO) annotations related to this gene include iron ion binding and oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen. An important paralog of this gene is CYP7A1.
## 7                                                                                                                                                                                                                          GeneCards Symbol: ENO1 2 \nEnolase 1 2 3 4 5\n\nENO1 (Enolase 1) is a Protein Coding gene. Diseases associated with ENO1 include Glycogen Storage Disease Xiii and Lung Cancer Susceptibility 3. Among its related pathways are glycolysis (BioCyc) and Gluconeogenesis. Gene Ontology (GO) annotations related to this gene include RNA binding and transcription corepressor activity. An important paralog of this gene is ENO2.\n
## 8                                                                                                                  Aliases for ESYT1 Gene\nGeneCards Symbol: ESYT1 2 \nExtended Synaptotagmin 1 2 3 5\nMBC2 2 3 4 5\nKIAA0747 2 4 5\nFAM62A 3 4 5      \nESYT1 (Extended Synaptotagmin 1) is a Protein Coding gene. Diseases associated with ESYT1 include Developmental And Epileptic Encephalopathy 26 and Stormorken Syndrome. Among its related pathways are Signaling by Rho GTPases and RAC2 GTPase cycle. Gene Ontology (GO) annotations related to this gene include lipid binding and phospholipid binding. An important paralog of this gene is ESYT2.
## 31                                                                                                                                        GeneCards Symbol: RNF168 2 \nRing Finger Protein 168 RNF168 (Ring Finger Protein 168) is a Protein Coding gene. Diseases associated with RNF168 include Riddle Syndrome and Digeorge Syndrome. Among its related pathways are HDR through Homologous Recombination (HRR) or Single Strand Annealing (SSA) and DNA Double Strand Break Response. Gene Ontology (GO) annotations related to this gene include chromatin binding and ubiquitin-protein transferase activity. An important paralog of this gene is RNF169.
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      NCBI_GeneSummary
## 2                     This gene encodes a protein that associates with the centromere-kinetochore complex. The protein is a component of the nuclear matrix during the G2 phase of interphase. In late G2 the protein associates with the kinetochore and maintains this association through early anaphase. It localizes to the spindle midzone and the intracellular bridge in late anaphase and telophase, respectively, and is thought to be subsequently degraded. The localization of this protein suggests that it may play a role in chromosome segregation during mitotis. It is thought to form either a homodimer or heterodimer. Autoantibodies against this protein have been found in patients with cancer or graft versus host disease. [provided by RefSeq, Jul 2008]
## 5  This gene encodes a member of the cytochrome P450 superfamily of enzymes. The cytochrome P450 proteins are monooxygenases which catalyze many reactions involved in drug metabolism and synthesis of cholesterol, steroids and other lipids. This endoplasmic reticulum membrane protein catalyzes the first reaction in the cholesterol catabolic pathway of extrahepatic tissues, which converts cholesterol to bile acids. This enzyme likely plays a minor role in total bile acid synthesis, but may also be involved in the development of atherosclerosis, neurosteroid metabolism and sex hormone synthesis. Mutations in this gene have been associated with hereditary spastic paraplegia (SPG5 or HSP), an autosomal recessive disorder. [provided by RefSeq, Apr 2016]
## 7                                                                                               This gene encodes alpha-enolase, one of three enolase isoenzymes found in mammals. Each isoenzyme is a homodimer composed of 2 alpha, 2 gamma, or 2 beta subunits, and functions as a glycolytic enzyme. Alpha-enolase in addition, functions as a structural lens protein (tau-crystallin) in the monomeric form. Alternative splicing of this gene results in a shorter isoform that has been shown to bind to the c-myc promoter and function as a tumor suppressor. Several pseudogenes have been identified, including one on the long arm of chromosome 1. Alpha-enolase has also been identified as an autoantigen in Hashimoto encephalopathy. [provided by RefSeq, Jan 2011]
## 8                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Enables identical protein binding activity and phospholipid transfer activity. Involved in intermembrane lipid transfer. Located in endoplasmic reticulum membrane. [provided by Alliance of Genome Resources, Jul 2025]
## 31                                                                                                                                                                                                                                                                                                                                                                                                           This gene encodes an E3 ubiquitin ligase protein that contains a RING finger, a motif present in a variety of functionally distinct proteins and known to be involved in protein-DNA and protein-protein interactions. The protein is involved in DNA double-strand break (DSB) repair. Mutations in this gene result in Riddle syndrome. [provided by RefSeq, Sep 2011]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            UniProtKB_Swiss.Prot_GeneSummary
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           Required for kinetochore function and chromosome segregation in mitosis. Required for kinetochore localization of dynein, LIS1, NDE1 and NDEL1. Regulates recycling of the plasma membrane by acting as a link between recycling vesicles and the microtubule network though its association with STX4 and SNAP25. Acts as a potential inhibitor of pocket protein-mediated cellular processes during development by regulating the activity of RB proteins during cell division and proliferation. May play a regulatory or permissive role in the normal embryonic cardiomyocyte cell cycle and in promoting continued mitosis in transformed, abnormally dividing neonatal cardiomyocytes. Interaction with RB directs embryonic stem cells toward a cardiac lineage. Involved in the regulation of DNA synthesis and hence cell cycle progression, via its C-terminus. Has a potential role regulating skeletal myogenesis and in cell differentiation in embryogenesis. Involved in dendritic cell regulation of T-cell immunity against chlamydia. ( CENPF_HUMAN,P49454 )
## 5                                    A cytochrome P450 monooxygenase involved in the metabolism of endogenous oxysterols and steroid hormones, including neurosteroids (PubMed:10588945, 24491228). Mechanistically, uses molecular oxygen inserting one oxygen atom into a substrate, and reducing the second into a water molecule, with two electrons provided by NADPH via cytochrome P450 reductase (CPR; NADPH-ferrihemoprotein reductase) (PubMed:10588945, 24491228). Catalyzes the hydroxylation of carbon hydrogen bonds of steroids with a preference for 7-alpha position (PubMed:10588945, 24491228). Usually metabolizes steroids carrying a hydroxy group at position 3, functioning as a 3-hydroxy steroid 7-alpha hydroxylase (PubMed:24491228). Hydroxylates oxysterols, including 25-hydroxycholesterol and (25R)-cholest-5-ene-3beta,26-diol toward 7-alpha hydroxy derivatives, which may be transported to the liver and converted to bile acids (PubMed:10588945, 9802883). Via its product 7-alpha,25-dihydroxycholesterol, a ligand for the chemotactic G protein-coupled receptor GPR183/EBI2, regulates B cell migration in germinal centers of lymphoid organs, thus guiding efficient maturation of plasma B cells and overall antigen-specific humoral immune response (By similarity). 7-alpha hydroxylates neurosteroids, including 3beta-hydroxyandrost-5-en-17-one (dehydroepiandrosterone) and pregnenolone, both involved in hippocampus-associated memory and learning (PubMed:24491228). Metabolizes androstanoids toward 6- or 7-alpha hydroxy derivatives (PubMed:24491228). ( CP7B1_HUMAN,O75881 )
## 7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Glycolytic enzyme the catalyzes the conversion of 2-phosphoglycerate to phosphoenolpyruvate (PubMed:1369209, 29775581). In addition to glycolysis, involved in various processes such as growth control, hypoxia tolerance and allergic responses (PubMed:10802057, 12666133, 2005901, 29775581). May also function in the intravascular and pericellular fibrinolytic system due to its ability to serve as a receptor and activator of plasminogen on the cell surface of several cell-types such as leukocytes and neurons (PubMed:12666133). Stimulates immunoglobulin production (PubMed:1369209). ( ENOA_HUMAN,P06733 )\n\n[Isoform MBP-1]: Binds to the myc promoter and acts as a transcriptional repressor. May be a tumor suppressor. ( ENOA_HUMAN,P06733 )
## 8                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Binds calcium (via the C2 domains) and translocates to sites of contact between the endoplasmic reticulum and the cell membrane in response to increased cytosolic calcium levels (PubMed:23791178, 24183667). Helps tether the endoplasmic reticulum to the cell membrane and promotes the formation of appositions between the endoplasmic reticulum and the cell membrane (PubMed:24183667). Acts as an inhibitor of ADGRD1 G-protein-coupled receptor activity in absence of cytosolic calcium (PubMed:38758649). Binds glycerophospholipids in a barrel-like domain and may play a role in cellular lipid transport (By similarity). ( ESYT1_HUMAN,Q9BSJ8 )
## 31 E3 ubiquitin-protein ligase required for accumulation of repair proteins to sites of DNA damage. Acts with UBE2N/UBC13 to amplify the RNF8-dependent histone ubiquitination. Recruited to sites of DNA damage at double-strand breaks (DSBs) by binding to ubiquitinated histone H2A and H2AX and amplifies the RNF8-dependent H2A ubiquitination, promoting the formation of 'Lys-63'-linked ubiquitin conjugates. This leads to concentrate ubiquitinated histones H2A and H2AX at DNA lesions to the threshold required for recruitment of TP53BP1 and BRCA1. Also recruited at DNA interstrand cross-links (ICLs) sites and promotes accumulation of 'Lys-63'-linked ubiquitination of histones H2A and H2AX, leading to recruitment of FAAP20/C1orf86 and Fanconi anemia (FA) complex, followed by interstrand cross-link repair. H2A ubiquitination also mediates the ATM-dependent transcriptional silencing at regions flanking DSBs in cis, a mechanism to avoid collision between transcription and repair intermediates. Also involved in class switch recombination in immune system, via its role in regulation of DSBs repair. Following DNA damage, promotes the ubiquitination and degradation of JMJD2A/KDM4A in collaboration with RNF8, leading to unmask H4K20me2 mark and promote the recruitment of TP53BP1 at DNA damage sites. Not able to initiate 'Lys-63'-linked ubiquitination in vitro; possibly due to partial occlusion of the UBE2N/UBC13-binding region. Catalyzes monoubiquitination of 'Lys-13' and 'Lys-15' of nucleosomal histone H2A (H2AK13Ub and H2AK15Ub, respectively). ( RN168_HUMAN,Q8IYW5 )
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          overallGeneSummary
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           involved in cell replication of mitosis early phases and when destroyed by autonomic responses the host has cancer, gene located on long arm q of chromosome 1.
## 5                                                                                                                                                                                                                                                                                                                                                              encodes a protein that is involved in network of processes with cholesterol and drug digestion as well as creating bile acids from cholesterol to digest food macromolecules like fats and proteins. Found in plasma membrane of endoplasmic reticulum of liver. Gene located on long arm q of chromosome 8.
## 7                                                                                                                                                                                                                                                                                                                                                                                                                                                     involved in breaking down fat stores for energy in glycolysis and tumor supressing and defective or involved in thyroid disease of genetic Hashimoto's syndrome. This gene is located on short arm p of chromosome 1.
## 8  relates to lipid binding and phospholipid binding and lipid membrane transport in endoplasmic reticulum membrane and cytosol membrane, takes up excess calcium in cytosol of cell and transports it to endoplasmic reticulum membrane, inhibits activity of a coupled receptor called the ADGRD1 G-protein-coupled receptor in absence of calcium in cytosol. AI search returned that ADGRD1 G is a protein found upregulated in trascription when in states of hypoxia and related to glioblastoma growth when upregulated, also it is responsible for many physiological functions in the body and muscle growth. This gene is located on long arm q of chromosome 12.
## 31                                                                                                                                                                                                                                                                                                                                      This gene is a DNA repairing gene that seeks damaged DNA and binds the chromatin at the histones of wrapped DNA to mend the damaged DNA and replace nucleotide sequences or if not repairable then to communicate to TP53 and BRC1 proteins of cell death needed for that cellular DNA. Gene located on long arm q of chromosome 3.

The top 5 genes have summaries that say they are involved in the cellular processes for replication, bile formation from cholesterol and drug digestion, tumor suppressing and glycolysis, cell membrane lipid binding, and repairing DNA damage. Seems like the genes selected are being most expressed or under expressed when infected with Lyme disease at various stages up to six months post infection. Are these genes being more up regulated or down regulated? Lets see from the data on overall change from acute infection vs chronic infection in fold changes to healthy sample means. The file for fold change values in sample comparisons can be found by clicking here for data

foldchanges <- read.csv('top33genes100varsChrSumms.csv', header=T, na.strings=c('',' ','na','NA'))
colnames(foldchanges)

##   [1] "Gene"                                
##   [2] "healthyControl_1"                    
##   [3] "healthyControl_2"                    
##   [4] "healthyControl_3"                    
##   [5] "healthyControl_4"                    
##   [6] "healthyControl_5"                    
##   [7] "healthyControl_6"                    
##   [8] "healthyControl_7"                    
##   [9] "healthyControl_8"                    
##  [10] "healthyControl_9"                    
##  [11] "healthyControl_10"                   
##  [12] "healthyControl_11"                   
##  [13] "healthyControl_12"                   
##  [14] "healthyControl_13"                   
##  [15] "healthyControl_14"                   
##  [16] "healthyControl_15"                   
##  [17] "healthyControl_16"                   
##  [18] "healthyControl_17"                   
##  [19] "healthyControl_18"                   
##  [20] "healthyControl_19"                   
##  [21] "healthyControl_20"                   
##  [22] "healthyControl_21"                   
##  [23] "acuteLymeDisease_1"                  
##  [24] "acuteLymeDisease_2"                  
##  [25] "acuteLymeDisease_3"                  
##  [26] "acuteLymeDisease_4"                  
##  [27] "acuteLymeDisease_5"                  
##  [28] "acuteLymeDisease_6"                  
##  [29] "acuteLymeDisease_7"                  
##  [30] "acuteLymeDisease_8"                  
##  [31] "acuteLymeDisease_9"                  
##  [32] "acuteLymeDisease_10"                 
##  [33] "acuteLymeDisease_11"                 
##  [34] "acuteLymeDisease_12"                 
##  [35] "acuteLymeDisease_13"                 
##  [36] "acuteLymeDisease_14"                 
##  [37] "acuteLymeDisease_15"                 
##  [38] "acuteLymeDisease_16"                 
##  [39] "acuteLymeDisease_17"                 
##  [40] "acuteLymeDisease_18"                 
##  [41] "acuteLymeDisease_19"                 
##  [42] "acuteLymeDisease_20"                 
##  [43] "acuteLymeDisease_21"                 
##  [44] "acuteLymeDisease_22"                 
##  [45] "acuteLymeDisease_23"                 
##  [46] "acuteLymeDisease_24"                 
##  [47] "acuteLymeDisease_25"                 
##  [48] "acuteLymeDisease_26"                 
##  [49] "acuteLymeDisease_27"                 
##  [50] "acuteLymeDisease_28"                 
##  [51] "Antibodies_1month_1"                 
##  [52] "Antibodies_1month_2"                 
##  [53] "Antibodies_1month_3"                 
##  [54] "Antibodies_1month_4"                 
##  [55] "Antibodies_1month_5"                 
##  [56] "Antibodies_1month_6"                 
##  [57] "Antibodies_1month_7"                 
##  [58] "Antibodies_1month_8"                 
##  [59] "Antibodies_1month_9"                 
##  [60] "Antibodies_1month_10"                
##  [61] "Antibodies_1month_11"                
##  [62] "Antibodies_1month_12"                
##  [63] "Antibodies_1month_13"                
##  [64] "Antibodies_1month_14"                
##  [65] "Antibodies_1month_15"                
##  [66] "Antibodies_1month_16"                
##  [67] "Antibodies_1month_17"                
##  [68] "Antibodies_1month_18"                
##  [69] "Antibodies_1month_19"                
##  [70] "Antibodies_1month_20"                
##  [71] "Antibodies_1month_21"                
##  [72] "Antibodies_1month_22"                
##  [73] "Antibodies_1month_23"                
##  [74] "Antibodies_1month_24"                
##  [75] "Antibodies_1month_25"                
##  [76] "Antibodies_1month_26"                
##  [77] "Antibodies_1month_27"                
##  [78] "Antibodies_6months_1"                
##  [79] "Antibodies_6months_2"                
##  [80] "Antibodies_6months_3"                
##  [81] "Antibodies_6months_4"                
##  [82] "Antibodies_6months_5"                
##  [83] "Antibodies_6months_6"                
##  [84] "Antibodies_6months_7"                
##  [85] "Antibodies_6months_8"                
##  [86] "Antibodies_6months_9"                
##  [87] "Antibodies_6months_10"               
##  [88] "healthy_Mean"                        
##  [89] "acuteLymeDisease_Mean"               
##  [90] "antibodies_1month_Mean"              
##  [91] "antibodies_6month_Mean"              
##  [92] "acuteHealthy_foldChange"             
##  [93] "antibodies_1month_healthy_foldChange"
##  [94] "antibodies_6month_healthy_foldchange"
##  [95] "acute_vs_6month_foldchange"          
##  [96] "Chromosomal.Location"                
##  [97] "GeneCardsGeneSummary"                
##  [98] "NCBI_GeneSummary"                    
##  [99] "UniProtKB_Swiss.Prot_GeneSummary"    
## [100] "overallGeneSummary"

changes <- foldchanges[foldchanges$Gene %in% topGenes5,]
changes6monthsVsHealthy <- changes[,c(1,94)]
changes6monthsVsHealthy

##      Gene antibodies_6month_healthy_foldchange
## 2   CENPF                            -6896.970
## 5  CYP7B1                             1067.072
## 7    ENO1                             1646.317
## 8   ESYT1                             1952.210
## 31 RNF168                             2964.558

It looks like all were upregulated except one gene, CENPF, that was the top important feature in classification of the top algorithm’s basic settings for rcart in the caret package. There was a foldchange increase of 1-2 thousand times healthy mean values for the genes CYP7B1, ENO1, ESYT1, and RNF168. The only gene downregulated was for mitotic activity in cell replication down regulated to almost 7,000 times the regular healthy mean values. Fat lipid metabolism and digestion to make bile and digest more fats in the liver as well as repairing DNA in the nucleus of the cell are the other top four genes that are upregulated after infection with Lyme disease.

This model scored 42% Accuracy in prediction of 4 classes or states of infection in Lyme disease. It can be made better by increasing accuracy by tuning the model, and so can the other models. As when using a different seed in a previous run the random forest, another ensemble model, scored the best at 62% and the others were closer to 50%. It depends on the randomness generated with the set.seed function as well as parameters tuned, like more cross validation folds, or a different metric other than Accuracy. Looking at the class metric statistis for sensitivity and specificity, lets pull back up the confusion matrix for the rcart model.

rcartDF <- data.frame(predictCart, type=testing$class)
rcartDF

##         predictCart              type
## 1   acute infection           healthy
## 2  1 month infected           healthy
## 3  1 month infected           healthy
## 4           healthy           healthy
## 5           healthy   acute infection
## 6           healthy   acute infection
## 7   acute infection   acute infection
## 8   acute infection   acute infection
## 9   acute infection   acute infection
## 10          healthy  1 month infected
## 11 1 month infected  1 month infected
## 12  acute infection  1 month infected
## 13          healthy  1 month infected
## 14 1 month infected  1 month infected
## 15 1 month infected 6 months infected
## 16          healthy 6 months infected

We see from data above that none of the samples of 6 months infection are actually in the testing set. Which is strange because the testing set count of each class said there were some six month infected samples in it. Lets double check.

testing %>% group_by(class) %>% count(class)

## # A tibble: 4 × 2
## # Groups:   class [4]
##   class                 n
##   <fct>             <int>
## 1 1 month infected      5
## 2 6 months infected     2
## 3 acute infection       5
## 4 healthy               4

Well, thats good to know it was my error in not scrolling to 2nd page in window. So there are 2 samples of the 6 months infected class. There are a few samples in each class from our imbalanced data.

rcartDF %>% group_by(predictCart) %>% count(predictCart)

## # A tibble: 3 × 2
## # Groups:   predictCart [3]
##   predictCart          n
##   <fct>            <int>
## 1 1 month infected     5
## 2 acute infection      5
## 3 healthy              6

The rcart algorithm, however, only predicted two classes, and left out the healthy and 6 month infected samples in favor of going with the majority class of 1 month infected. Lets just review how many samples are in each class from original data of 86 samples.

genes33 %>% group_by(class) %>% count(class)

## # A tibble: 4 × 2
## # Groups:   class [4]
##   class                 n
##   <fct>             <int>
## 1 1 month infected     27
## 2 6 months infected    10
## 3 acute infection      28
## 4 healthy              21

Actually, there are more acute infection samples by 28 compared to the 27 samples of 1 month infected. There are only 10 samples of the six months infected and 21 of the healthy samples in our data.

confusionMatrix(predictCart, testing$class)

## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          1 month infected 6 months infected acute infection healthy
##   1 month infected                 2                 1               0       2
##   6 months infected                0                 0               0       0
##   acute infection                  1                 0               3       1
##   healthy                          2                 1               2       1
## 
## Overall Statistics
##                                          
##                Accuracy : 0.375          
##                  95% CI : (0.152, 0.6457)
##     No Information Rate : 0.3125         
##     P-Value [Acc > NIR] : 0.382          
##                                          
##                   Kappa : 0.1209         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 1 month infected Class: 6 months infected
## Sensitivity                           0.4000                    0.000
## Specificity                           0.7273                    1.000
## Pos Pred Value                        0.4000                      NaN
## Neg Pred Value                        0.7273                    0.875
## Prevalence                            0.3125                    0.125
## Detection Rate                        0.1250                    0.000
## Detection Prevalence                  0.3125                    0.000
## Balanced Accuracy                     0.5636                    0.500
##                      Class: acute infection Class: healthy
## Sensitivity                          0.6000         0.2500
## Specificity                          0.8182         0.5833
## Pos Pred Value                       0.6000         0.1667
## Neg Pred Value                       0.8182         0.7000
## Prevalence                           0.3125         0.2500
## Detection Rate                       0.1875         0.0625
## Detection Prevalence                 0.3125         0.3750
## Balanced Accuracy                    0.7091         0.4167

By looking at the summary statistics, the specificity to rule in a class based on these genes (in rcart) was 100% for healthy and also for 6 months infection as those were not ruled in at all for any of the 2 samples of 6 months infected or the 4 samples of healthy status. While the sensitivity or snout to rule out in 1 month infected is 60% and 80% for acute infection. This wasn’t the best model overall as it just uses the odds in numbers in the data on the surface, but there are decisions and boundaries that are made in analyzing the data that the model builds upon to improve and get the best prediction.

For now, we can just say, as we do see these same genes for the most part in the other important variable features of the other models selected. So in first 6 months of Lyme disease assume cell replication slows while the DNA repair is enhanced and metabolic activity to have more lipids in the body. Which is what is used in the cell membrane of our cells in all systems of our bodies. Fat transport to make hormones, steroids, and cell membranes, bile production to digest fats and lipids. We are slowing down cell replication while speeding up DNA repair and also sending more byproducts into the body to build cell outer structure, use endocrine and reproductive hormones as well as steroids. People on chemotherapy are given steroids like prednisone. And chemotherapy slows cell replication. It looks like overall treatment of cancer is the same mechanism the body does in self repair in healing as shown in top genes being expressed in healthy, acute infection, one month post infection, and six months post infection or chronic infection states.

Thanks so much, more can be done to tune this model schematic, but this helps expand on revelations of this rpub document and research using R.

33 genes 4 classes LymeDisease with 6 Models

Janis Harris

2025-12-28