This is an extension of the earlier publication on using randomForest on the Lyme disease gene expression data, the wide data frame too large to fit into Google Sheets. In this project we test the base R function for Principal Component Analysis (PCA), prcomp().

We are trying out the randomForest package without the caret package to see how it does on the Lyme disease data we used earlier, which had an imbalanced number of classes. You can read about the randomForest package at this link.

library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

*** January 4, 2026: PCA with the base R function prcomp() ***

PCA is a way to take that fat or wide 86 x 19,527 data frame, too large to upload to Google Sheets, and summarize the genes through their variances and covariances. Conceptually, it takes the cloud of sample points across all the genes and finds a best-fit line through that space, chosen so the perpendicular error is smallest; that line is the first principal component and explains the most variation in the data. It then keeps finding more principal components, each orthogonal to the ones before, until all the variation in the data is explained. Often the top 5 principal components explain a good share of the information, something like 40-50%, but that depends on the data. We will use the very wide dataset by loading it into R if you ran this code earlier and came back to it, or by just calling our data frame if you haven't cleared your environment of clutter. If you haven't cleared your environment, we previously named it Big20k_df. We will keep that name rather than calling it a fat data frame, for clarity.
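Before touching the big data frame, here is a toy illustration of that idea (not the Lyme data, just three made-up genes): PCA finds orthogonal directions of maximum variance, and summary() reports how much each direction explains.

set.seed(1)
toy <- data.frame(g1 = rnorm(20), g2 = rnorm(20))
toy$g3 <- 2 * toy$g1 + rnorm(20, sd = 0.1)  # g3 nearly duplicates g1, so one direction covers both
toy_pc <- prcomp(toy, center = TRUE, scale. = TRUE)
summary(toy_pc)   # PC1 should carry roughly two-thirds of the variance here
toy_pc$rotation   # loadings: each PC is a linear combination of g1, g2, g3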

fatData <- read.csv('fat20k_Lyme_DF.csv', header=T, na.strings=c('',' ','na','NA'))
dim(fatData)
## [1]    86 19527
Big20k_df <- fatData
rm(fatData)

This is our data to work with; we loaded it as fatData and renamed it Big20k_df. Let's go ahead and compute the principal components. prcomp() is really a very simple function with few parameters, but if you have questions in RStudio, you can query the Help tab on prcomp(). Remember to remove the class factor feature, which sits in the last column.

set.seed(123)
# PCA on the 19,526 gene columns; column 19527 is the class label, so drop it.
# 'scale' partially matches prcomp's 'scale.' argument, standardizing each gene.
pc <- prcomp(Big20k_df[,-19527], center=T, scale=T)
plot(pc)

summary(pc)
## Importance of components:
##                            PC1     PC2      PC3      PC4     PC5      PC6
## Standard deviation     57.3023 45.2799 32.94151 25.61608 24.5226 24.19360
## Proportion of Variance  0.1682  0.1050  0.05557  0.03361  0.0308  0.02998
## Cumulative Proportion   0.1682  0.2732  0.32874  0.36235  0.3931  0.42312
##                            PC7      PC8      PC9    PC10     PC11     PC12
## Standard deviation     23.0032 20.12046 18.63438 17.5635 16.85407 16.72420
## Proportion of Variance  0.0271  0.02073  0.01778  0.0158  0.01455  0.01432
## Cumulative Proportion   0.4502  0.47095  0.48874  0.5045  0.51908  0.53341
##                            PC13     PC14     PC15     PC16     PC17     PC18
## Standard deviation     15.73008 15.71459 15.46974 14.80272 14.73671 14.46296
## Proportion of Variance  0.01267  0.01265  0.01226  0.01122  0.01112  0.01071
## Cumulative Proportion   0.54608  0.55873  0.57098  0.58220  0.59333  0.60404
##                            PC19     PC20     PC21     PC22     PC23     PC24
## Standard deviation     14.31068 13.96785 13.68510 13.62904 13.42979 13.37657
## Proportion of Variance  0.01049  0.00999  0.00959  0.00951  0.00924  0.00916
## Cumulative Proportion   0.61453  0.62452  0.63411  0.64362  0.65286  0.66202
##                            PC25     PC26     PC27     PC28     PC29    PC30
## Standard deviation     13.20408 13.01388 12.82795 12.60898 12.44282 12.4238
## Proportion of Variance  0.00893  0.00867  0.00843  0.00814  0.00793  0.0079
## Cumulative Proportion   0.67095  0.67963  0.68805  0.69620  0.70413  0.7120
##                           PC31     PC32     PC33     PC34     PC35    PC36
## Standard deviation     12.2633 12.16672 11.98442 11.91197 11.83482 11.6943
## Proportion of Variance  0.0077  0.00758  0.00736  0.00727  0.00717  0.0070
## Cumulative Proportion   0.7197  0.72731  0.73467  0.74194  0.74911  0.7561
##                            PC37     PC38     PC39     PC40     PC41     PC42
## Standard deviation     11.62048 11.46630 11.38428 11.35928 11.20916 11.15546
## Proportion of Variance  0.00692  0.00673  0.00664  0.00661  0.00643  0.00637
## Cumulative Proportion   0.76303  0.76976  0.77640  0.78301  0.78944  0.79582
##                            PC43     PC44     PC45     PC46    PC47     PC48
## Standard deviation     11.02531 10.97348 10.89760 10.85652 10.8278 10.75487
## Proportion of Variance  0.00623  0.00617  0.00608  0.00604  0.0060  0.00592
## Cumulative Proportion   0.80204  0.80821  0.81429  0.82033  0.8263  0.83225
##                            PC49     PC50     PC51     PC52    PC53     PC54
## Standard deviation     10.66966 10.53594 10.50910 10.47575 10.4570 10.34719
## Proportion of Variance  0.00583  0.00569  0.00566  0.00562  0.0056  0.00548
## Cumulative Proportion   0.83809  0.84377  0.84943  0.85505  0.8607  0.86613
##                            PC55     PC56     PC57     PC58    PC59    PC60
## Standard deviation     10.24568 10.13388 10.08263 10.00480 9.93180 9.86660
## Proportion of Variance  0.00538  0.00526  0.00521  0.00513 0.00505 0.00499
## Cumulative Proportion   0.87151  0.87677  0.88197  0.88710 0.89215 0.89714
##                           PC61    PC62    PC63   PC64    PC65    PC66    PC67
## Standard deviation     9.81009 9.76697 9.63333 9.5749 9.51237 9.49472 9.41028
## Proportion of Variance 0.00493 0.00489 0.00475 0.0047 0.00463 0.00462 0.00454
## Cumulative Proportion  0.90206 0.90695 0.91170 0.9164 0.92103 0.92565 0.93018
##                           PC68    PC69    PC70    PC71    PC72    PC73    PC74
## Standard deviation     9.35807 9.29721 9.19491 9.04679 9.01475 8.96950 8.93143
## Proportion of Variance 0.00448 0.00443 0.00433 0.00419 0.00416 0.00412 0.00409
## Cumulative Proportion  0.93467 0.93910 0.94343 0.94762 0.95178 0.95590 0.95998
##                           PC75    PC76    PC77    PC78    PC79    PC80    PC81
## Standard deviation     8.89740 8.81122 8.77294 8.59722 8.46056 8.45647 8.35300
## Proportion of Variance 0.00405 0.00398 0.00394 0.00379 0.00367 0.00366 0.00357
## Cumulative Proportion  0.96404 0.96801 0.97196 0.97574 0.97941 0.98307 0.98664
##                           PC82    PC83    PC84    PC85      PC86
## Standard deviation     8.23462 8.18085 8.06189 7.81487 5.433e-14
## Proportion of Variance 0.00347 0.00343 0.00333 0.00313 0.000e+00
## Cumulative Proportion  0.99012 0.99354 0.99687 1.00000 1.000e+00

The first 5 principal components explain 16.8, 10.5, 5.6, 3.4, and 3.1% of the information, which is 39.3% of the variation (the cumulative proportion at PC5). We can use these first five, select more, use only the top one, or use all of them.

It is recommended to use as many components as it takes to explain at least 80% of the variance, or those having eigenvalues > 1 (the Kaiser rule).
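Both cutoffs are easy to read off the prcomp object, since the eigenvalues are just the squared standard deviations it stores. A minimal sketch:

# Eigenvalues of the scaled covariance matrix are the squared component sdevs
eigenvalues <- pc$sdev^2
sum(eigenvalues > 1)  # Kaiser rule: how many PCs have eigenvalue > 1
cum_var <- cumsum(eigenvalues) / sum(eigenvalues)
which(cum_var >= 0.80)[1]  # fewest PCs needed to explain at least 80% of the variance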

Note (skip this if the math would only add confusion; it's fine to skip it). Eigenvalues come from the linear algebra PCA was derived from for data science. The internet has a cleaner definition than my memory from my undergrad Bachelor of Science in math: an eigenvector v of a linear transformation (matrix) T is a direction that T only scales, by a factor lambda, so Tv = lambda*v; for a general matrix, lambda can even come out as a complex number involving sqrt(-1), though the covariance matrices PCA works on always give real, non-negative eigenvalues. You don't really need to know the math exactly to understand how it works. A loose analogy is how radiographs work: the x-rays smash a 3D body down onto a 2D image, with the osseous material they cannot pass through forming the picture, and interpreting x-ray imaging takes specialized training to understand which portion of the body (anterior, posterior, lateral, medial) a given view shows. PCA similarly projects high-dimensional information onto fewer dimensions that take some training to read. (end of math note)

PCA is used to shrink the number of dimensions, and on gene expression data it is typically applied when the number of samples is small. It can be used in unsupervised settings as well, for example reducing many audio frequencies before predicting the class of a song under different conditions of those frequencies. In gene expression analysis it can also be flipped around, with the large set of genes as observations and the conditions applied to each gene (temperature, pH change, etc.) on the columns as features.

We will try the tol and rank. parameters that the help page for prcomp() says can be used together: rank. caps the number of top-ranking principal components returned, while tol sets the smallest magnitude (relative to the first component's standard deviation) a component may have and still be kept in the computation.
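A minimal sketch of the two together on the full data frame; with tol = 0.3, components whose standard deviation falls below 0.3 times PC1's are dropped, which with this data should leave roughly the first ten:

pc_small <- prcomp(Big20k_df[,-19527], rank. = 10, tol = 0.3, center = TRUE, scale. = TRUE)
summary(pc_small)  # at most 10 PCs, each with sdev > 0.3 * sdev(PC1)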

Let's make our training and testing sets first, build the model on the 80% training set, and then test on the remaining 20% of samples.

set.seed(123)

# 80/20 split: sample() truncates .8*86 = 68.8 down to 68 row indices
intrain <- sample(1:86, .8*86)
training <- Big20k_df[intrain,]
testing <- Big20k_df[-intrain,]

Doing this helps keep the model from overfitting and avoids data leakage: building a model on the same data you test on makes the accuracy look better than it really is and doesn't represent real-world populations of samples. Held out properly, the test set gives an honest (usually lower) accuracy. Note: the tol parameter set to 1 gave an error, so it was removed.

pc_train <- prcomp(training[,-19527], rank=10, center=T, scale=T)
summary(pc_train)
## Importance of first k=10 (out of 68) components:
##                            PC1     PC2      PC3      PC4      PC5      PC6
## Standard deviation     56.3091 45.9352 32.17554 26.54574 25.20717 24.74766
## Proportion of Variance  0.1624  0.1081  0.05302  0.03609  0.03254  0.03137
## Cumulative Proportion   0.1624  0.2705  0.32347  0.35956  0.39210  0.42346
##                             PC7      PC8      PC9     PC10
## Standard deviation     22.64715 20.90406 19.70915 18.80862
## Proportion of Variance  0.02627  0.02238  0.01989  0.01812
## Cumulative Proportion   0.44973  0.47211  0.49200  0.51012

It kept 10 principal components, set by the rank parameter equal to 10. Together these explain about 51% of the information in the training data.

plot(pc_train)

Each bar above shows the variance of one of the top 10 principal components, in descending order from left to right.

pc_test <- prcomp(testing[,-19527], rank=10, center=T, scale=T)
summary(pc_test)
## Importance of first k=10 (out of 18) components:
##                            PC1     PC2      PC3      PC4      PC5      PC6
## Standard deviation     72.3680 50.4673 44.06333 36.00175 33.83106 30.75806
## Proportion of Variance  0.2682  0.1304  0.09944  0.06638  0.05862  0.04845
## Cumulative Proportion   0.2682  0.3987  0.49809  0.56447  0.62308  0.67153
##                             PC7      PC8      PC9     PC10
## Standard deviation     29.25880 26.74365 25.52624 25.22356
## Proportion of Variance  0.04384  0.03663  0.03337  0.03258
## Cumulative Proportion   0.71538  0.75201  0.78538  0.81796
plot(pc_test)

# Project each set onto its own PCA (note: pc_test was fit on the testing data
# itself, so the train and test scores live in two different coordinate systems)
train_pca_coords <- predict(pc_train, newdata = training[,-19527])
test_pca_coords <- predict(pc_test, newdata = testing[,-19527])
# Attach the class labels back as the Target column
trainedModel <- data.frame(Target=training$class, train_pca_coords)
testedModel <- data.frame(Target=testing$class, test_pca_coords)
trainedModel$Target <- as.factor(trainedModel$Target)
testedModel$Target <- as.factor(testedModel$Target)
# Random forest on the 10 PC scores (columns 2:11), class labels in column 1
model_fit <- randomForest(trainedModel[,2:11],trainedModel[,1],mtry=3, ntree=500, replace=T, importance=T, keep.forest = T, norm.votes=T) 
summary(model_fit)
##                 Length Class  Mode     
## call               9   -none- call     
## type               1   -none- character
## predicted         68   factor numeric  
## err.rate        2500   -none- numeric  
## confusion         20   -none- numeric  
## votes            272   matrix numeric  
## oob.times         68   -none- numeric  
## classes            4   -none- character
## importance        60   -none- numeric  
## importanceSD      50   -none- numeric  
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y                 68   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL
model_fit$confusion
##         1 month 6 month acute healthy class.error
## 1 month       7       0     8       4   0.6315789
## 6 month       3       4     1       1   0.5555556
## acute         8       0    14       1   0.3913043
## healthy       1       1     2      13   0.2352941

The class.error column gives per-class error rates: 63% for one month infected, 56% for six months infected, 39% for acute infection, and 24% for healthy, so healthy was classified best.
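Since class.error is an error rate, per-class accuracy is just its complement; a quick check:

# Per-class accuracy is 1 minus the class.error column of the confusion matrix
1 - model_fit$confusion[, "class.error"]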

model_fit$predicted
##      31      79      51      14      67      42      50      43      83      25 
##   acute 1 month   acute 6 month 1 month   acute   acute   acute 6 month 1 month 
##      69      57       9      72      26       7      81      74      36      85 
## 1 month   acute healthy 1 month healthy healthy 1 month   acute 1 month 6 month 
##      15      32      70      45      71      76      41      10      23      27 
## 1 month 1 month   acute   acute 1 month healthy 1 month healthy 1 month   acute 
##      53      62      56      75      65      38      77      34      29       5 
## healthy 1 month   acute healthy   acute   acute   acute   acute   acute healthy 
##       8      12      13      18      33      54      80      84      21      66 
## healthy healthy healthy healthy 1 month 1 month 6 month 6 month healthy 1 month 
##      73      86      16      30       6      11      46      22      48      55 
## healthy healthy   acute 1 month healthy healthy 1 month   acute   acute   acute 
##      49      17      28      82      40       2       4      44 
##   acute   acute   acute 1 month   acute healthy healthy   acute 
## Levels: 1 month 6 month acute healthy
predicted_trained_DF_pca <- data.frame(predicted=model_fit$predicted, actual=training$class)
predicted_trained_DF_pca
##    predicted  actual
## 31     acute   acute
## 79   1 month 6 month
## 51     acute 1 month
## 14   6 month healthy
## 67   1 month 1 month
## 42     acute   acute
## 50     acute 1 month
## 43     acute   acute
## 83   6 month 6 month
## 25   1 month   acute
## 69   1 month 1 month
## 57     acute 1 month
## 9    healthy healthy
## 72   1 month 1 month
## 26   healthy   acute
## 7    healthy healthy
## 81   1 month 6 month
## 74     acute 1 month
## 36   1 month   acute
## 85   6 month 6 month
## 15   1 month healthy
## 32   1 month   acute
## 70     acute 1 month
## 45     acute   acute
## 71   1 month 1 month
## 76   healthy 1 month
## 41   1 month   acute
## 10   healthy healthy
## 23   1 month   acute
## 27     acute   acute
## 53   healthy 1 month
## 62   1 month 1 month
## 56     acute 1 month
## 75   healthy 1 month
## 65     acute 1 month
## 38     acute   acute
## 77     acute 6 month
## 34     acute   acute
## 29     acute   acute
## 5    healthy healthy
## 8    healthy healthy
## 12   healthy healthy
## 13   healthy healthy
## 18   healthy healthy
## 33   1 month   acute
## 54   1 month 1 month
## 80   6 month 6 month
## 84   6 month 6 month
## 21   healthy healthy
## 66   1 month 1 month
## 73   healthy 1 month
## 86   healthy 6 month
## 16     acute healthy
## 30   1 month   acute
## 6    healthy healthy
## 11   healthy healthy
## 46   1 month   acute
## 22     acute   acute
## 48     acute   acute
## 55     acute 1 month
## 49     acute   acute
## 17     acute healthy
## 28     acute   acute
## 82   1 month 6 month
## 40     acute   acute
## 2    healthy healthy
## 4    healthy healthy
## 44     acute   acute

The overall accuracy in class prediction for four classes using PCA on training data only was:

sum(predicted_trained_DF_pca$predicted==predicted_trained_DF_pca$actual)/length(predicted_trained_DF_pca$predicted)
## [1] 0.5588235

[1] 0.5588235, or 56%, using only the top 10 principal components, which explained only about half of the variation in the data to begin with. Accuracy might be higher if all principal components were used.

Lets see how well it predicts the hold out data set of 20% original data in the testing data. I cannot fit these variables as labeled to fit to the testing data of actual features. Still figuring out how to do that because these models built on error and not the actual features themselves to predict class. Results won’t provide best genes this way. It will just predict the best class possible by using the noise in the data. But lets see results anyhow.
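For reference, the usual remedy is to keep one coordinate system: project the held-out samples through the training PCA and score them with the forest that is already fit. A minimal hedged sketch using the pc_train and model_fit objects from above (this is not the method used below):

# Rotate the test samples into the training PC space, then classify them
# with the forest already trained on the training PCs
test_in_train_space <- as.data.frame(predict(pc_train, newdata = testing[,-19527]))
held_out_pred <- predict(model_fit, newdata = test_in_train_space)
table(predicted = held_out_pred, actual = testing$class)  # honest hold-out confusion table

Below, instead, a fresh PCA and a fresh forest are fit on the 18 test samples alone.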

model_fit_testing <- randomForest(testedModel[,2:11],testedModel[,1],mtry=3, ntree=500, replace=T, importance=T, keep.forest = T, norm.votes=T) 
summary(model_fit_testing)
##                 Length Class  Mode     
## call               9   -none- call     
## type               1   -none- character
## predicted         18   factor numeric  
## err.rate        2500   -none- numeric  
## confusion         20   -none- numeric  
## votes             72   matrix numeric  
## oob.times         18   -none- numeric  
## classes            4   -none- character
## importance        60   -none- numeric  
## importanceSD      50   -none- numeric  
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y                 18   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL
model_fit_testing$confusion
##         1 month 6 month acute healthy class.error
## 1 month       3       0     4       1       0.625
## 6 month       1       0     0       0       1.000
## acute         4       0     0       1       1.000
## healthy       3       0     1       0       1.000

The results were poor: 100% error for three of the four classes, with only the one month infection class doing better at 63% error. (Note this forest was both trained and evaluated, out-of-bag, on just the 18 test samples.)

testingPC_df <- data.frame(predicted=model_fit_testing$predicted, actual=testedModel$Target)
testingPC_df
##    predicted  actual
## 1    1 month healthy
## 3    1 month healthy
## 19   1 month healthy
## 20     acute healthy
## 24   1 month   acute
## 35   1 month   acute
## 37   1 month   acute
## 39   1 month   acute
## 47   healthy   acute
## 52     acute 1 month
## 58   1 month 1 month
## 59   1 month 1 month
## 60     acute 1 month
## 61   healthy 1 month
## 63     acute 1 month
## 64   1 month 1 month
## 68     acute 1 month
## 78   1 month 6 month

The overall accuracy comes out to:

sum(testingPC_df$predicted==testingPC_df$actual)/length(testingPC_df$predicted)
## [1] 0.1666667

Overall accuracy in predicting the 4 classes was 17%.
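For scale, a quick hedged baseline: always guessing the most common class in the test split (8 of the 18 samples are one month, about 44%) would already beat this.

# Majority-class baseline accuracy on the held-out samples, for comparison
max(table(testedModel$Target)) / nrow(testedModel)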

The whole idea of PCA is transforming the space the data lives in. Since it is built on the variance in the data rather than on individual genes, it cannot by itself point to the actual genes that are the best predictors for finding a target gene, but it can be used to classify samples, as an alternative to classifying on specific genes whose abilities and functions contribute to pathologies and which would be the target genes for gene therapy. This makes it more like a test of whether a sample is infected and how far along the infection is. Better results might come from using all the principal components, which explain all of the information. PCA is also beneficial in saving time and reducing the dimensionality of the space, but those reduced dimensions are combinations of the features rather than the feature space itself. Great work in understanding PCA.
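That said, the loadings do tie components back to genes: each PC is a weighted sum of the genes, and prcomp() stores those weights in the rotation matrix. A minimal sketch, using the pc_train object from above:

# Genes with the largest absolute weight (loading) on PC1 of the training PCA
loads_pc1 <- pc_train$rotation[, "PC1"]
head(sort(abs(loads_pc1), decreasing = TRUE), 10)  # top 10 genes driving PC1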

Let's revisit the pc object fit on the full data and use all of its principal components to predict sample type.

summary(pc)
## (output identical to the summary(pc) importance table shown earlier)
resultsAll_pc <- predict(pc, newdata=Big20k_df[,-19527])
model_fit_all <- randomForest(resultsAll_pc[,2:11],resultsAll_pc[,1],mtry=3, ntree=500, replace=T, importance=T, keep.forest = T, norm.votes=T) 
summary(model_fit_all)
##                 Length Class  Mode     
## call              9    -none- call     
## type              1    -none- character
## predicted        86    -none- numeric  
## mse             500    -none- numeric  
## rsq             500    -none- numeric  
## oob.times        86    -none- numeric  
## importance       20    -none- numeric  
## importanceSD     10    -none- numeric  
## localImportance   0    -none- NULL     
## proximity         0    -none- NULL     
## ntree             1    -none- numeric  
## mtry              1    -none- numeric  
## forest           11    -none- list     
## coefs             0    -none- NULL     
## y                86    -none- numeric  
## test              0    -none- NULL     
## inbag             0    -none- NULL
Notice the summary above shows mse and rsq: because the first column of resultsAll_pc (the numeric PC1 scores) was passed as the response, randomForest quietly fit a regression of PC1 on PCs 2 through 11 instead of a classifier; the class labels were never attached. We fix that by adding the Target column back and refitting.

trainingModel_all <- data.frame(Target=as.factor(Big20k_df$class), resultsAll_pc)
# Columns 2:86 of trainingModel_all are PC1 through PC85 (PC86 carries essentially zero variance)
model_fit_all <- randomForest(trainingModel_all[,2:86],trainingModel_all[,1],mtry=3, ntree=500, replace=T, importance=T, keep.forest = T, norm.votes=T) 
summary(model_fit_all)
##                 Length Class  Mode     
## call               9   -none- call     
## type               1   -none- character
## predicted         86   factor numeric  
## err.rate        2500   -none- numeric  
## confusion         20   -none- numeric  
## votes            344   matrix numeric  
## oob.times         86   -none- numeric  
## classes            4   -none- character
## importance       510   -none- numeric  
## importanceSD     425   -none- numeric  
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y                 86   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL
plot(model_fit_all)

model_fit_all$confusion
##         1 month 6 month acute healthy class.error
## 1 month      11       0    14       2   0.5925926
## 6 month       3       0     5       2   1.0000000
## acute         6       0    19       3   0.3214286
## healthy      12       0     6       3   0.8571429
allData_PCA_results_df <- data.frame(predicted=model_fit_all$predicted, actual=Big20k_df$class)
allData_PCA_results_df
##    predicted  actual
## 1    1 month healthy
## 2    1 month healthy
## 3    healthy healthy
## 4      acute healthy
## 5      acute healthy
## 6    healthy healthy
## 7    healthy healthy
## 8    1 month healthy
## 9    1 month healthy
## 10   1 month healthy
## 11   1 month healthy
## 12   1 month healthy
## 13   1 month healthy
## 14   1 month healthy
## 15   1 month healthy
## 16     acute healthy
## 17     acute healthy
## 18   1 month healthy
## 19     acute healthy
## 20     acute healthy
## 21   1 month healthy
## 22     acute   acute
## 23     acute   acute
## 24     acute   acute
## 25     acute   acute
## 26     acute   acute
## 27     acute   acute
## 28     acute   acute
## 29     acute   acute
## 30   1 month   acute
## 31   1 month   acute
## 32     acute   acute
## 33     acute   acute
## 34     acute   acute
## 35     acute   acute
## 36   1 month   acute
## 37     acute   acute
## 38     acute   acute
## 39   1 month   acute
## 40   1 month   acute
## 41     acute   acute
## 42     acute   acute
## 43   healthy   acute
## 44   healthy   acute
## 45   1 month   acute
## 46     acute   acute
## 47     acute   acute
## 48     acute   acute
## 49   healthy   acute
## 50     acute 1 month
## 51     acute 1 month
## 52     acute 1 month
## 53     acute 1 month
## 54   1 month 1 month
## 55     acute 1 month
## 56     acute 1 month
## 57     acute 1 month
## 58     acute 1 month
## 59     acute 1 month
## 60   1 month 1 month
## 61   healthy 1 month
## 62   1 month 1 month
## 63     acute 1 month
## 64     acute 1 month
## 65   1 month 1 month
## 66   1 month 1 month
## 67     acute 1 month
## 68   1 month 1 month
## 69   1 month 1 month
## 70     acute 1 month
## 71   1 month 1 month
## 72   1 month 1 month
## 73   1 month 1 month
## 74     acute 1 month
## 75   1 month 1 month
## 76   healthy 1 month
## 77   healthy 6 month
## 78   1 month 6 month
## 79   1 month 6 month
## 80     acute 6 month
## 81   healthy 6 month
## 82   1 month 6 month
## 83     acute 6 month
## 84     acute 6 month
## 85     acute 6 month
## 86     acute 6 month
sum(allData_PCA_results_df$predicted==allData_PCA_results_df$actual)/length(allData_PCA_results_df$predicted)
## [1] 0.3837209

A copy of the same calculation from the RStudio session gave: [1] 0.3953488. The two values differ slightly because randomForest draws random bootstrap samples and variable subsets on every run, and no seed was set immediately before this fit.
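A minimal sketch of making the fit reproducible:

set.seed(123)  # fix the RNG right before fitting so reruns give identical forests
model_fit_all <- randomForest(trainingModel_all[,2:86], trainingModel_all[,1], mtry=3, ntree=500)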

Using the principal components from the prcomp() run on all 86 samples (85 of the 86 components, explaining essentially all the variation in the 19,526-gene expression data) in a random forest with 500 trees and an mtry of 3 gave only about 38-40% accuracy on the four-class target, even though all observations were used rather than the 80% training and 20% testing split.

Copying the confusion matrix from that run:

        1 month 6 month acute healthy class.error
1 month      11       0    14       2   0.5925926
6 month       4       0     5       1   1.0000000
acute         9       0    18       1   0.3571429
healthy      11       0     5       5   0.7619048

The model does fail at predicting all 4 classes well. The confusion matrix shows it failed 100% of the time on the samples infected six months; acute was predicted best at 36% error, then one month at 59% error, and healthy at 76% error. This data wasn't balanced, and that is a big problem to begin with: there are fewer cases of six month infection and many more healthy, acute, and one month infected cases.

Big20k_df %>% group_by(class) %>% count(class)
## # A tibble: 4 × 2
## # Groups:   class [4]
##   class       n
##   <chr>   <int>
## 1 1 month    27
## 2 6 month    10
## 3 acute      28
## 4 healthy    21
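One hedged way to push back on that imbalance within the randomForest package is stratified sampling, drawing the same number of cases from every class for each tree (8 here, just under the smallest class size of 10); this is a sketch, not something run above:

# Balanced bootstrap: each tree samples 8 cases per class instead of favoring the big classes
model_bal <- randomForest(trainingModel_all[,2:86], trainingModel_all[,1],
                          ntree=500, mtry=3,
                          strata=trainingModel_all[,1], sampsize=rep(8, 4))
model_bal$confusion  # compare per-class error against the unbalanced fit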

PCA is great for reducing dimensions, but because its components are built from the overall variance in the data rather than from individual genes, in this project it was not useful for gene-targeted analysis of the Lyme disease RNA sequencing data, with a disproportionate number of samples across the 4 classes of our target feature. It may do better with equal samples per class and no outliers, or when tuned to handle outliers in a certain algorithmic way. It can handle wide data that none of the other models I have worked with recently can, just not for our intended purpose. Using fold-change values of up- and down-regulated genes with random forest, in the caret package or the randomForest package, still seems the best approach for finding the top genes for therapeutic strategies of gene therapy, where protein-coding genes are defective, absent, or overly present.

Gene therapeutics are a promising way to use stem cells to create the gene products and proteins the body needs, like platelet-rich plasma (PRP) for cartilage in the joints: the hyaline cartilage-lined bone and the labral or meniscoid tissue in the discs, the glenohumeral joint labrum, the femoral-tibial lateral and medial condyle menisci, and the femoral-acetabular joints with the labrum. The point of finding gene products in pathologies is to fix those defective genes and bring the body back to good health.

To those in evil laboratories with unethical human and animal testing, the same knowledge could be used to silently eliminate enemies by knowing their conditions of survival and the routes of exposure: inhalation, intravenous, oral ingestion, transcutaneous application, and so on. That is the problem with biological warfare; if it is a covert mission, how is anybody to know they were violated or infected? Finding a timeline of infection is a great point, and it is why Lyme disease was of interest: many people have it and don't recall a typical way of getting it. If transmission takes about 36 hours of the deer tick staying latched to the human skin undetected before infection reaches a threshold, could some biohackers out there be eliminating the people they hate because they refuse to interact peacefully with them? It is a valid concern, as we now have news of genes detected in prostate and breast cancer that lead to variations, possibly copy number variants, arising at the post-transcription stage where introns are spliced out of mRNA. It is odd that young people, as outliers, can get certain cancers, and many had conflicts documented on social media with healthcare providers before dying or succumbing to the gene defect. We know environmental hazards and substances do cause cancers; California keeps such products marketable to consumers as long as they carry a Prop 65 carcinogen label. Petrol and petrol products are known carcinogens, and unless you live off the grid somewhere, the air isn't filtering out the toxic pollution of heavy petrol dependence you are exposed to. We know viral infections can cause gene mutations, like the herpes family or Epstein-Barr virus among many. We also know that stress of any sort weakens immunity, whether from emotional or psychological trauma, financial difficulties, fear, worry, panic, or anxiety.

Ingested toxins like aflatoxins in moldy nuts, as well as inhaled exposures like bacteria from infected bat and bird droppings, black mold on construction sites, bacteria in soil, quartz dust for granite workers, and the coal miner's black lung, can lead to chronic infection, weakened immunity, and inflammation that can bring on lung cancer. Since gene expression analysis is meant to find the target genes and use that concept of creating the gene products and protein products to introduce to patients who have those defective genes, have there been cases where this was done? Yes: Netflix had a documentary on a man with leukemia whose blood cancer was cured with stem cell therapy that corrected his defective white and red blood cells. There are others, but you need stem cell treatment. In the US, I don't think limb-saving stem cell treatments are approved; people pay $10k+ per dosage with limited improvement, maybe 5-30% more cartilage if treated before the cartilage is completely gone, some 50-75% better. For cancer therapy it is usually covered, as a last resort when chemotherapy doesn't work. You will have to explore that yourselves.

In this document we discovered that although PCA can make a model for prediction, it doesn't identify the target genes. Perhaps with an interactive scatter plot, manually or with some engineering, you could pull out the genes that sit closest to the primary principal component, but that was not uncovered in this project. The ggplot2 package, or Python's matplotlib through the reticulate package in R, can produce plots where the genes contributing most to a component stand out in multidimensional space (see the sketch below). Even then, a gene sitting close to a principal component only means it follows the dominant variance pattern with little change within samples; a target gene needs to show obvious changes between the sample classes to be implicated in a pathology. No resources are cited. I have 3 years of chiropractic medical school (poor at the chiropractic procedures portion, as I never saw it outperform regular massage combined with exercise and stretching) that covered pathologies, anatomy, infection and immunology, and diagnostic treatment heavily, as well as a Bachelor of Science in math and economics and a Master of Science in data science.
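A minimal static sketch of that idea with ggplot2, using the pc object from above (plotly::ggplotly() could make the same plot hoverable; this was not run as part of the project):

library(ggplot2)
# Each point is a gene, placed by its weight (loading) on the first two PCs;
# genes far from the origin along PC1 drive the dominant variance pattern.
loadings_df <- data.frame(gene = rownames(pc$rotation),
                          PC1 = pc$rotation[, 1],
                          PC2 = pc$rotation[, 2])
ggplot(loadings_df, aes(PC1, PC2)) + geom_point(alpha = 0.3)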