This is an extension of the earlier post on using randomForest with the Lyme disease gene expression data, the wide data frame too large to fit into Google Sheets. In this project we test the base R function for Principal Component Analysis, or PCA.

Here we try out the randomForest package without the caret package to see how it does on the Lyme disease data we used earlier, which had an imbalanced number of classes. You can read about the randomForest package at this link.

library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

*** January 4, 2026: PCA with the base R function prcomp() ***

PCA is a way to take that fat or wide 86 x 19,527 data frame, too large to upload to Google Sheets, and summarize it using the variances and covariances among all the genes. It treats each sample as a point in the space of all the genes and finds the best-fit direction through that cloud of points, the one with the smallest error, which therefore explains the most variation in the data; that direction is the first principal component. It then repeats the process with additional principal components until all of the variation in the data is accounted for. Generally the top 5 principal components explain a good share of the information, something like 40-50%, but it depends on the data. We are going to use the very wide dataset by reading it into R (if you ran this code earlier and have since cleared your environment) or by simply calling our existing data frame (if you haven't cleared the clutter). We previously named it Big20k_df, and we will keep that name rather than calling it a fat data frame, for clarity.

fatData <- read.csv('fat20k_Lyme_DF.csv', header=T, na.strings=c('',' ','na','NA'))
dim(fatData)
## [1]    86 19527
Big20k_df <- fatData
rm(fatData)

This is our data to work with, read in as fatData and renamed Big20k_df. Let's go ahead and compute the principal components. prcomp() is really a very simple function with not many parameters, but if you have questions, you can look it up in the Help tab in RStudio. Remember to remove the class factor column at the end of the data frame before running it.

pc <- prcomp(Big20k_df[,-19527], center=T, scale=T)
summary(pc)
## Importance of components:
##                            PC1     PC2      PC3      PC4     PC5      PC6
## Standard deviation     57.3023 45.2799 32.94151 25.61608 24.5226 24.19360
## Proportion of Variance  0.1682  0.1050  0.05557  0.03361  0.0308  0.02998
## Cumulative Proportion   0.1682  0.2732  0.32874  0.36235  0.3931  0.42312
##                            PC7      PC8      PC9    PC10     PC11     PC12
## Standard deviation     23.0032 20.12046 18.63438 17.5635 16.85407 16.72420
## Proportion of Variance  0.0271  0.02073  0.01778  0.0158  0.01455  0.01432
## Cumulative Proportion   0.4502  0.47095  0.48874  0.5045  0.51908  0.53341
##                            PC13     PC14     PC15     PC16     PC17     PC18
## Standard deviation     15.73008 15.71459 15.46974 14.80272 14.73671 14.46296
## Proportion of Variance  0.01267  0.01265  0.01226  0.01122  0.01112  0.01071
## Cumulative Proportion   0.54608  0.55873  0.57098  0.58220  0.59333  0.60404
##                            PC19     PC20     PC21     PC22     PC23     PC24
## Standard deviation     14.31068 13.96785 13.68510 13.62904 13.42979 13.37657
## Proportion of Variance  0.01049  0.00999  0.00959  0.00951  0.00924  0.00916
## Cumulative Proportion   0.61453  0.62452  0.63411  0.64362  0.65286  0.66202
##                            PC25     PC26     PC27     PC28     PC29    PC30
## Standard deviation     13.20408 13.01388 12.82795 12.60898 12.44282 12.4238
## Proportion of Variance  0.00893  0.00867  0.00843  0.00814  0.00793  0.0079
## Cumulative Proportion   0.67095  0.67963  0.68805  0.69620  0.70413  0.7120
##                           PC31     PC32     PC33     PC34     PC35    PC36
## Standard deviation     12.2633 12.16672 11.98442 11.91197 11.83482 11.6943
## Proportion of Variance  0.0077  0.00758  0.00736  0.00727  0.00717  0.0070
## Cumulative Proportion   0.7197  0.72731  0.73467  0.74194  0.74911  0.7561
##                            PC37     PC38     PC39     PC40     PC41     PC42
## Standard deviation     11.62048 11.46630 11.38428 11.35928 11.20916 11.15546
## Proportion of Variance  0.00692  0.00673  0.00664  0.00661  0.00643  0.00637
## Cumulative Proportion   0.76303  0.76976  0.77640  0.78301  0.78944  0.79582
##                            PC43     PC44     PC45     PC46    PC47     PC48
## Standard deviation     11.02531 10.97348 10.89760 10.85652 10.8278 10.75487
## Proportion of Variance  0.00623  0.00617  0.00608  0.00604  0.0060  0.00592
## Cumulative Proportion   0.80204  0.80821  0.81429  0.82033  0.8263  0.83225
##                            PC49     PC50     PC51     PC52    PC53     PC54
## Standard deviation     10.66966 10.53594 10.50910 10.47575 10.4570 10.34719
## Proportion of Variance  0.00583  0.00569  0.00566  0.00562  0.0056  0.00548
## Cumulative Proportion   0.83809  0.84377  0.84943  0.85505  0.8607  0.86613
##                            PC55     PC56     PC57     PC58    PC59    PC60
## Standard deviation     10.24568 10.13388 10.08263 10.00480 9.93180 9.86660
## Proportion of Variance  0.00538  0.00526  0.00521  0.00513 0.00505 0.00499
## Cumulative Proportion   0.87151  0.87677  0.88197  0.88710 0.89215 0.89714
##                           PC61    PC62    PC63   PC64    PC65    PC66    PC67
## Standard deviation     9.81009 9.76697 9.63333 9.5749 9.51237 9.49472 9.41028
## Proportion of Variance 0.00493 0.00489 0.00475 0.0047 0.00463 0.00462 0.00454
## Cumulative Proportion  0.90206 0.90695 0.91170 0.9164 0.92103 0.92565 0.93018
##                           PC68    PC69    PC70    PC71    PC72    PC73    PC74
## Standard deviation     9.35807 9.29721 9.19491 9.04679 9.01475 8.96950 8.93143
## Proportion of Variance 0.00448 0.00443 0.00433 0.00419 0.00416 0.00412 0.00409
## Cumulative Proportion  0.93467 0.93910 0.94343 0.94762 0.95178 0.95590 0.95998
##                           PC75    PC76    PC77    PC78    PC79    PC80    PC81
## Standard deviation     8.89740 8.81122 8.77294 8.59722 8.46056 8.45647 8.35300
## Proportion of Variance 0.00405 0.00398 0.00394 0.00379 0.00367 0.00366 0.00357
## Cumulative Proportion  0.96404 0.96801 0.97196 0.97574 0.97941 0.98307 0.98664
##                           PC82    PC83    PC84    PC85      PC86
## Standard deviation     8.23462 8.18085 8.06189 7.81487 5.433e-14
## Proportion of Variance 0.00347 0.00343 0.00333 0.00313 0.000e+00
## Cumulative Proportion  0.99012 0.99354 0.99687 1.00000 1.000e+00

The first 5 principal components explain 16.8, 10.5, 5.6, 3.4, and 3.1% of the information, which comes to 39.3% of the data (the cumulative proportion at PC5). We can use these first five, select more, use only the top one, or use all of them.

It is recommended to use as many as it takes to explain at least 80% of the data, or those having eigenvalues > 1.
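As a quick check of those rules of thumb, here is a minimal sketch using the pc object from the prcomp() call above (assuming it is still in your environment):

# proportion of variance explained by each component and the running total
var_explained <- pc$sdev^2 / sum(pc$sdev^2)
cum_var <- cumsum(var_explained)

# smallest number of components needed to reach 80% of the variance
which(cum_var >= 0.80)[1]

# number of components with eigenvalue > 1
# (on scaled data the eigenvalues are the squared standard deviations, pc$sdev^2)
sum(pc$sdev^2 > 1)

From the summary output above, the cumulative proportion first crosses 80% at PC43.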

Note: (skip this if the math would just add confusion; it is fine not to worry about it). Eigenvalues come from the linear algebra that PCA is built on. The internet has a cleaner definition than my memory from a math bachelor of science: if a matrix T linearly transforms an eigenvector v, the result is just v scaled by a number lambda (l), so Tv = lv, and that lambda is the eigenvalue. For a general matrix, lambda can even come out as a complex number (involving sqrt(-1)), but the covariance or correlation matrix used in PCA is symmetric, so its eigenvalues are all real and non-negative, and each one equals the variance explained by its principal component. You don't really need to know the math exactly to understand how PCA works. I think of it a bit like radiographs: an x-ray smashes a 3D body down onto a 2D image, and it takes specialized training to read which portion of the body is anterior, posterior, lateral, or medial in that view. In the same way, PCA squashes a high-dimensional space down to a few axes that take some practice to interpret. (End of math note.)
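If you want to see the eigenvalue connection concretely, here is a minimal sketch on the small built-in USArrests dataset (used purely for illustration): the eigenvalues of the correlation matrix are the same numbers as the squared standard deviations that prcomp() reports when the data are centered and scaled.

# eigenvalues of the correlation matrix
ev <- eigen(cor(USArrests))$values

# squared standard deviations from prcomp() on the same centered, scaled data
pc_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)
sd2 <- pc_demo$sdev^2

round(ev, 4)
round(sd2, 4)   # the two vectors should match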

PCA is used to shrink the number of dimensions, and on gene expression data it is typically applied when the number of samples is small. It can be used in unsupervised settings as well, for example when dealing with many frequencies in predicting the class of a song from different conditions of those frequencies, or in gene expression analysis with a large number of genes as the observations and the conditions applied to each gene (temperature, pH change, etc.) as the features on the columns.

We will try the tol and rank parameters. The help page for prcomp() says they can be used together: rank. caps how many top-ranking principal components are returned, while tol drops any component whose standard deviation falls at or below a chosen fraction of the first component's standard deviation.
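Here is a minimal sketch of how the two parameters behave, again on the small USArrests dataset rather than our big data frame (note the formal argument names are rank. and tol; writing rank= or scale= only works through R's partial argument matching):

# keep at most 3 components
prcomp(USArrests, scale. = TRUE, rank. = 3)

# additionally drop any component whose standard deviation is at or below
# 40% of the first component's standard deviation
prcomp(USArrests, scale. = TRUE, rank. = 3, tol = 0.4)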

Let's make our training and testing sets first, so we can build the model on a training set of 80% of the samples and then test it on the remaining 20%.

set.seed(123)  # make the random 80/20 split reproducible

intrain <- sample(1:86, .8*86)      # draw 68 of the 86 row indices (80%) for training
training <- Big20k_df[intrain,]     # 68 training samples
testing <- Big20k_df[-intrain,]     # the remaining 18 samples held out for testing

Doing this helps keep the model from overfitting and avoids data leakage: if you build a model on all the data you have and then test it on that same data, the accuracy estimate won't represent how the model performs on real-world samples, and performance on genuinely new data can end up very low. The tol parameter set to 1 gave an error, so it was removed.

pc_train <- prcomp(training[,-19527], rank=10, center=T, scale=T)
summary(pc_train)
## Importance of first k=10 (out of 68) components:
##                            PC1     PC2      PC3      PC4      PC5      PC6
## Standard deviation     56.3091 45.9352 32.17554 26.54574 25.20717 24.74766
## Proportion of Variance  0.1624  0.1081  0.05302  0.03609  0.03254  0.03137
## Cumulative Proportion   0.1624  0.2705  0.32347  0.35956  0.39210  0.42346
##                             PC7      PC8      PC9     PC10
## Standard deviation     22.64715 20.90406 19.70915 18.80862
## Proportion of Variance  0.02627  0.02238  0.01989  0.01812
## Cumulative Proportion   0.44973  0.47211  0.49200  0.51012

It used 10 principal components, set by the rank parameter equal to 10. These explain about 51% of the information in the training data (the cumulative proportion at PC10).

plot(pc_train)

Each bar above shows the variance of one of the top 10 principal components, in descending order from left to right.
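If you prefer percentages to raw variances, here is a small optional sketch (assuming pc_train is still in your environment) that plots the proportion of variance explained by the 10 retained components:

# percent of total variance explained by each of the 10 retained components
pve <- pc_train$sdev[1:10]^2 / sum(pc_train$sdev^2)
barplot(100 * pve,
        names.arg = paste0("PC", 1:10),
        ylab = "Percent of variance explained",
        main = "Top 10 training principal components")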

pc_test <- prcomp(testing[,-19527], rank=10, center=T, scale=T)
summary(pc_test)
## Importance of first k=10 (out of 18) components:
##                            PC1     PC2      PC3      PC4      PC5      PC6
## Standard deviation     72.3680 50.4673 44.06333 36.00175 33.83106 30.75806
## Proportion of Variance  0.2682  0.1304  0.09944  0.06638  0.05862  0.04845
## Cumulative Proportion   0.2682  0.3987  0.49809  0.56447  0.62308  0.67153
##                             PC7      PC8      PC9     PC10
## Standard deviation     29.25880 26.74365 25.52624 25.22356
## Proportion of Variance  0.04384  0.03663  0.03337  0.03258
## Cumulative Proportion   0.71538  0.75201  0.78538  0.81796
plot(pc_test)

# coordinates of each sample on its own PCA's top 10 components
train_pca_coords <- predict(pc_train, newdata = training[,-19527])
test_pca_coords <- predict(pc_test, newdata = testing[,-19527])
# data frames with the class label as Target plus the PC1-PC10 coordinates
trainedModel <- data.frame(Target=training$class, train_pca_coords)
testedModel <- data.frame(Target=testing$class, test_pca_coords)
trainedModel$Target <- as.factor(trainedModel$Target)
testedModel$Target <- as.factor(testedModel$Target)
# random forest on the 10 training PC coordinates (columns 2:11) with Target (column 1) as the response
model_fit <- randomForest(trainedModel[,2:11],trainedModel[,1],mtry=3, ntree=500, replace=T, importance=T, keep.forest = T, norm.votes=T) 
summary(model_fit)
##                 Length Class  Mode     
## call               9   -none- call     
## type               1   -none- character
## predicted         68   factor numeric  
## err.rate        2500   -none- numeric  
## confusion         20   -none- numeric  
## votes            272   matrix numeric  
## oob.times         68   -none- numeric  
## classes            4   -none- character
## importance        60   -none- numeric  
## importanceSD      50   -none- numeric  
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y                 68   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL
model_fit$confusion
##         1 month 6 month acute healthy class.error
## 1 month       7       0     8       4   0.6315789
## 6 month       3       4     1       1   0.5555556
## acute         8       0    14       1   0.3913043
## healthy       1       1     2      13   0.2352941

The out-of-bag class error rates for the 4 classes were 63% for one month infected, 56% for six months infected, 39% for acute infection, and 24% for healthy.
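We can also pull the overall out-of-bag accuracy straight from the confusion counts (dropping the class.error column); it should agree with the manual calculation further down:

# overall out-of-bag accuracy from the confusion matrix counts
cm <- model_fit$confusion[, 1:4]
sum(diag(cm)) / sum(cm)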

model_fit$predicted
##      31      79      51      14      67      42      50      43      83      25 
##   acute 1 month   acute 6 month 1 month   acute   acute   acute 6 month 1 month 
##      69      57       9      72      26       7      81      74      36      85 
## 1 month   acute healthy 1 month healthy healthy 1 month   acute 1 month 6 month 
##      15      32      70      45      71      76      41      10      23      27 
## 1 month 1 month   acute   acute 1 month healthy 1 month healthy 1 month   acute 
##      53      62      56      75      65      38      77      34      29       5 
## healthy 1 month   acute healthy   acute   acute   acute   acute   acute healthy 
##       8      12      13      18      33      54      80      84      21      66 
## healthy healthy healthy healthy 1 month 1 month 6 month 6 month healthy 1 month 
##      73      86      16      30       6      11      46      22      48      55 
## healthy healthy   acute 1 month healthy healthy 1 month   acute   acute   acute 
##      49      17      28      82      40       2       4      44 
##   acute   acute   acute 1 month   acute healthy healthy   acute 
## Levels: 1 month 6 month acute healthy
predicted_trained_DF_pca <- data.frame(predicted=model_fit$predicted, actual=training$class)
predicted_trained_DF_pca
##    predicted  actual
## 31     acute   acute
## 79   1 month 6 month
## 51     acute 1 month
## 14   6 month healthy
## 67   1 month 1 month
## 42     acute   acute
## 50     acute 1 month
## 43     acute   acute
## 83   6 month 6 month
## 25   1 month   acute
## 69   1 month 1 month
## 57     acute 1 month
## 9    healthy healthy
## 72   1 month 1 month
## 26   healthy   acute
## 7    healthy healthy
## 81   1 month 6 month
## 74     acute 1 month
## 36   1 month   acute
## 85   6 month 6 month
## 15   1 month healthy
## 32   1 month   acute
## 70     acute 1 month
## 45     acute   acute
## 71   1 month 1 month
## 76   healthy 1 month
## 41   1 month   acute
## 10   healthy healthy
## 23   1 month   acute
## 27     acute   acute
## 53   healthy 1 month
## 62   1 month 1 month
## 56     acute 1 month
## 75   healthy 1 month
## 65     acute 1 month
## 38     acute   acute
## 77     acute 6 month
## 34     acute   acute
## 29     acute   acute
## 5    healthy healthy
## 8    healthy healthy
## 12   healthy healthy
## 13   healthy healthy
## 18   healthy healthy
## 33   1 month   acute
## 54   1 month 1 month
## 80   6 month 6 month
## 84   6 month 6 month
## 21   healthy healthy
## 66   1 month 1 month
## 73   healthy 1 month
## 86   healthy 6 month
## 16     acute healthy
## 30   1 month   acute
## 6    healthy healthy
## 11   healthy healthy
## 46   1 month   acute
## 22     acute   acute
## 48     acute   acute
## 55     acute 1 month
## 49     acute   acute
## 17     acute healthy
## 28     acute   acute
## 82   1 month 6 month
## 40     acute   acute
## 2    healthy healthy
## 4    healthy healthy
## 44     acute   acute

The overall accuracy in class prediction for four classes using PCA on training data only was:

sum(predicted_trained_DF_pca$predicted==predicted_trained_DF_pca$actual)/length(predicted_trained_DF_pca$predicted)
## [1] 0.5588235

[1] 0.5588235, or about 56%, using only the top 10 principal components, which explained only about 51% of the variation in the data to begin with. Accuracy might be higher if all of the principal components were used.

Let's see how well it predicts the hold-out set, the 20% of the original data in testing. I have not yet fit the testing data through the model built on the training coordinates, because these models are built on the principal component scores rather than the actual gene features, and I am still working out that step (see the sketch after the hold-out results below). Results won't identify the best genes this way; the model just predicts the best class it can from the variation the components capture. But let's see the results anyhow.

model_fit_testing <- randomForest(testedModel[,2:11],testedModel[,1],mtry=3, ntree=500, replace=T, importance=T, keep.forest = T, norm.votes=T) 
summary(model_fit_testing)
##                 Length Class  Mode     
## call               9   -none- call     
## type               1   -none- character
## predicted         18   factor numeric  
## err.rate        2500   -none- numeric  
## confusion         20   -none- numeric  
## votes             72   matrix numeric  
## oob.times         18   -none- numeric  
## classes            4   -none- character
## importance        60   -none- numeric  
## importanceSD      50   -none- numeric  
## localImportance    0   -none- NULL     
## proximity          0   -none- NULL     
## ntree              1   -none- numeric  
## mtry               1   -none- numeric  
## forest            14   -none- list     
## y                 18   factor numeric  
## test               0   -none- NULL     
## inbag              0   -none- NULL
model_fit_testing$confusion
##         1 month 6 month acute healthy class.error
## 1 month       3       0     4       1       0.625
## 6 month       1       0     0       0       1.000
## acute         4       0     0       1       1.000
## healthy       3       0     1       0       1.000

The results were poor: 100% error for three of the four classes, with only the one month infection class doing any better at 63% error.

testingPC_df <- data.frame(predicted=model_fit_testing$predicted, actual=testedModel$Target)
testingPC_df
##    predicted  actual
## 1    1 month healthy
## 3    1 month healthy
## 19   1 month healthy
## 20     acute healthy
## 24   1 month   acute
## 35   1 month   acute
## 37   1 month   acute
## 39   1 month   acute
## 47   healthy   acute
## 52     acute 1 month
## 58   1 month 1 month
## 59   1 month 1 month
## 60     acute 1 month
## 61   healthy 1 month
## 63     acute 1 month
## 64   1 month 1 month
## 68     acute 1 month
## 78   1 month 6 month

The overall accuracy on the testing set is:

sum(testingPC_df$predicted==testingPC_df$actual)/length(testingPC_df$predicted)
## [1] 0.1666667

Overall accuracy in predicting the 4 classes was 17%.
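For what it's worth, the usual way to score a hold-out set with PCA is not to re-run prcomp() on the testing data, but to project the 18 testing samples onto the components learned from the training data and then reuse the forest already fit on the training coordinates. A sketch of that approach, not run here:

# project the testing samples into the training PCA space (columns come out named PC1-PC10)
test_coords_train_space <- predict(pc_train, newdata = testing[,-19527])

# predict their classes with the forest fit on the training coordinates
holdout_pred <- predict(model_fit, newdata = test_coords_train_space)

# hold-out accuracy against the true labels
sum(holdout_pred == testing$class) / nrow(testing)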

The whole idea of PCA is transforming the data into a new space whose axes capture the variance in decreasing order. Because it works on that transformed variance rather than on the genes themselves, it cannot point to the actual genes that are the best predictors of a target gene, but it can be used to classify samples, as an alternative to classifying on specific genes whose abilities and functions contribute to the pathologies they may be affecting, and which would be the target genes for gene therapy. This would be more like a test of whether a sample is infected and how far along the infection is. Better results might come from using all of the principal components, which explain all of the information. PCA is also beneficial for saving time and reducing the dimensionality of the space, but keep in mind that those new dimensions are not the original feature space, only a rotated summary of its variation. Great work in understanding PCA.