This is an extension of the earlier publication on using randomForest on the Lyme disease gene expression data, the wide data frame too large to fit into Google Sheets. In this project we test the base R function for Principal Component Analysis, or PCA.
We are trying out the randomForest package without the caret package to see how it does on the Lyme disease data we used earlier, which had an imbalanced number of classes. You can read about the randomForest package at this link.
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
*** January 4, 2026: PCA with the base R function prcomp() ***
PCA is a way to take that fat or wide 86 x 19,527 data frame, too large to upload to Google Sheets, and summarize it using the standard deviation squared, or variance, of each gene together with the covariance between genes. It takes the cloud of points the samples form across all the genes and finds a best-fit line through it, like regression except that it minimizes the perpendicular distance to the points rather than the vertical error. That line is the first principal component, and it explains the most variation in the data; PCA then repeats this with more components, each perpendicular to the ones before, until all of the variation in the data is accounted for. Generally the top 5 principal components explain a good share of the information, maybe 40-50%, but it depends on the data. We will use the very wide dataset, either by loading it into R again if you cleared your environment, or by just calling the data frame if it is still there. We previously named it Big20k_df; we will keep that name rather than calling it a fat data frame, for clarity.
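Before the real data, here is a tiny made-up example of the idea (the g1/g2 names are purely for illustration, and the output is not shown): two variables that move together get one dominant component.
set.seed(1)
x <- rnorm(100)
toy <- data.frame(g1 = x, g2 = x + rnorm(100, sd = 0.3))  # two correlated made-up "genes"
toy_pc <- prcomp(toy, center = TRUE, scale. = TRUE)
# PC1 is the best-fit direction through the point cloud; its share of the
# variance (sdev^2) dominates because g1 and g2 move together
toy_pc$sdev^2 / sum(toy_pc$sdev^2)
Now on to the real data frame.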
# Read the wide data frame; treat blanks and 'na'/'NA' strings as missing
fatData <- read.csv('fat20k_Lyme_DF.csv', header = TRUE, na.strings = c('', ' ', 'na', 'NA'))
dim(fatData)
## [1] 86 19527
Big20k_df <- fatData
rm(fatData)
This is our data to work with, now renamed Big20k_df. Let's go ahead and compute the principal components of its correlation matrix (correlation rather than covariance, since we scale the genes). prcomp() is really a very simple function with few parameters, but if you have questions you can query the Help tab in RStudio on prcomp(). Remember to remove the class factor feature, which is the last column.
# PCA on the scaled genes, dropping the class column (19527)
pc <- prcomp(Big20k_df[,-19527], center = TRUE, scale. = TRUE)
summary(pc)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 57.3023 45.2799 32.94151 25.61608 24.5226 24.19360
## Proportion of Variance 0.1682 0.1050 0.05557 0.03361 0.0308 0.02998
## Cumulative Proportion 0.1682 0.2732 0.32874 0.36235 0.3931 0.42312
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 23.0032 20.12046 18.63438 17.5635 16.85407 16.72420
## Proportion of Variance 0.0271 0.02073 0.01778 0.0158 0.01455 0.01432
## Cumulative Proportion 0.4502 0.47095 0.48874 0.5045 0.51908 0.53341
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 15.73008 15.71459 15.46974 14.80272 14.73671 14.46296
## Proportion of Variance 0.01267 0.01265 0.01226 0.01122 0.01112 0.01071
## Cumulative Proportion 0.54608 0.55873 0.57098 0.58220 0.59333 0.60404
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 14.31068 13.96785 13.68510 13.62904 13.42979 13.37657
## Proportion of Variance 0.01049 0.00999 0.00959 0.00951 0.00924 0.00916
## Cumulative Proportion 0.61453 0.62452 0.63411 0.64362 0.65286 0.66202
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 13.20408 13.01388 12.82795 12.60898 12.44282 12.4238
## Proportion of Variance 0.00893 0.00867 0.00843 0.00814 0.00793 0.0079
## Cumulative Proportion 0.67095 0.67963 0.68805 0.69620 0.70413 0.7120
## PC31 PC32 PC33 PC34 PC35 PC36
## Standard deviation 12.2633 12.16672 11.98442 11.91197 11.83482 11.6943
## Proportion of Variance 0.0077 0.00758 0.00736 0.00727 0.00717 0.0070
## Cumulative Proportion 0.7197 0.72731 0.73467 0.74194 0.74911 0.7561
## PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 11.62048 11.46630 11.38428 11.35928 11.20916 11.15546
## Proportion of Variance 0.00692 0.00673 0.00664 0.00661 0.00643 0.00637
## Cumulative Proportion 0.76303 0.76976 0.77640 0.78301 0.78944 0.79582
## PC43 PC44 PC45 PC46 PC47 PC48
## Standard deviation 11.02531 10.97348 10.89760 10.85652 10.8278 10.75487
## Proportion of Variance 0.00623 0.00617 0.00608 0.00604 0.0060 0.00592
## Cumulative Proportion 0.80204 0.80821 0.81429 0.82033 0.8263 0.83225
## PC49 PC50 PC51 PC52 PC53 PC54
## Standard deviation 10.66966 10.53594 10.50910 10.47575 10.4570 10.34719
## Proportion of Variance 0.00583 0.00569 0.00566 0.00562 0.0056 0.00548
## Cumulative Proportion 0.83809 0.84377 0.84943 0.85505 0.8607 0.86613
## PC55 PC56 PC57 PC58 PC59 PC60
## Standard deviation 10.24568 10.13388 10.08263 10.00480 9.93180 9.86660
## Proportion of Variance 0.00538 0.00526 0.00521 0.00513 0.00505 0.00499
## Cumulative Proportion 0.87151 0.87677 0.88197 0.88710 0.89215 0.89714
## PC61 PC62 PC63 PC64 PC65 PC66 PC67
## Standard deviation 9.81009 9.76697 9.63333 9.5749 9.51237 9.49472 9.41028
## Proportion of Variance 0.00493 0.00489 0.00475 0.0047 0.00463 0.00462 0.00454
## Cumulative Proportion 0.90206 0.90695 0.91170 0.9164 0.92103 0.92565 0.93018
## PC68 PC69 PC70 PC71 PC72 PC73 PC74
## Standard deviation 9.35807 9.29721 9.19491 9.04679 9.01475 8.96950 8.93143
## Proportion of Variance 0.00448 0.00443 0.00433 0.00419 0.00416 0.00412 0.00409
## Cumulative Proportion 0.93467 0.93910 0.94343 0.94762 0.95178 0.95590 0.95998
## PC75 PC76 PC77 PC78 PC79 PC80 PC81
## Standard deviation 8.89740 8.81122 8.77294 8.59722 8.46056 8.45647 8.35300
## Proportion of Variance 0.00405 0.00398 0.00394 0.00379 0.00367 0.00366 0.00357
## Cumulative Proportion 0.96404 0.96801 0.97196 0.97574 0.97941 0.98307 0.98664
## PC82 PC83 PC84 PC85 PC86
## Standard deviation 8.23462 8.18085 8.06189 7.81487 5.433e-14
## Proportion of Variance 0.00347 0.00343 0.00333 0.00313 0.000e+00
## Cumulative Proportion 0.99012 0.99354 0.99687 1.00000 1.000e+00
The first 5 principal components explain 16.8, 10.5, 5.6, 3.4, and 3.1% of the information, which is 39.3% of the data. We can use these first five, select more, use only the top one, or use all of them.
It is recommended to keep as many components as it takes to explain at least 80% of the variance, or those having eigenvalues > 1 (the Kaiser rule); both checks are sketched in code after the note below.
Note: (don’t read if math confuses you; it is OK to skip it). Eigenvalues come from the linear algebra PCA was derived from for data science: if a linear transformation T only scales an eigenvector v by some factor lambda (l), so that Tv = lv, then l is an eigenvalue of T. For a general matrix, l can be a complex number involving sqrt(-1), an imaginary value that only lands back on a real number when paired with another, since sqrt(-1) x sqrt(-1) = -1. The covariance matrix PCA decomposes is symmetric, though, so its eigenvalues are always real and non-negative, and each one equals the variance explained by its component. You don’t really need to know the math exactly to understand how it works. A loose analogy is how radiographs work: the x-ray flattens a 3D body onto a 2D image, and a reader with specialized training can still work out what is anterior, posterior, lateral, or medial from the position or view; PCA likewise projects high-dimensional data onto a few axes while keeping as much recoverable information as possible. (End of math note.)
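Both rules are easy to check from the prcomp() object we already have, since the eigenvalues are just the squared standard deviations (output not shown):
# Eigenvalues of the scaled (correlation) matrix are the squared sdevs
eig <- pc$sdev^2
# Kaiser rule: how many components have eigenvalue > 1
sum(eig > 1)
# 80% rule: smallest number of components reaching 80% cumulative variance
cum_var <- cumsum(eig) / sum(eig)
which(cum_var >= 0.80)[1]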
PCA is used to shrink the number of dimensions, typically on gene expression data with small numbers of samples. Being unsupervised, it also applies where there are no labels at all, for example condensing the many frequencies in audio before predicting the class of a song, or in gene expression analysis with the high-count genes as observations and the condition applied to each gene (temperature, pH change, etc.) as the features on the columns.
We will try the tol and rank. parameters, which the help page for prcomp() says can be used together: rank. caps how many top-ranking principal components are kept, and tol omits any component whose standard deviation falls at or below tol times the standard deviation of the first component.
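As a minimal sketch of the two together (the 0.05 cutoff is arbitrary and just for illustration; output not shown):
# Keep at most 10 components, and drop any whose standard deviation is
# at or below 5% of the first component's standard deviation
pc_trimmed <- prcomp(Big20k_df[,-19527], rank. = 10, tol = 0.05, center = TRUE, scale. = TRUE)
summary(pc_trimmed)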
Let's make our training and testing sets first, to build the model on a training set of 80% of the samples and then test it on the remaining 20%.
set.seed(123)  # reproducible split
intrain <- sample(1:86, floor(0.8 * 86))  # 68 of the 86 samples for training
training <- Big20k_df[intrain,]
testing <- Big20k_df[-intrain,]
Doing this keeps the model from overfitting and avoids data leakage: building a model on all the data you have and then testing it on that same data does not represent real-world populations of samples, and accuracy on genuinely new samples will be much lower than reported. The tol parameter set to 1 gave an error, so it was removed (likely because every component's standard deviation is at or below 1 times the first component's, so nothing would be kept).
# Training-set PCA, keeping only the top 10 components
pc_train <- prcomp(training[,-19527], rank. = 10, center = TRUE, scale. = TRUE)
summary(pc_train)
## Importance of first k=10 (out of 68) components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 56.3091 45.9352 32.17554 26.54574 25.20717 24.74766
## Proportion of Variance 0.1624 0.1081 0.05302 0.03609 0.03254 0.03137
## Cumulative Proportion 0.1624 0.2705 0.32347 0.35956 0.39210 0.42346
## PC7 PC8 PC9 PC10
## Standard deviation 22.64715 20.90406 19.70915 18.80862
## Proportion of Variance 0.02627 0.02238 0.01989 0.01812
## Cumulative Proportion 0.44973 0.47211 0.49200 0.51012
It used 10 principal components, set by the rank. parameter. Together they explain about 51% of the variance in the training data.
plot(pc_train)
Each bar above shows the variance of one principal component, in descending order from left to right, from the largest of the top 10 to the smallest.
# A separate PCA on the testing set (see the note after the testing results below)
pc_test <- prcomp(testing[,-19527], rank. = 10, center = TRUE, scale. = TRUE)
summary(pc_test)
## Importance of first k=10 (out of 18) components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 72.3680 50.4673 44.06333 36.00175 33.83106 30.75806
## Proportion of Variance 0.2682 0.1304 0.09944 0.06638 0.05862 0.04845
## Cumulative Proportion 0.2682 0.3987 0.49809 0.56447 0.62308 0.67153
## PC7 PC8 PC9 PC10
## Standard deviation 29.25880 26.74365 25.52624 25.22356
## Proportion of Variance 0.04384 0.03663 0.03337 0.03258
## Cumulative Proportion 0.71538 0.75201 0.78538 0.81796
plot(pc_test)
# Score each set with its own PCA
train_pca_coords <- predict(pc_train, newdata = training[,-19527])
test_pca_coords <- predict(pc_test, newdata = testing[,-19527])
# Attach the class labels and convert them to factors for randomForest
trainedModel <- data.frame(Target = training$class, train_pca_coords)
testedModel <- data.frame(Target = testing$class, test_pca_coords)
trainedModel$Target <- as.factor(trainedModel$Target)
testedModel$Target <- as.factor(testedModel$Target)
# Random forest on the 10 component scores (columns 2:11), class as the target
model_fit <- randomForest(trainedModel[,2:11], trainedModel[,1], mtry = 3, ntree = 500, replace = TRUE, importance = TRUE, keep.forest = TRUE, norm.votes = TRUE)
summary(model_fit)
## Length Class Mode
## call 9 -none- call
## type 1 -none- character
## predicted 68 factor numeric
## err.rate 2500 -none- numeric
## confusion 20 -none- numeric
## votes 272 matrix numeric
## oob.times 68 -none- numeric
## classes 4 -none- character
## importance 60 -none- numeric
## importanceSD 50 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 68 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
model_fit$confusion
## 1 month 6 month acute healthy class.error
## 1 month 7 0 8 4 0.6315789
## 6 month 3 4 1 1 0.5555556
## acute 8 0 14 1 0.3913043
## healthy 1 1 2 13 0.2352941
The class.error column gives the out-of-bag error rate, not accuracy: 63% for one month infected, 56% for six months infected, 39% for acute infection, and 24% for healthy, so the healthy class was predicted best.
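Per-class accuracy is just the complement of those errors (output not shown):
# class.error is the out-of-bag error rate per class; accuracy is its complement
round(1 - model_fit$confusion[, "class.error"], 2)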
model_fit$predicted
## 31 79 51 14 67 42 50 43 83 25
## acute 1 month acute 6 month 1 month acute acute acute 6 month 1 month
## 69 57 9 72 26 7 81 74 36 85
## 1 month acute healthy 1 month healthy healthy 1 month acute 1 month 6 month
## 15 32 70 45 71 76 41 10 23 27
## 1 month 1 month acute acute 1 month healthy 1 month healthy 1 month acute
## 53 62 56 75 65 38 77 34 29 5
## healthy 1 month acute healthy acute acute acute acute acute healthy
## 8 12 13 18 33 54 80 84 21 66
## healthy healthy healthy healthy 1 month 1 month 6 month 6 month healthy 1 month
## 73 86 16 30 6 11 46 22 48 55
## healthy healthy acute 1 month healthy healthy 1 month acute acute acute
## 49 17 28 82 40 2 4 44
## acute acute acute 1 month acute healthy healthy acute
## Levels: 1 month 6 month acute healthy
predicted_trained_DF_pca <- data.frame(predicted=model_fit$predicted, actual=training$class)
predicted_trained_DF_pca
## predicted actual
## 31 acute acute
## 79 1 month 6 month
## 51 acute 1 month
## 14 6 month healthy
## 67 1 month 1 month
## 42 acute acute
## 50 acute 1 month
## 43 acute acute
## 83 6 month 6 month
## 25 1 month acute
## 69 1 month 1 month
## 57 acute 1 month
## 9 healthy healthy
## 72 1 month 1 month
## 26 healthy acute
## 7 healthy healthy
## 81 1 month 6 month
## 74 acute 1 month
## 36 1 month acute
## 85 6 month 6 month
## 15 1 month healthy
## 32 1 month acute
## 70 acute 1 month
## 45 acute acute
## 71 1 month 1 month
## 76 healthy 1 month
## 41 1 month acute
## 10 healthy healthy
## 23 1 month acute
## 27 acute acute
## 53 healthy 1 month
## 62 1 month 1 month
## 56 acute 1 month
## 75 healthy 1 month
## 65 acute 1 month
## 38 acute acute
## 77 acute 6 month
## 34 acute acute
## 29 acute acute
## 5 healthy healthy
## 8 healthy healthy
## 12 healthy healthy
## 13 healthy healthy
## 18 healthy healthy
## 33 1 month acute
## 54 1 month 1 month
## 80 6 month 6 month
## 84 6 month 6 month
## 21 healthy healthy
## 66 1 month 1 month
## 73 healthy 1 month
## 86 healthy 6 month
## 16 acute healthy
## 30 1 month acute
## 6 healthy healthy
## 11 healthy healthy
## 46 1 month acute
## 22 acute acute
## 48 acute acute
## 55 acute 1 month
## 49 acute acute
## 17 acute healthy
## 28 acute acute
## 82 1 month 6 month
## 40 acute acute
## 2 healthy healthy
## 4 healthy healthy
## 44 acute acute
The overall accuracy in class prediction for the four classes, using the out-of-bag predictions on the training data, was:
sum(predicted_trained_DF_pca$predicted==predicted_trained_DF_pca$actual)/length(predicted_trained_DF_pca$predicted)
## [1] 0.5588235
That is about 56% accuracy from only the top 10 principal components, which explained only about half of the variation in the data to begin with. Accuracy might be higher if all the principal components were used.
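A quick way to check that (a sketch, not run here, so no results are shown):
# Recompute the training PCA with no rank limit, keeping all 68 components
pc_train_full <- prcomp(training[,-19527], center = TRUE, scale. = TRUE)
# Grow a forest on all of the component scores instead of just the top 10
rf_all <- randomForest(data.frame(pc_train_full$x), as.factor(training$class), mtry = 3, ntree = 500)
rf_all$confusion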
Let's see how it does on the hold-out set, the 20% of the original data kept in testing. I cannot fit the trained model to the testing data as labeled, and I am still figuring out how to do that, because the model was built on principal component scores rather than on the actual gene features; a sketch of the usual remedy follows the testing results below. Built this way, the results won't point to the best genes; the model just predicts the best class it can from the variance structure of the data. But let's see the results anyhow, giving the testing set its own PCA and its own forest.
# Separate forest fit on the testing set's own PCA scores
model_fit_testing <- randomForest(testedModel[,2:11], testedModel[,1], mtry = 3, ntree = 500, replace = TRUE, importance = TRUE, keep.forest = TRUE, norm.votes = TRUE)
summary(model_fit_testing)
## Length Class Mode
## call 9 -none- call
## type 1 -none- character
## predicted 18 factor numeric
## err.rate 2500 -none- numeric
## confusion 20 -none- numeric
## votes 72 matrix numeric
## oob.times 18 -none- numeric
## classes 4 -none- character
## importance 60 -none- numeric
## importanceSD 50 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 18 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
model_fit_testing$confusion
## 1 month 6 month acute healthy class.error
## 1 month 3 0 4 1 0.625
## 6 month 1 0 0 0 1.000
## acute 4 0 0 1 1.000
## healthy 3 0 1 0 1.000
The results were poor: three of the four classes had 100% error, and the one month infection class had 62.5% error. With only 18 testing samples, some classes had almost no examples (the six month class had just one).
testingPC_df <- data.frame(predicted=model_fit_testing$predicted, actual=testedModel$Target)
testingPC_df
## predicted actual
## 1 1 month healthy
## 3 1 month healthy
## 19 1 month healthy
## 20 acute healthy
## 24 1 month acute
## 35 1 month acute
## 37 1 month acute
## 39 1 month acute
## 47 healthy acute
## 52 acute 1 month
## 58 1 month 1 month
## 59 1 month 1 month
## 60 acute 1 month
## 61 healthy 1 month
## 63 acute 1 month
## 64 1 month 1 month
## 68 acute 1 month
## 78 1 month 6 month
The resulting overall accuracy is:
sum(testingPC_df$predicted==testingPC_df$actual)/length(testingPC_df$predicted)
## [1] 0.1666667
Overall accuracy in predicting the 4 classes was only about 17%, worse than the roughly 25% expected from guessing at random among four classes.
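The usual remedy for the labeling problem noted above is to reuse the training PCA to project the held-out samples, rather than fitting a separate PCA and forest on the testing set. A minimal sketch with the pc_train and model_fit objects built earlier (not run here, so no results are shown):
# Project the held-out samples onto the *training* principal components,
# then classify them with the forest already fit on the training scores
test_in_train_space <- predict(pc_train, newdata = testing[,-19527])
holdout_pred <- predict(model_fit, newdata = data.frame(test_in_train_space))
mean(holdout_pred == testing$class)  # hold-out accuracy against the true labels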
PCA is, at heart, an eigen-decomposition that transforms the space the data sit in. Because the components are built from the overall variance in the data rather than from individual genes, a classifier trained on component scores cannot point back to the specific genes that best predict a target, so it is not the way to find target genes for gene therapy by the functions and contributions each gene makes to a pathology. It can, however, classify samples as an alternate method, more like a test of whether a sample is infected and how far along it is. Better results could likely be had with more principal components explaining more of the information. PCA is also beneficial for saving time and reducing the dimensionality of the space, with the caveat that those new dimensions are combinations of the original features rather than the features themselves. Great work in understanding PCA.
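That said, if you do want gene-level hints from a PCA later, prcomp() stores each gene's loading on each component in the rotation matrix; the genes with the largest absolute loadings contribute the most to that component. A small sketch using pc_train (output not shown):
# Each row of rotation is a gene, each column a component; the genes with
# the largest absolute loadings on PC1 drive that component the most
head(sort(abs(pc_train$rotation[, "PC1"]), decreasing = TRUE), 10)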