This is a Principal Component Analysis, drawing heavily from mixOmics’ Case Studies:
In the liver toxicity case study, 64 observations of 3116 gene expression values were used in an effort to discover which genes accounted for the variance seen in clinical variables, and could thus be used to predict the values of those clinical variables.
Unfortunately, a cursory glance at data sets from either Scientific Data or Bioconductor did not have genomic and associated phenotypic data, so we will use the liver.toxicity data set that comes with mixOmics, because it has excellent characteristics for what we would like to do.
liver.toxicity$gene data as predictors, and live.toxicity$clinic data as outcomes.nearZeroVar to gene data to drop genes unlikely to have sufficient variance to inform outcomes.if (!require("mixOmics")) install.packages("mixOmics")
#Note: rgl installation failed-- installed XQuartz for X11
#https://www.xquartz.org/
rmse() will accept the actual and predicted numeric values, and will return the root mean squared error.
rmse <- function(actual, predicted){
return(sqrt(mean((actual-predicted)^2)))
}
metrics() will iterate over the variables in the predicted data frame to generate root mean squared error stats for each variable.
metrics <- function(actual, predicted){
#Hold the data to return
res_frame <- data.frame()
#Create vector for holding values
res_vec <- vector()
for (i in 1:ncol(actual)){
res_vec[length(res_vec)+1] <- rmse(actual[,i], predicted[,i])
}
#Bind
res_frame <- rbind(res_frame, res_vec)
colnames(res_frame) <- colnames(actual)
return(res_frame)
}
bootstrap_pls() will randomly sample 50% of observations from the predictors and generate clinical predictions, utilizing pls, from the model n times utilizing n_comp number of components and component number which_comp in the string form of "dim comp_num".
bootstrap_pls <- function(x, y, n, n_comp, which_comp){
#Frame for all predictor data
x_frame <- data.frame()
#Frame for all clinical data
y_frame <- data.frame()
#Frame for predictions
pred_frame <- data.frame()
#list() for results
res <- list()
for (i in 1:n){
#Sample from all columns
tmp_x <- x[sample(nrow(x), nrow(x)/2), ]
#rbind predictors (mostly a waste of memory)
#x_frame <- rbind(x_frame, cbind(row.names(tmp_x), tmp_x))
#Pull clinical data for samples
tmp_y <- y[which(row.names(y) %in% row.names(tmp_x)),]
#rbind clinical data (mostly a waste of memory)
#y_frame <- rbind(y_frame, cbind(row.names(tmp_y), tmp_y))
#perform regression
clin_pls <- pls(tmp_x, tmp_y, ncomp = n_comp, mode = "regression")
#Predict for ALL values of x
pls_pred <- as.data.frame(predict(clin_pls, x)$predict)
#Subset predictions from component with which_comp
pls_pred <- pls_pred[ ,grepl(which_comp, colnames(pls_pred))]
#Assign proper column names
colnames(pls_pred) <- colnames(y)
pred_frame <- rbind(pred_frame, cbind(row.names(pls_pred), pls_pred))
}
#Create results object to return
#res[["x_frame"]] <- x_frame
#res[["y_frame"]] <- y_frame
## For now, only predictions-- waste of memory
res[["pred_frame"]] <- pred_frame
return(res)
}
data("liver.toxicity")
#Structure of liver.toxicity, but not gene because it's >3k variables
str(liver.toxicity$clinic)
## 'data.frame': 64 obs. of 10 variables:
## $ BUN.mg.dL. : num 12 15 17 19 20 17 16 14 12 11 ...
## $ Creat.mg.dL. : num 0.6 0.7 0.8 0.7 0.7 0.7 0.7 0.6 0.7 0.7 ...
## $ TP.g.dL. : num 7.3 7.5 7.6 7.5 7.8 7.7 7.6 7.4 7.5 7.3 ...
## $ ALB.g.dL. : num 4.9 5 5 5 5.1 5.1 5.1 4.9 5.1 4.9 ...
## $ ALT.IU.L. : int 73 84 54 369 59 57 67 57 54 97 ...
## $ SDH.IU.L. : int 23 23 14 81 16 10 21 19 20 26 ...
## $ AST.IU.L. : int 87 110 129 450 75 80 93 90 76 135 ...
## $ ALP.IU.L. : int 320 325 381 401 359 335 324 309 317 251 ...
## $ TBA.umol.L. : int 9 8 4 13 3 13 8 11 6 7 ...
## $ Cholesterol.mg.dL.: int 85 87 84 82 85 88 82 92 86 77 ...
str(liver.toxicity$treatment)
## 'data.frame': 64 obs. of 4 variables:
## $ Animal.Number : int 202 203 204 206 208 209 210 211 212 213 ...
## $ Treatment.Group: Factor w/ 16 levels "1500MG/K18 HR",..: 16 16 13 13 14 14 15 15 15 16 ...
## $ Dose.Group : num 50 50 50 50 50 50 50 50 50 50 ...
## $ Time.Group : int 6 6 18 18 24 24 48 48 48 6 ...
In this instance, we will be using genomic data to predict clinical phenotypes, so our predictors, x will be analyzed alongside our clinical outcomes, y.
x <- liver.toxicity$gene
y <- liver.toxicity$clinic
head(row.names(x))
## [1] "ID202" "ID203" "ID204" "ID206" "ID208" "ID209"
head(row.names(y))
## [1] "202" "203" "204" "206" "208" "209"
#Need to correct format so IDs are uniform
row.names(y) <- paste0("ID",row.names(y))
Because x contains more than 3,000 variables, I am only interested in ensuring that the data are complete. Otherwise, I will summarize y.
all(complete.cases(x))
## [1] TRUE
summary(y)
## BUN.mg.dL. Creat.mg.dL. TP.g.dL. ALB.g.dL.
## Min. :11.00 Min. :0.6000 Min. :6.300 Min. :4.400
## 1st Qu.:14.00 1st Qu.:0.6750 1st Qu.:7.300 1st Qu.:4.900
## Median :16.50 Median :0.7000 Median :7.500 Median :5.000
## Mean :17.04 Mean :0.6877 Mean :7.475 Mean :5.014
## 3rd Qu.:20.00 3rd Qu.:0.7000 3rd Qu.:7.725 3rd Qu.:5.200
## Max. :28.00 Max. :0.8000 Max. :8.400 Max. :5.600
## ALT.IU.L. SDH.IU.L. AST.IU.L. ALP.IU.L.
## Min. : 48.0 Min. : 2.0 Min. : 75.0 Min. :141.0
## 1st Qu.: 63.0 1st Qu.: 15.0 1st Qu.: 89.0 1st Qu.:303.0
## Median : 94.5 Median : 22.5 Median : 151.5 Median :332.5
## Mean : 1677.8 Mean : 79.8 Mean : 3375.3 Mean :338.2
## 3rd Qu.: 1137.0 3rd Qu.: 61.0 3rd Qu.: 1502.8 3rd Qu.:380.2
## Max. :15180.0 Max. :995.0 Max. :27075.0 Max. :500.0
## TBA.umol.L. Cholesterol.mg.dL.
## Min. : 3.00 Min. : 42.00
## 1st Qu.: 5.00 1st Qu.: 78.00
## Median : 9.00 Median : 84.50
## Mean :20.27 Mean : 82.62
## 3rd Qu.:26.00 3rd Qu.: 91.25
## Max. :98.00 Max. :111.00
We will run a PCA on both the predictors (the genes) and outcomes (the clinical values). The data will be normalized using centered and scaled. For the number of components at the outset, we’ll use 10.
x_pca <- pca(x, ncomp = 10, center = TRUE, scale = TRUE)
x_pca$explained_variance
## PC1 PC2 PC3 PC4 PC5 PC6
## 0.28073776 0.14862034 0.07963801 0.05377996 0.04555745 0.03925671
## PC7 PC8 PC9 PC10
## 0.03490533 0.02495056 0.02380736 0.01811799
y_pca <- pca(y, ncomp = 10, center = TRUE, scale = TRUE)
y_pca$explained_variance
## PC1 PC2 PC3 PC4 PC5
## 0.4083260500 0.2219951947 0.1247262301 0.0894716350 0.0675342967
## PC6 PC7 PC8 PC9 PC10
## 0.0483596842 0.0261923034 0.0089045216 0.0036440620 0.0008460224
In the case of the raw predictors, the majority of variance is explained by the first three principal components, while the variance in outcomes is explained by its first two principal components.
plot(x_pca)
plotVar(x_pca, comp = c(1, 2), var.names = FALSE, title = "Genomic Values, PCA comp 1 - 2")
plot(y_pca)
plotVar(y_pca, comp = c(1, 2), var.names = TRUE, title = "Clincal Values, PCA comp 1- 2")
plotIndiv(x_pca, comp = c(1, 2), group = liver.toxicity$treatment$Time.Group, ellipse = TRUE,
legend = TRUE, title = "Genomic Values, PCA comp 1 - 2")
plotIndiv(y_pca, comp = c(1, 2), group = liver.toxicity$treatment$Time.Group, ind.names = liver.toxicity$treatment$Dose.Group, ellipse = TRUE,
legend = TRUE, title = "Clinical Values, PCA comp 1 - 2")
Apply nearZeroVar, conveniently taken from caret to the genomic data.
zer_var <- nearZeroVar(x)
print(zer_var)
## $Position
## integer(0)
##
## $Metrics
## [1] freqRatio percentUnique
## <0 rows> (or 0-length row.names)
No predictors qualify for removal.
Initially, we will use three components (as determined from the PCA).
clin_pls <- pls(x, y, ncomp = 3, mode = "regression")
## Note: Results are in terms of unit variance-- the further away from the center, the stronger the association
plotVar(clin_pls, overlap = F)
plotIndiv(clin_pls, group = liver.toxicity$treatment$Time.Group, ind.names = liver.toxicity$treatment$Dose.Group, legend = T, ellipse = T)
Note: CIMs are not rendering to markdown.
##Flush graphics device
#graphics.off()
#par(mar=c(1,1,1,1))
#cim(clin_pls, comp = 1:3, xlab = "clinic", ylab = "genes", margins = c(10, 10), title = "Clinical PLS Clustered Image Map")
perf() will be used to compute evaluation critera for the PLS model. For cross-validation, Mfold will be used rather than leave-one-out. nrepeat will be set to 50. We will set the progressBar to false, because the calculations aren’t large enough to warrant its use.
tune_pls <- perf(clin_pls, validation = "Mfold", folds = 10, nrepeat = 50, progressBar = FALSE)
plot(tune_pls$Q2.total, ylim = c(-0.5, 0.5),
main = "PLS Q^2 Values by Principal Component",
xlab = "Principal Component",
ylab = "Q^2 Value")
abline(h = 0.0975)
Note: Tenenhaus, M. (1998) suggests using a cuttoff of 0.0975 for Q2 values, where any component with values lower than the cutoff are not included in the model.
#Note: to grab component names: attr(tune_pls$Q2.total, "dimnames")[[1]]
print(tune_pls$R2)
## 1 comp 2 comp 3 comp
## BUN.mg.dL. 0.5072290840 5.486573e-01 0.53246529
## Creat.mg.dL. 0.0147531561 7.767710e-03 0.01255874
## TP.g.dL. 0.0164111731 6.372609e-02 0.24417729
## ALB.g.dL. 0.0001900945 1.732280e-01 0.39641077
## ALT.IU.L. 0.6117601848 7.278652e-01 0.72898354
## SDH.IU.L. 0.0128129810 1.801484e-06 0.04153585
## AST.IU.L. 0.5649265536 6.951631e-01 0.70516857
## ALP.IU.L. 0.1100127222 9.985983e-02 0.08351555
## TBA.umol.L. 0.5675273723 5.534675e-01 0.68324433
## Cholesterol.mg.dL. 0.2070860171 3.198604e-01 0.30729239
### PLS
#I only want the predictions, and in data frame format
pls_pred <- as.data.frame(predict(clin_pls, x)$predict)
pls_pred_one <- pls_pred[, grepl("dim 1", colnames(pls_pred))]
pls_pred_two <- pls_pred[, grepl("dim 2", colnames(pls_pred))]
performance <- rbind(metrics(y, pls_pred_one),
metrics(y, pls_pred_two))
row.names(performance) <- c("PLS Comp One",
"PLS Comp Two")
cat("Performance (RMSE):\n")
## Performance (RMSE):
print(performance)
## BUN.mg.dL. Creat.mg.dL. TP.g.dL. ALB.g.dL. ALT.IU.L.
## PLS Comp One 2.880435 0.05771321 0.3870113 0.2545776 1924.443
## PLS Comp Two 2.730955 0.05759611 0.3588442 0.2188154 1551.441
## SDH.IU.L. AST.IU.L. ALP.IU.L. TBA.umol.L. Cholesterol.mg.dL.
## PLS Comp One 155.9711 4309.313 53.73584 14.44731 11.89491
## PLS Comp Two 153.6316 3492.627 53.65465 14.43709 10.43855
Since we have so few observations (64), bootstrapping may be more appropriate for evaluating performance. We will sample 500 times using 3 principal components.
We will assess the results of the first principal component.
system.time(res <- bootstrap_pls(x, y, 500, 3, "dim 1"))
## user system elapsed
## 57.321 3.544 61.667
str(res)
## List of 1
## $ pred_frame:'data.frame': 32000 obs. of 11 variables:
## ..$ row.names(pls_pred): Factor w/ 64 levels "ID202","ID203",..: 1 2 3 4 5 6 7 8 9 10 ...
## ..$ BUN.mg.dL. : num [1:32000] 19 18.8 15.7 15.8 18.1 ...
## ..$ Creat.mg.dL. : num [1:32000] 0.723 0.719 0.666 0.669 0.707 ...
## ..$ TP.g.dL. : num [1:32000] 7.56 7.54 7.29 7.31 7.49 ...
## ..$ ALB.g.dL. : num [1:32000] 5.09 5.07 4.87 4.88 5.02 ...
## ..$ ALT.IU.L. : num [1:32000] 1482 1463 1203 1217 1404 ...
## ..$ SDH.IU.L. : num [1:32000] 114 114 116 116 115 ...
## ..$ AST.IU.L. : num [1:32000] 3037 3011 2674 2693 2935 ...
## ..$ ALP.IU.L. : num [1:32000] 371 368 327 329 358 ...
## ..$ TBA.umol.L. : num [1:32000] 19 19.3 23.9 23.6 20.3 ...
## ..$ Cholesterol.mg.dL. : num [1:32000] 80.1 80.4 85.1 84.9 81.5 ...
Let’s subset a random ID to take a look at some of the results
tmp <- as.character(res$pred_frame[sample(nrow(res$pred_frame), 1), ]$`row.names(pls_pred)`)
tmp <- res$pred_frame[(res$pred_frame$`row.names(pls_pred)` == tmp),]
hist(tmp$BUN.mg.dL.)
hist(tmp$Cholesterol.mg.dL.)
rm(tmp)
Looks normal.
## Subset predictions
res <- res$pred_frame
## Rename ID column, convert to character
colnames(res)[1] <- "ID"
res$ID <- as.character(res$ID)
## Aggregate means
clin_res_boot <- aggregate(res, by = list(res$ID), FUN = mean)
## Clean
row.names(clin_res_boot) <- clin_res_boot$Group.1
clin_res_boot <- clin_res_boot[,3:ncol(clin_res_boot)]
## Performance
cat("Root Mean Square Errors:")
## Root Mean Square Errors:
metrics(y, clin_res_boot)
## BUN.mg.dL. Creat.mg.dL. TP.g.dL. ALB.g.dL. ALT.IU.L. SDH.IU.L.
## 1 4.268934 0.05988033 0.3874173 0.2581309 3361.192 159.9192
## AST.IU.L. ALP.IU.L. TBA.umol.L. Cholesterol.mg.dL.
## 1 7078.354 60.23563 23.0043 14.01628
## Actual vs Predicted Values
cat("Predicted:\n")
## Predicted:
print(head(round(clin_res_boot, 2)))
## BUN.mg.dL. Creat.mg.dL. TP.g.dL. ALB.g.dL. ALT.IU.L. SDH.IU.L.
## ID202 16.95 0.69 7.47 5.01 1556.34 81.58
## ID203 16.98 0.69 7.47 5.01 1589.40 81.64
## ID204 17.02 0.69 7.47 5.01 1631.84 78.78
## ID206 16.98 0.69 7.47 5.01 1589.43 78.25
## ID208 16.93 0.69 7.47 5.01 1539.99 80.44
## ID209 16.99 0.69 7.48 5.01 1570.10 81.93
## AST.IU.L. ALP.IU.L. TBA.umol.L. Cholesterol.mg.dL.
## ID202 3126.74 337.81 19.55 82.91
## ID203 3196.23 338.12 19.73 82.82
## ID204 3260.03 337.87 19.76 82.71
## ID206 3170.02 337.64 19.44 82.82
## ID208 3083.86 337.52 19.32 82.95
## ID209 3160.15 338.12 19.65 82.86
cat("Actual:\n")
## Actual:
print(head(y))
## BUN.mg.dL. Creat.mg.dL. TP.g.dL. ALB.g.dL. ALT.IU.L. SDH.IU.L.
## ID202 12 0.6 7.3 4.9 73 23
## ID203 15 0.7 7.5 5.0 84 23
## ID204 17 0.8 7.6 5.0 54 14
## ID206 19 0.7 7.5 5.0 369 81
## ID208 20 0.7 7.8 5.1 59 16
## ID209 17 0.7 7.7 5.1 57 10
## AST.IU.L. ALP.IU.L. TBA.umol.L. Cholesterol.mg.dL.
## ID202 87 320 9 85
## ID203 110 325 8 87
## ID204 129 381 4 84
## ID206 450 401 13 82
## ID208 75 359 3 85
## ID209 80 335 13 88
Some predictions are very good, and others are not so good. The bootstrapped method did not outperform the straightforward pls method.