This is a Principal Component Analysis followed by a Partial Least Squares regression, drawing heavily from the mixOmics Case Studies.

In the liver toxicity case study, 64 observations of 3,116 gene expression values were analyzed to discover which genes account for the variance seen in the clinical variables, and could therefore be used to predict those clinical values.

Unfortunately, a cursory search of data sets from Scientific Data and Bioconductor did not turn up paired genomic and phenotypic data, so we will use the liver.toxicity data set that ships with mixOmics, which has exactly the characteristics we need.

Strategy

  1. Load liver.toxicity$gene data as predictors, and liver.toxicity$clinic data as outcomes.
  2. Summarize data, ensure it’s complete.
  3. Run PCA on predictors and outcomes.
  4. Apply nearZeroVar to gene data to drop genes unlikely to have sufficient variance to inform outcomes.
  5. Re-run PCA on predictors (out of curiosity, and only if some predictors are removed).
  6. Run an initial PLS regression and generate predictions.
  7. Evaluate predictions (using root mean squared error).
  8. Bootstrap the PLS regression/predictions (due to the low observation count: 64).
  9. Evaluate the bootstrapped predictions (RMSE).

Library Imports

if (!require("mixOmics")) install.packages("mixOmics")
#Note: recent mixOmics releases are distributed via Bioconductor,
#so BiocManager::install("mixOmics") may be needed instead
#Note: rgl installation failed-- installed XQuartz for X11
#https://www.xquartz.org/

Utility Functions

rmse

rmse() will accept the actual and predicted numeric values, and will return the root mean squared error.

rmse <- function(actual, predicted){
    return(sqrt(mean((actual-predicted)^2)))
}

metrics

metrics() will iterate over the variables in the predicted data frame to generate root mean squared error stats for each variable.

metrics <- function(actual, predicted){
    #Hold the data to return
    res_frame <- data.frame()
    #Create vector for holding values
    res_vec <- vector()
    for (i in 1:ncol(actual)){
        res_vec[length(res_vec)+1] <- rmse(actual[,i], predicted[,i])
    }
    
    #Bind
    res_frame <- rbind(res_frame, res_vec)
    colnames(res_frame) <- colnames(actual)
    
    return(res_frame)
}

bootstrap_pls

bootstrap_pls() randomly samples 50% of the observations from the predictors, fits a pls() regression with n_comp components, and generates clinical predictions for all observations from that model. This is repeated n times, keeping the predictions from component which_comp (a string of the form "dim comp_num").

bootstrap_pls <- function(x, y, n, n_comp, which_comp){
    #Frame for all predictor data
    x_frame <- data.frame()

    #Frame for all clinical data
    y_frame <- data.frame()

    #Frame for predictions
    pred_frame <- data.frame()

    #list() for results
    res <- list()

    for (i in 1:n){
        #Sample from all columns
        tmp_x <- x[sample(nrow(x), nrow(x)/2), ]

        #rbind predictors (mostly a waste of memory)
        #x_frame <- rbind(x_frame, cbind(row.names(tmp_x), tmp_x))

        #Pull clinical data for samples
        tmp_y <- y[which(row.names(y) %in% row.names(tmp_x)),]

        #rbind clinical data (mostly a waste of memory)
        #y_frame <- rbind(y_frame, cbind(row.names(tmp_y), tmp_y))

        #perform regression
        clin_pls <- pls(tmp_x, tmp_y, ncomp = n_comp, mode = "regression")

        #Predict for ALL values of x
        pls_pred <- as.data.frame(predict(clin_pls, x)$predict)

        #Subset predictions from component with which_comp
        pls_pred <- pls_pred[ ,grepl(which_comp, colnames(pls_pred))]


        #Assign proper column names
        colnames(pls_pred) <- colnames(y)

        pred_frame <- rbind(pred_frame, cbind(row.names(pls_pred), pls_pred))
    }

    #Create results object to return
    #res[["x_frame"]] <- x_frame
    #res[["y_frame"]] <- y_frame
    
    ## For now, only predictions-- waste of memory
    res[["pred_frame"]] <- pred_frame

    return(res)
}

Load Data & View

data("liver.toxicity")

#Structure of liver.toxicity, but not gene because it's >3k variables
str(liver.toxicity$clinic)
## 'data.frame':    64 obs. of  10 variables:
##  $ BUN.mg.dL.        : num  12 15 17 19 20 17 16 14 12 11 ...
##  $ Creat.mg.dL.      : num  0.6 0.7 0.8 0.7 0.7 0.7 0.7 0.6 0.7 0.7 ...
##  $ TP.g.dL.          : num  7.3 7.5 7.6 7.5 7.8 7.7 7.6 7.4 7.5 7.3 ...
##  $ ALB.g.dL.         : num  4.9 5 5 5 5.1 5.1 5.1 4.9 5.1 4.9 ...
##  $ ALT.IU.L.         : int  73 84 54 369 59 57 67 57 54 97 ...
##  $ SDH.IU.L.         : int  23 23 14 81 16 10 21 19 20 26 ...
##  $ AST.IU.L.         : int  87 110 129 450 75 80 93 90 76 135 ...
##  $ ALP.IU.L.         : int  320 325 381 401 359 335 324 309 317 251 ...
##  $ TBA.umol.L.       : int  9 8 4 13 3 13 8 11 6 7 ...
##  $ Cholesterol.mg.dL.: int  85 87 84 82 85 88 82 92 86 77 ...
str(liver.toxicity$treatment)
## 'data.frame':    64 obs. of  4 variables:
##  $ Animal.Number  : int  202 203 204 206 208 209 210 211 212 213 ...
##  $ Treatment.Group: Factor w/ 16 levels "1500MG/K18 HR",..: 16 16 13 13 14 14 15 15 15 16 ...
##  $ Dose.Group     : num  50 50 50 50 50 50 50 50 50 50 ...
##  $ Time.Group     : int  6 6 18 18 24 24 48 48 48 6 ...

Subsetting and Exploration

In this instance, we will be using genomic data to predict clinical phenotypes, so our predictors, x, will be analyzed alongside our clinical outcomes, y.

x <- liver.toxicity$gene
y <- liver.toxicity$clinic

head(row.names(x))
## [1] "ID202" "ID203" "ID204" "ID206" "ID208" "ID209"
head(row.names(y))
## [1] "202" "203" "204" "206" "208" "209"
#Need to correct format so IDs are uniform
row.names(y) <- paste0("ID",row.names(y))
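
With the IDs reformatted, a quick sanity check (a minimal sketch using base R) can confirm that the two data frames now share the same set of row names:

#Verify that every gene expression ID has a matching clinical record
all(sort(row.names(x)) == sort(row.names(y)))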

Because x contains more than 3,000 variables, I will only check that its data are complete; y, with just ten variables, can be summarized in full.

all(complete.cases(x))
## [1] TRUE
summary(y)
##    BUN.mg.dL.     Creat.mg.dL.       TP.g.dL.       ALB.g.dL.    
##  Min.   :11.00   Min.   :0.6000   Min.   :6.300   Min.   :4.400  
##  1st Qu.:14.00   1st Qu.:0.6750   1st Qu.:7.300   1st Qu.:4.900  
##  Median :16.50   Median :0.7000   Median :7.500   Median :5.000  
##  Mean   :17.04   Mean   :0.6877   Mean   :7.475   Mean   :5.014  
##  3rd Qu.:20.00   3rd Qu.:0.7000   3rd Qu.:7.725   3rd Qu.:5.200  
##  Max.   :28.00   Max.   :0.8000   Max.   :8.400   Max.   :5.600  
##    ALT.IU.L.         SDH.IU.L.       AST.IU.L.         ALP.IU.L.    
##  Min.   :   48.0   Min.   :  2.0   Min.   :   75.0   Min.   :141.0  
##  1st Qu.:   63.0   1st Qu.: 15.0   1st Qu.:   89.0   1st Qu.:303.0  
##  Median :   94.5   Median : 22.5   Median :  151.5   Median :332.5  
##  Mean   : 1677.8   Mean   : 79.8   Mean   : 3375.3   Mean   :338.2  
##  3rd Qu.: 1137.0   3rd Qu.: 61.0   3rd Qu.: 1502.8   3rd Qu.:380.2  
##  Max.   :15180.0   Max.   :995.0   Max.   :27075.0   Max.   :500.0  
##   TBA.umol.L.    Cholesterol.mg.dL.
##  Min.   : 3.00   Min.   : 42.00    
##  1st Qu.: 5.00   1st Qu.: 78.00    
##  Median : 9.00   Median : 84.50    
##  Mean   :20.27   Mean   : 82.62    
##  3rd Qu.:26.00   3rd Qu.: 91.25    
##  Max.   :98.00   Max.   :111.00

Initial Principal Component Analysis

We will run a PCA on both the predictors (the genes) and the outcomes (the clinical values). The data will be normalized by centering and scaling. For the number of components at the outset, we’ll use 10.

x_pca <- pca(x, ncomp = 10, center = TRUE, scale = TRUE)
x_pca$explained_variance
##        PC1        PC2        PC3        PC4        PC5        PC6 
## 0.28073776 0.14862034 0.07963801 0.05377996 0.04555745 0.03925671 
##        PC7        PC8        PC9       PC10 
## 0.03490533 0.02495056 0.02380736 0.01811799
y_pca <- pca(y, ncomp = 10, center = TRUE, scale = TRUE)
y_pca$explained_variance
##          PC1          PC2          PC3          PC4          PC5 
## 0.4083260500 0.2219951947 0.1247262301 0.0894716350 0.0675342967 
##          PC6          PC7          PC8          PC9         PC10 
## 0.0483596842 0.0261923034 0.0089045216 0.0036440620 0.0008460224

For the raw predictors, the first three principal components explain roughly half of the variance (about 51%), while the first two principal components of the outcomes explain about 63% of theirs.
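
To make that concrete, the cumulative proportions of explained variance can be inspected directly (a quick check reusing the pca objects above):

#Cumulative proportion of variance explained by successive components
cumsum(x_pca$explained_variance)
cumsum(y_pca$explained_variance)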

Plotting

plot(x_pca)

plotVar(x_pca, comp = c(1, 2), var.names = FALSE, title = "Genomic Values, PCA comp 1 - 2")

plot(y_pca)

plotVar(y_pca, comp = c(1, 2), var.names = TRUE, title = "Clinical Values, PCA comp 1 - 2")

plotIndiv(x_pca, comp = c(1, 2), group = liver.toxicity$treatment$Time.Group, ellipse = TRUE,
          legend = TRUE, title = "Genomic Values, PCA comp 1 - 2")

plotIndiv(y_pca, comp = c(1, 2), group = liver.toxicity$treatment$Time.Group,
          ind.names = liver.toxicity$treatment$Dose.Group, ellipse = TRUE,
          legend = TRUE, title = "Clinical Values, PCA comp 1 - 2")

Near-Zero Variance Removal

Apply nearZeroVar, conveniently taken from caret, to the genomic data.

zer_var <- nearZeroVar(x)
print(zer_var)
## $Position
## integer(0)
## 
## $Metrics
## [1] freqRatio     percentUnique
## <0 rows> (or 0-length row.names)

No predictors qualify for removal.
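
Had any predictors been flagged, they could be dropped by position before modelling. A minimal sketch, assuming zer_var$Position holds the offending column indices as in the output above:

#Drop near-zero-variance genes, if any were identified
if (length(zer_var$Position) > 0) {
    x <- x[ , -zer_var$Position]
}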

Initial Partial Least Squares

Initially, we will use three components (as determined from the PCA).

clin_pls <- pls(x, y, ncomp = 3, mode = "regression")

## Note: Results are in terms of unit variance-- the further away from the center, the stronger the association
plotVar(clin_pls, overlap = F)

Plotting

Components

plotIndiv(clin_pls, group = liver.toxicity$treatment$Time.Group, ind.names = liver.toxicity$treatment$Dose.Group, legend = T, ellipse = T)

Clustered Image Maps

Note: CIMs are not rendering to markdown.

##Flush graphics device
#graphics.off()
#par(mar=c(1,1,1,1))

#cim(clin_pls, comp = 1:3, xlab = "clinic", ylab = "genes", margins = c(10, 10), title = "Clinical PLS Clustered Image Map")

Tuning and Evaluation

perf() will be used to compute evaluation criteria for the PLS model. For cross-validation, Mfold will be used rather than leave-one-out, with nrepeat set to 50. We will set progressBar to FALSE, because the calculations aren’t large enough to warrant it.

tune_pls <- perf(clin_pls, validation = "Mfold", folds = 10, nrepeat = 50, progressBar = FALSE)

plot(tune_pls$Q2.total, ylim = c(-0.5, 0.5), 
    main = "PLS Q^2 Values by Principal Component", 
    xlab = "Principal Component", 
    ylab = "Q^2 Value")
abline(h = 0.0975)

Note: Tenenhaus (1998) suggests a cutoff of 0.0975 for Q2 values; any component with a value below the cutoff is not included in the model.
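
As a quick programmatic check against that cutoff (a small sketch, assuming tune_pls$Q2.total is indexed by component as in the plot above):

#Identify the components whose Q^2 exceeds the 0.0975 cutoff
which(tune_pls$Q2.total > 0.0975)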

Coefficient of Determination

#Note: to grab component names: attr(tune_pls$Q2.total, "dimnames")[[1]]
print(tune_pls$R2)
##                          1 comp       2 comp     3 comp
## BUN.mg.dL.         0.5072290840 5.486573e-01 0.53246529
## Creat.mg.dL.       0.0147531561 7.767710e-03 0.01255874
## TP.g.dL.           0.0164111731 6.372609e-02 0.24417729
## ALB.g.dL.          0.0001900945 1.732280e-01 0.39641077
## ALT.IU.L.          0.6117601848 7.278652e-01 0.72898354
## SDH.IU.L.          0.0128129810 1.801484e-06 0.04153585
## AST.IU.L.          0.5649265536 6.951631e-01 0.70516857
## ALP.IU.L.          0.1100127222 9.985983e-02 0.08351555
## TBA.umol.L.        0.5675273723 5.534675e-01 0.68324433
## Cholesterol.mg.dL. 0.2070860171 3.198604e-01 0.30729239

Initial Predictions

### PLS
#I only want the predictions, and in data frame format
pls_pred <- as.data.frame(predict(clin_pls, x)$predict)

pls_pred_one <- pls_pred[, grepl("dim 1", colnames(pls_pred))]

pls_pred_two <- pls_pred[, grepl("dim 2", colnames(pls_pred))]

Initial Performance (RMSE)

performance <- rbind(metrics(y, pls_pred_one),
                     metrics(y, pls_pred_two))

row.names(performance) <- c("PLS Comp One",
                            "PLS Comp Two")

cat("Performance (RMSE):\n")
## Performance (RMSE):
print(performance)
##              BUN.mg.dL. Creat.mg.dL.  TP.g.dL. ALB.g.dL. ALT.IU.L.
## PLS Comp One   2.880435   0.05771321 0.3870113 0.2545776  1924.443
## PLS Comp Two   2.730955   0.05759611 0.3588442 0.2188154  1551.441
##              SDH.IU.L. AST.IU.L. ALP.IU.L. TBA.umol.L. Cholesterol.mg.dL.
## PLS Comp One  155.9711  4309.313  53.73584    14.44731           11.89491
## PLS Comp Two  153.6316  3492.627  53.65465    14.43709           10.43855

Bootstrapping

Since we have so few observations (64), bootstrapping may be more appropriate for evaluating performance. We will resample 500 times, fitting each model with 3 components.

We will assess the results from the first component.

system.time(res <- bootstrap_pls(x, y, 500, 3, "dim 1"))
##    user  system elapsed 
##  57.321   3.544  61.667
str(res)
## List of 1
##  $ pred_frame:'data.frame':  32000 obs. of  11 variables:
##   ..$ row.names(pls_pred): Factor w/ 64 levels "ID202","ID203",..: 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ BUN.mg.dL.         : num [1:32000] 19 18.8 15.7 15.8 18.1 ...
##   ..$ Creat.mg.dL.       : num [1:32000] 0.723 0.719 0.666 0.669 0.707 ...
##   ..$ TP.g.dL.           : num [1:32000] 7.56 7.54 7.29 7.31 7.49 ...
##   ..$ ALB.g.dL.          : num [1:32000] 5.09 5.07 4.87 4.88 5.02 ...
##   ..$ ALT.IU.L.          : num [1:32000] 1482 1463 1203 1217 1404 ...
##   ..$ SDH.IU.L.          : num [1:32000] 114 114 116 116 115 ...
##   ..$ AST.IU.L.          : num [1:32000] 3037 3011 2674 2693 2935 ...
##   ..$ ALP.IU.L.          : num [1:32000] 371 368 327 329 358 ...
##   ..$ TBA.umol.L.        : num [1:32000] 19 19.3 23.9 23.6 20.3 ...
##   ..$ Cholesterol.mg.dL. : num [1:32000] 80.1 80.4 85.1 84.9 81.5 ...

Results Exploration

Let’s subset a random ID and take a look at some of the results.

tmp <- as.character(res$pred_frame[sample(nrow(res$pred_frame), 1), ]$`row.names(pls_pred)`)

tmp <- res$pred_frame[(res$pred_frame$`row.names(pls_pred)` == tmp),]

hist(tmp$BUN.mg.dL.)

hist(tmp$Cholesterol.mg.dL.)

rm(tmp)

Looks normal.
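
For a more formal check than eyeballing the histograms, a Shapiro-Wilk test could be applied to the same kind of subset (a sketch only; tmp was removed above, so it is re-created here):

#Re-subset a random ID and test its bootstrapped BUN predictions for normality
tmp_id <- as.character(res$pred_frame[sample(nrow(res$pred_frame), 1), ]$`row.names(pls_pred)`)
tmp <- res$pred_frame[res$pred_frame$`row.names(pls_pred)` == tmp_id, ]
shapiro.test(tmp$BUN.mg.dL.)
rm(tmp, tmp_id)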

Prediction Performance

## Subset predictions
res <- res$pred_frame

## Rename ID column, convert to character
colnames(res)[1] <- "ID"
res$ID <- as.character(res$ID)

## Aggregate means
clin_res_boot <- aggregate(res, by = list(res$ID), FUN = mean)

## Clean
row.names(clin_res_boot) <- clin_res_boot$Group.1
clin_res_boot <- clin_res_boot[,3:ncol(clin_res_boot)]

## Performance
cat("Root Mean Square Errors:")
## Root Mean Square Errors:
metrics(y, clin_res_boot)
##   BUN.mg.dL. Creat.mg.dL.  TP.g.dL. ALB.g.dL. ALT.IU.L. SDH.IU.L.
## 1   4.268934   0.05988033 0.3874173 0.2581309  3361.192  159.9192
##   AST.IU.L. ALP.IU.L. TBA.umol.L. Cholesterol.mg.dL.
## 1  7078.354  60.23563     23.0043           14.01628
## Actual vs Predicted Values
cat("Predicted:\n")
## Predicted:
print(head(round(clin_res_boot, 2)))
##       BUN.mg.dL. Creat.mg.dL. TP.g.dL. ALB.g.dL. ALT.IU.L. SDH.IU.L.
## ID202      16.95         0.69     7.47      5.01   1556.34     81.58
## ID203      16.98         0.69     7.47      5.01   1589.40     81.64
## ID204      17.02         0.69     7.47      5.01   1631.84     78.78
## ID206      16.98         0.69     7.47      5.01   1589.43     78.25
## ID208      16.93         0.69     7.47      5.01   1539.99     80.44
## ID209      16.99         0.69     7.48      5.01   1570.10     81.93
##       AST.IU.L. ALP.IU.L. TBA.umol.L. Cholesterol.mg.dL.
## ID202   3126.74    337.81       19.55              82.91
## ID203   3196.23    338.12       19.73              82.82
## ID204   3260.03    337.87       19.76              82.71
## ID206   3170.02    337.64       19.44              82.82
## ID208   3083.86    337.52       19.32              82.95
## ID209   3160.15    338.12       19.65              82.86
cat("Actual:\n")
## Actual:
print(head(y))
##       BUN.mg.dL. Creat.mg.dL. TP.g.dL. ALB.g.dL. ALT.IU.L. SDH.IU.L.
## ID202         12          0.6      7.3       4.9        73        23
## ID203         15          0.7      7.5       5.0        84        23
## ID204         17          0.8      7.6       5.0        54        14
## ID206         19          0.7      7.5       5.0       369        81
## ID208         20          0.7      7.8       5.1        59        16
## ID209         17          0.7      7.7       5.1        57        10
##       AST.IU.L. ALP.IU.L. TBA.umol.L. Cholesterol.mg.dL.
## ID202        87       320           9                 85
## ID203       110       325           8                 87
## ID204       129       381           4                 84
## ID206       450       401          13                 82
## ID208        75       359           3                 85
## ID209        80       335          13                 88

Some predictions are quite good, and others are not. Comparing the RMSE tables, the bootstrapped predictions are worse for every clinical variable, so the bootstrapped method did not outperform the straightforward pls regression.
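
To see that comparison in one place, the two RMSE tables can be stacked (a small sketch reusing the performance and clin_res_boot objects computed above):

#Combine the single-fit and bootstrapped RMSE values for a side-by-side view
boot_perf <- metrics(y, clin_res_boot)
row.names(boot_perf) <- "Bootstrapped PLS"
print(round(rbind(performance, boot_perf), 2))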