Project 2 (draft due 11/26, final due 12/3)
Introduction
The recent development of several biochemical analytical platforms (DNA and RNA arrays, next-generation sequencing, proteomic MS, and metabolomic NMR and LCMS) hold immense potential for identifying biomarkers of disease. Although there is a relatively low number of subjects, the data provides thousands of variables, a characteristic common in many types of genomic and other “–omic” datasets. Methods utilzing these highly dimensional analyses is often cumbersome; statistical models are commonly subject to overfitting.
The goal of the current project was to generate metabolomic (lipid) profiles from the cohort. The data discussed here is for a proof-of-concept study that used blood serum samples from a previous cohort of healthy volunteers collected several years prior. The blood serum samples were collected from 158 individuals and later pooled into 35 groups with four or five individuals which were matched on gender, race, and smoking status. Using analytical chemistry techniques, specifically liquid chromotography mass spectrometry (LCMS), lipid profiles were generated from the pooled 35 samples.
The resulting data includes the mass of detected features in a serum sample as well as retention time, along with the relative absorbanc, which can be interpreted as concentration, of each feature. Over 3000 features were generated in the positive and negative mass spectormetry (MS) ionization modes, where each feature represenets a specific ion of a compound in the subjects' serum.
Biostatisticians at the University of California at Berkeley have analyzed our data using univariate analysis (ANOVA) with Benjamini-Hochberg correction for multiple tests. This analysis resulted in approximately 30 features that were shown to have biologically significant differences in concentration between the various groups.
For the current project, I will explore the previously described dataset through several statistical methods discussed in this course, including multi-variate analysis. The aim is to understand the trends within the dataset as well as to perform different analyses that would identify discriminating features among the thousands detected.
These statistical analyses may provide a framework for future studies of diseased and non-diseased subjects that are capable of detecting discriminating predictive biomarkers.
Demonstration
library(Biobase) #several of these packages are from bioconductor
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, as.data.frame, cbind, colnames, duplicated,
## eval, Filter, Find, get, intersect, lapply, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rep.int, rownames, sapply,
## setdiff, sort, table, tapply, union, unique, unlist
##
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
library(ggplot2)
library(ggbiplot)
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
library(cluster)
library(reshape2)
library(limma)
library(glmnet)
## Loading required package: Matrix
## Loading required package: lattice
## Loaded glmnet 1.9-5
library(lasso2)
## R Package to solve regression problems while imposing
## an L1 constraint on the parameters. Based on S-plus Release 2.1
## Copyright (C) 1998, 1999
## Justin Lokhorst <jlokhors@stats.adelaide.edu.au>
## Berwin A. Turlach <bturlach@stats.adelaide.edu.au>
## Bill Venables <wvenable@stats.adelaide.edu.au>
##
## Copyright (C) 2002
## Martin Maechler <maechler@stat.math.ethz.ch>
setwd("~/Dropbox/PH251D/final-project-1/")
pos <- read.csv("../data/xiaoming_FT_positive_final.csv", row.names = 1, nrows = 2862)
head(pos)
## X1101 X1102 X1103 X1104 X1201 X1202 X1203
## 756.55692\t_\t21.459 30.030 28.86 35.390 37.07 38.115 29.69 35.769
## 759.64089\t_\t22.003 8.576 10.69 8.899 10.16 8.242 12.42 9.116
## 807.57778\t_\t21.597 18.335 17.63 13.978 20.22 15.279 19.42 13.061
## 783.57649\t_\t21.621 55.250 58.21 38.004 47.40 39.860 56.98 35.046
## 834.60559\t_\t21.861 14.166 12.87 11.378 17.25 13.694 15.12 12.822
## 813.68440\t_\t22.221 88.078 71.84 103.003 94.74 108.334 84.81 123.152
## X1204 X1205 X2101 X2102 X2103 X2104 X2201
## 756.55692\t_\t21.459 29.148 31.05 37.000 35.381 37.866 37.173 42.47
## 759.64089\t_\t22.003 9.962 10.67 9.092 8.752 9.883 9.095 10.27
## 807.57778\t_\t21.597 13.084 19.69 17.741 14.581 12.764 16.625 14.01
## 783.57649\t_\t21.621 34.413 44.83 52.295 39.139 43.340 39.730 38.96
## 834.60559\t_\t21.861 10.900 17.73 13.052 10.924 9.808 13.485 11.19
## 813.68440\t_\t22.221 98.114 128.83 75.811 91.681 65.383 101.850 109.54
## X2202 X2203 X1111 X1112 X1113 X1114 X1115
## 756.55692\t_\t21.459 42.63 42.146 32.873 31.773 45.03 32.333 33.311
## 759.64089\t_\t22.003 8.93 9.839 8.519 5.664 10.53 7.225 8.318
## 807.57778\t_\t21.597 14.33 11.172 23.030 10.889 16.42 9.703 13.645
## 783.57649\t_\t21.621 42.31 26.370 59.969 34.928 48.82 37.872 42.715
## 834.60559\t_\t21.861 11.44 9.089 20.032 9.836 13.17 7.494 12.517
## 813.68440\t_\t22.221 87.68 116.902 108.099 92.528 124.28 116.061 104.213
## X1211 X1212 X1213 X1214 X2111 X2112 X2113
## 756.55692\t_\t21.459 42.101 34.268 30.152 32.969 25.615 29.545 30.16
## 759.64089\t_\t22.003 7.388 7.831 8.442 8.211 7.526 6.773 10.02
## 807.57778\t_\t21.597 16.473 18.128 15.740 13.842 13.814 13.281 15.52
## 783.57649\t_\t21.621 41.421 48.785 54.010 46.228 45.523 50.043 43.26
## 834.60559\t_\t21.861 13.707 15.951 12.011 10.495 10.177 10.234 11.94
## 813.68440\t_\t22.221 111.205 91.691 85.984 94.020 109.693 57.345 73.88
## X2114 X2115 X2211 X2212 X2213 X2214 X2215
## 756.55692\t_\t21.459 43.67 38.924 44.507 34.060 44.822 34.532 33.375
## 759.64089\t_\t22.003 10.30 5.777 9.392 9.189 8.448 8.141 9.898
## 807.57778\t_\t21.597 13.29 11.853 13.259 14.697 12.376 10.512 14.233
## 783.57649\t_\t21.621 37.27 36.039 39.974 50.265 38.171 29.468 45.278
## 834.60559\t_\t21.861 10.51 10.063 11.871 10.906 10.557 8.314 11.894
## 813.68440\t_\t22.221 118.93 85.403 105.167 96.083 99.385 94.045 97.955
# Only analyzing positive mode data (the majority)
Data imported successfully. Group numbers correspond to the following
X1 X2 X3 X4
First, I will make an expression set for the data. Expression sets are a convenient class of data storage for genomic data, and other large data sets, as it can hold the large data frame info along with co-variate information (pData).
# First make expression set to store factor data.
CreateExpressionSet <- function(expr, title) {
b <- grepl(".1...", names(expr)) # race
m <- grepl("..1..", names(expr)) # gender
s <- grepl("...0.", names(expr)) # smoker
pData <- data.frame(race = factor(b, levels = c(TRUE, FALSE), labels = c("Black",
"White")), gender = factor(m, levels = c(TRUE, FALSE), labels = c("Male",
"Female")), smoker = factor(s, levels = c(TRUE, FALSE), labels = c("Smoker",
"Non-smoker")), row.names = names(expr))
metadata <- data.frame(labelDescription = c("Metasubject race", "Metasubject gender",
"Metasubject smoking status"), row.names = c("race", "gender", "smoker"))
phenoData <- new("AnnotatedDataFrame", data = pData, varMetadata = metadata)
ExpressionSet(assayData = as.matrix(expr), phenoData = phenoData)
}
ds <- list()
ds$raw$title <- "Raw"
ds$raw$pos <- CreateExpressionSet(pos)
save(ds, file = "../data/metabolite.RData")
summary((matrix(exprs(ds$raw$pos))))
## V1
## Min. : 0.0
## 1st Qu.: 0.2
## Median : 0.4
## Mean : 3.5
## 3rd Qu.: 1.4
## Max. :517.6
sum(matrix(exprs(ds$raw$pos)) == 0)
## [1] 1129
PlotSets <- function(set) {
boxplot(exprs(set$pos), main = paste(set$title, "positive ions"))
}
PlotSets(ds$raw)
exprs(ds$raw$pos)[1:5, 1:5]
## X1101 X1102 X1103 X1104 X1201
## 756.55692\t_\t21.459 30.030 28.86 35.390 37.07 38.115
## 759.64089\t_\t22.003 8.576 10.69 8.899 10.16 8.242
## 807.57778\t_\t21.597 18.335 17.63 13.978 20.22 15.279
## 783.57649\t_\t21.621 55.250 58.21 38.004 47.40 39.860
## 834.60559\t_\t21.861 14.166 12.87 11.378 17.25 13.694
pca <- prcomp(exprs(ds$raw$pos))
ggbiplot(pca)
heatmap(as.matrix(pos), labRow = FALSE)
Next, I will try an unsupervised method for looking for trends. Clustering should show trends in our data. I will use the PAM (Partitioning Around Medoids) method, which uses medoids, or central features, to group closest features according to the distance matrix:
# Clustering One-minus-correlation distance matrix
r <- cor(as.matrix(pos))
dPos <- 1 - r
dPos[1:5, 1:5]
## X1101 X1102 X1103 X1104 X1201
## X1101 0.00000 0.08164 0.2791 0.15098 0.20865
## X1102 0.08164 0.00000 0.1296 0.07306 0.06823
## X1103 0.27911 0.12959 0.0000 0.12648 0.03750
## X1104 0.15098 0.07306 0.1265 0.00000 0.08729
## X1201 0.20865 0.06823 0.0375 0.08729 0.00000
# PAM clustering PAM, K=2 - Two medoids
pamK2 <- pam(as.dist(dPos), k = 2, diss = TRUE)
pamK2
## Medoids:
## ID
## [1,] "24" "X1213"
## [2,] "9" "X1205"
## Clustering vector:
## X1101 X1102 X1103 X1104 X1201 X1202 X1203 X1204 X1205 X2101 X2102 X2103
## 1 2 2 2 2 1 2 2 2 1 1 1
## X2104 X2201 X2202 X2203 X1111 X1112 X1113 X1114 X1115 X1211 X1212 X1213
## 1 2 1 2 1 1 1 1 2 1 1 1
## X1214 X2111 X2112 X2113 X2114 X2115 X2211 X2212 X2213 X2214 X2215
## 2 2 1 1 2 2 2 2 1 2 2
## Objective function:
## build swap
## 0.05962 0.05594
##
## Available components:
## [1] "medoids" "id.med" "clustering" "objective" "isolation"
## [6] "clusinfo" "silinfo" "diss" "call"
# PAM, K=3 - Three medoids
pamK3 <- pam(as.dist(dPos), k = 3, diss = TRUE)
# Graphical summaries
clusplot(dPos, pamK2$clustering, diss = TRUE, labels = 3, col.p = 1, main = "Lipidomic data, positive mode bivariate cluster plot for PAM \n K=2 clusters, one-minus-correlation distance")
plot(pamK2, which.plots = 2, main = "Lipidomic data, positive mode. Silhouette plot for PAM \n K=2 clusters, one-minus-correlation distance")
No great classifying ability is seen with two medoids. From the silouette plots, we see that the variables in the thousands, hundreds, and tens do not seem to be grouped in any pattern.
Now I try three medoids:
clusplot(dPos, pamK3$clustering, diss = TRUE, labels = 3, col.p = 1, main = "Lipidomic data, positive mode bivariate cluster plot for PAM \n K=3 clusters, one-minus-correlation distance")
plot(pamK3, which.plots = 2, main = "Lipidomic data, positive mode. Silhouette plot for PAM \n K=2 clusters, one-minus-correlation distance")
table(pamK2$clustering, colnames(pos))
##
## X1101 X1102 X1103 X1104 X1111 X1112 X1113 X1114 X1115 X1201 X1202
## 1 1 0 0 0 1 1 1 1 0 0 1
## 2 0 1 1 1 0 0 0 0 1 1 0
##
## X1203 X1204 X1205 X1211 X1212 X1213 X1214 X2101 X2102 X2103 X2104
## 1 0 0 0 1 1 1 0 1 1 1 1
## 2 1 1 1 0 0 0 1 0 0 0 0
##
## X2111 X2112 X2113 X2114 X2115 X2201 X2202 X2203 X2211 X2212 X2213
## 1 0 1 1 0 0 0 1 0 0 0 1
## 2 1 0 0 1 1 1 0 1 1 1 0
##
## X2214 X2215
## 1 0 0
## 2 1 1
table(pamK3$clustering, colnames(pos))
##
## X1101 X1102 X1103 X1104 X1111 X1112 X1113 X1114 X1115 X1201 X1202
## 1 1 0 0 0 1 1 1 1 0 0 1
## 2 0 1 0 1 0 0 0 0 1 0 0
## 3 0 0 1 0 0 0 0 0 0 1 0
##
## X1203 X1204 X1205 X1211 X1212 X1213 X1214 X2101 X2102 X2103 X2104
## 1 0 0 0 1 1 1 0 1 1 1 1
## 2 1 1 1 0 0 0 1 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0
##
## X2111 X2112 X2113 X2114 X2115 X2201 X2202 X2203 X2211 X2212 X2213
## 1 0 1 1 0 0 0 1 0 0 0 0
## 2 1 0 0 1 0 1 0 0 1 1 0
## 3 0 0 0 0 1 0 0 1 0 0 1
##
## X2214 X2215
## 1 0 0
## 2 1 1
## 3 0 0
Still not great clustering, but can see four out of five groups (with the same race, gender, smoking status covariates) are often clustered together. With 3000+ variables, variable picking would be a good idea.
Since most of the data is in the negative matrix- I will limit analysis to the positive ion data.
#LASSO
lasso <- function(set, fct) {
x <- t(exprs(set))
y <- phenoData(set)[[fct]]
glmmod <- glmnet(x = x, y = y, alpha = 1, family = "binomial")
plot(glmmod, xvar = "lambda")
cv.glmmod <- cv.glmnet(x = x, y = y, alpha = 1, family = "binomial")
plot(cv.glmmod)
best_lambda <- cv.glmmod$lambda.min
print(best_lambda)
beta <- as.vector(t(coef(cv.glmmod, s = "lambda.min")))
print(which(beta != 0))
}
layout(matrix(1:2, 1, 2, byrow = TRUE))
lassoGender <- lasso(ds$raw$pos, "gender")
## [1] 0.1466
## [1] 1 678 820 992 1347 2496
str(lassoGender)
## int [1:6] 1 678 820 992 1347 2496
lGenderVars <- lassoGender - 1
heatmap(as.matrix(pos[lGenderVars, ]))
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
lassoRace <- lasso(ds$raw$pos, "race")
## [1] 0.0142
## [1] 1 205 292 317 408 729 1015 1023 1102 1318 1447 1621 2050 2203
## [15] 2592 2675 2807
lRaceVars <- lassoRace - 1
heatmap(as.matrix(pos[lRaceVars, ]), labRow = FALSE)
`?`(heatmap)
lassoSmoker <- lasso(ds$raw$pos, "smoker")
## [1] 0.00499
## [1] 1 103 168 222 397 448 631 744 966 1236 1876 1895 2242 2304
## [15] 2347 2680 2768 2852
lSmokeVars <- lassoSmoker - 1
heatmap(as.matrix(pos[lSmokeVars, ]), labRow = FALSE)
Discussion
Among the plethora of different techniques for this kind of data, it can be difficult to decide which ones are most appropriate. By trying a few techniques, especially while a beginner, one can convince themselves of the semi-reproducibility of significant features. In other words, when trying to determine the most discriminating variables. These were consistent with the variables previous analyses determined with ANOVA and BH correction.
This can be interpretted that these methods find the similar differentiating variables, thus somewhat validating this type of analysis. Further studies are needed to see if these features may be causal. We plan to analyze serum samples from colon cancer patients and controls. The samples were taken approximately 20 years before disease status was known, so if we find differentiating features in these samples, biomarkers for cancer screening are possible.