Kelsi Perttula

Final Project PH 251D - part 2

Project 2 (draft due 11/26, final due 12/3)

Introduction

The recent development of several biochemical analytical platforms (DNA and RNA arrays, next-generation sequencing, proteomic MS, and metabolomic NMR and LCMS) hold immense potential for identifying biomarkers of disease. Although there is a relatively low number of subjects, the data provides thousands of variables, a characteristic common in many types of genomic and other “–omic” datasets. Methods utilzing these highly dimensional analyses is often cumbersome; statistical models are commonly subject to overfitting.

The goal of the current project was to generate metabolomic (lipid) profiles from the cohort. The data discussed here is for a proof-of-concept study that used blood serum samples from a previous cohort of healthy volunteers collected several years prior. The blood serum samples were collected from 158 individuals and later pooled into 35 groups with four or five individuals which were matched on gender, race, and smoking status. Using analytical chemistry techniques, specifically liquid chromotography mass spectrometry (LCMS), lipid profiles were generated from the pooled 35 samples.

The resulting data includes the mass of detected features in a serum sample as well as retention time, along with the relative absorbanc, which can be interpreted as concentration, of each feature. Over 3000 features were generated in the positive and negative mass spectormetry (MS) ionization modes, where each feature represenets a specific ion of a compound in the subjects' serum.

Biostatisticians at the University of California at Berkeley have analyzed our data using univariate analysis (ANOVA) with Benjamini-Hochberg correction for multiple tests. This analysis resulted in approximately 30 features that were shown to have biologically significant differences in concentration between the various groups.

For the current project, I will explore the previously described dataset through several statistical methods discussed in this course, including multi-variate analysis. The aim is to understand the trends within the dataset as well as to perform different analyses that would identify discriminating features among the thousands detected.

These statistical analyses may provide a framework for future studies of diseased and non-diseased subjects that are capable of detecting discriminating predictive biomarkers.

Demonstration

library(Biobase)  #several of these packages are from bioconductor
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, as.data.frame, cbind, colnames, duplicated,
##     eval, Filter, Find, get, intersect, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rep.int, rownames, sapply,
##     setdiff, sort, table, tapply, union, unique, unlist
## 
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
library(ggplot2)
library(ggbiplot)
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
library(cluster)
library(reshape2)
library(limma)
library(glmnet)
## Loading required package: Matrix
## Loading required package: lattice
## Loaded glmnet 1.9-5
library(lasso2)
## R Package to solve regression problems while imposing
##   an L1 constraint on the parameters. Based on S-plus Release 2.1
## Copyright (C) 1998, 1999
## Justin Lokhorst   <jlokhors@stats.adelaide.edu.au>
## Berwin A. Turlach <bturlach@stats.adelaide.edu.au>
## Bill Venables     <wvenable@stats.adelaide.edu.au>
## 
## Copyright (C) 2002
## Martin Maechler <maechler@stat.math.ethz.ch>

setwd("~/Dropbox/PH251D/final-project-1/")

pos <- read.csv("../data/xiaoming_FT_positive_final.csv", row.names = 1, nrows = 2862)
head(pos)
##                       X1101 X1102   X1103 X1104   X1201 X1202   X1203
## 756.55692\t_\t21.459 30.030 28.86  35.390 37.07  38.115 29.69  35.769
## 759.64089\t_\t22.003  8.576 10.69   8.899 10.16   8.242 12.42   9.116
## 807.57778\t_\t21.597 18.335 17.63  13.978 20.22  15.279 19.42  13.061
## 783.57649\t_\t21.621 55.250 58.21  38.004 47.40  39.860 56.98  35.046
## 834.60559\t_\t21.861 14.166 12.87  11.378 17.25  13.694 15.12  12.822
## 813.68440\t_\t22.221 88.078 71.84 103.003 94.74 108.334 84.81 123.152
##                       X1204  X1205  X2101  X2102  X2103   X2104  X2201
## 756.55692\t_\t21.459 29.148  31.05 37.000 35.381 37.866  37.173  42.47
## 759.64089\t_\t22.003  9.962  10.67  9.092  8.752  9.883   9.095  10.27
## 807.57778\t_\t21.597 13.084  19.69 17.741 14.581 12.764  16.625  14.01
## 783.57649\t_\t21.621 34.413  44.83 52.295 39.139 43.340  39.730  38.96
## 834.60559\t_\t21.861 10.900  17.73 13.052 10.924  9.808  13.485  11.19
## 813.68440\t_\t22.221 98.114 128.83 75.811 91.681 65.383 101.850 109.54
##                      X2202   X2203   X1111  X1112  X1113   X1114   X1115
## 756.55692\t_\t21.459 42.63  42.146  32.873 31.773  45.03  32.333  33.311
## 759.64089\t_\t22.003  8.93   9.839   8.519  5.664  10.53   7.225   8.318
## 807.57778\t_\t21.597 14.33  11.172  23.030 10.889  16.42   9.703  13.645
## 783.57649\t_\t21.621 42.31  26.370  59.969 34.928  48.82  37.872  42.715
## 834.60559\t_\t21.861 11.44   9.089  20.032  9.836  13.17   7.494  12.517
## 813.68440\t_\t22.221 87.68 116.902 108.099 92.528 124.28 116.061 104.213
##                        X1211  X1212  X1213  X1214   X2111  X2112 X2113
## 756.55692\t_\t21.459  42.101 34.268 30.152 32.969  25.615 29.545 30.16
## 759.64089\t_\t22.003   7.388  7.831  8.442  8.211   7.526  6.773 10.02
## 807.57778\t_\t21.597  16.473 18.128 15.740 13.842  13.814 13.281 15.52
## 783.57649\t_\t21.621  41.421 48.785 54.010 46.228  45.523 50.043 43.26
## 834.60559\t_\t21.861  13.707 15.951 12.011 10.495  10.177 10.234 11.94
## 813.68440\t_\t22.221 111.205 91.691 85.984 94.020 109.693 57.345 73.88
##                       X2114  X2115   X2211  X2212  X2213  X2214  X2215
## 756.55692\t_\t21.459  43.67 38.924  44.507 34.060 44.822 34.532 33.375
## 759.64089\t_\t22.003  10.30  5.777   9.392  9.189  8.448  8.141  9.898
## 807.57778\t_\t21.597  13.29 11.853  13.259 14.697 12.376 10.512 14.233
## 783.57649\t_\t21.621  37.27 36.039  39.974 50.265 38.171 29.468 45.278
## 834.60559\t_\t21.861  10.51 10.063  11.871 10.906 10.557  8.314 11.894
## 813.68440\t_\t22.221 118.93 85.403 105.167 96.083 99.385 94.045 97.955
# Only analyzing positive mode data (the majority)

Data imported successfully. Group numbers correspond to the following

X1 X2 X3 X4

First, I will make an expression set for the data. Expression sets are a convenient class of data storage for genomic data, and other large data sets, as it can hold the large data frame info along with co-variate information (pData).


# First make expression set to store factor data.
CreateExpressionSet <- function(expr, title) {
    b <- grepl(".1...", names(expr))  # race
    m <- grepl("..1..", names(expr))  # gender
    s <- grepl("...0.", names(expr))  # smoker

    pData <- data.frame(race = factor(b, levels = c(TRUE, FALSE), labels = c("Black", 
        "White")), gender = factor(m, levels = c(TRUE, FALSE), labels = c("Male", 
        "Female")), smoker = factor(s, levels = c(TRUE, FALSE), labels = c("Smoker", 
        "Non-smoker")), row.names = names(expr))

    metadata <- data.frame(labelDescription = c("Metasubject race", "Metasubject gender", 
        "Metasubject smoking status"), row.names = c("race", "gender", "smoker"))

    phenoData <- new("AnnotatedDataFrame", data = pData, varMetadata = metadata)

    ExpressionSet(assayData = as.matrix(expr), phenoData = phenoData)
}

ds <- list()

ds$raw$title <- "Raw"
ds$raw$pos <- CreateExpressionSet(pos)

save(ds, file = "../data/metabolite.RData")
summary((matrix(exprs(ds$raw$pos))))
##        V1       
##  Min.   :  0.0  
##  1st Qu.:  0.2  
##  Median :  0.4  
##  Mean   :  3.5  
##  3rd Qu.:  1.4  
##  Max.   :517.6

sum(matrix(exprs(ds$raw$pos)) == 0)
## [1] 1129

PlotSets <- function(set) {
    boxplot(exprs(set$pos), main = paste(set$title, "positive ions"))
}

PlotSets(ds$raw)

plot of chunk unnamed-chunk-2



exprs(ds$raw$pos)[1:5, 1:5]
##                       X1101 X1102  X1103 X1104  X1201
## 756.55692\t_\t21.459 30.030 28.86 35.390 37.07 38.115
## 759.64089\t_\t22.003  8.576 10.69  8.899 10.16  8.242
## 807.57778\t_\t21.597 18.335 17.63 13.978 20.22 15.279
## 783.57649\t_\t21.621 55.250 58.21 38.004 47.40 39.860
## 834.60559\t_\t21.861 14.166 12.87 11.378 17.25 13.694
pca <- prcomp(exprs(ds$raw$pos))

ggbiplot(pca)

plot of chunk unnamed-chunk-2

heatmap(as.matrix(pos), labRow = FALSE)

plot of chunk unnamed-chunk-2

Next, I will try an unsupervised method for looking for trends. Clustering should show trends in our data. I will use the PAM (Partitioning Around Medoids) method, which uses medoids, or central features, to group closest features according to the distance matrix:

# Clustering One-minus-correlation distance matrix
r <- cor(as.matrix(pos))
dPos <- 1 - r
dPos[1:5, 1:5]
##         X1101   X1102  X1103   X1104   X1201
## X1101 0.00000 0.08164 0.2791 0.15098 0.20865
## X1102 0.08164 0.00000 0.1296 0.07306 0.06823
## X1103 0.27911 0.12959 0.0000 0.12648 0.03750
## X1104 0.15098 0.07306 0.1265 0.00000 0.08729
## X1201 0.20865 0.06823 0.0375 0.08729 0.00000


# PAM clustering PAM, K=2 - Two medoids
pamK2 <- pam(as.dist(dPos), k = 2, diss = TRUE)
pamK2
## Medoids:
##      ID          
## [1,] "24" "X1213"
## [2,] "9"  "X1205"
## Clustering vector:
## X1101 X1102 X1103 X1104 X1201 X1202 X1203 X1204 X1205 X2101 X2102 X2103 
##     1     2     2     2     2     1     2     2     2     1     1     1 
## X2104 X2201 X2202 X2203 X1111 X1112 X1113 X1114 X1115 X1211 X1212 X1213 
##     1     2     1     2     1     1     1     1     2     1     1     1 
## X1214 X2111 X2112 X2113 X2114 X2115 X2211 X2212 X2213 X2214 X2215 
##     2     2     1     1     2     2     2     2     1     2     2 
## Objective function:
##   build    swap 
## 0.05962 0.05594 
## 
## Available components:
## [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
## [6] "clusinfo"   "silinfo"    "diss"       "call"
# PAM, K=3 - Three medoids
pamK3 <- pam(as.dist(dPos), k = 3, diss = TRUE)

# Graphical summaries

clusplot(dPos, pamK2$clustering, diss = TRUE, labels = 3, col.p = 1, main = "Lipidomic data, positive mode bivariate cluster plot for PAM \n K=2 clusters, one-minus-correlation distance")

plot of chunk unnamed-chunk-3

plot(pamK2, which.plots = 2, main = "Lipidomic data, positive mode. Silhouette plot for PAM \n K=2 clusters, one-minus-correlation distance")

plot of chunk unnamed-chunk-3

No great classifying ability is seen with two medoids. From the silouette plots, we see that the variables in the thousands, hundreds, and tens do not seem to be grouped in any pattern.

Now I try three medoids:

clusplot(dPos, pamK3$clustering, diss = TRUE, labels = 3, col.p = 1, main = "Lipidomic data, positive mode bivariate cluster plot for PAM \n K=3 clusters, one-minus-correlation distance")

plot of chunk unnamed-chunk-4

plot(pamK3, which.plots = 2, main = "Lipidomic data, positive mode. Silhouette plot for PAM \n K=2 clusters, one-minus-correlation distance")

plot of chunk unnamed-chunk-4


table(pamK2$clustering, colnames(pos))
##    
##     X1101 X1102 X1103 X1104 X1111 X1112 X1113 X1114 X1115 X1201 X1202
##   1     1     0     0     0     1     1     1     1     0     0     1
##   2     0     1     1     1     0     0     0     0     1     1     0
##    
##     X1203 X1204 X1205 X1211 X1212 X1213 X1214 X2101 X2102 X2103 X2104
##   1     0     0     0     1     1     1     0     1     1     1     1
##   2     1     1     1     0     0     0     1     0     0     0     0
##    
##     X2111 X2112 X2113 X2114 X2115 X2201 X2202 X2203 X2211 X2212 X2213
##   1     0     1     1     0     0     0     1     0     0     0     1
##   2     1     0     0     1     1     1     0     1     1     1     0
##    
##     X2214 X2215
##   1     0     0
##   2     1     1
table(pamK3$clustering, colnames(pos))
##    
##     X1101 X1102 X1103 X1104 X1111 X1112 X1113 X1114 X1115 X1201 X1202
##   1     1     0     0     0     1     1     1     1     0     0     1
##   2     0     1     0     1     0     0     0     0     1     0     0
##   3     0     0     1     0     0     0     0     0     0     1     0
##    
##     X1203 X1204 X1205 X1211 X1212 X1213 X1214 X2101 X2102 X2103 X2104
##   1     0     0     0     1     1     1     0     1     1     1     1
##   2     1     1     1     0     0     0     1     0     0     0     0
##   3     0     0     0     0     0     0     0     0     0     0     0
##    
##     X2111 X2112 X2113 X2114 X2115 X2201 X2202 X2203 X2211 X2212 X2213
##   1     0     1     1     0     0     0     1     0     0     0     0
##   2     1     0     0     1     0     1     0     0     1     1     0
##   3     0     0     0     0     1     0     0     1     0     0     1
##    
##     X2214 X2215
##   1     0     0
##   2     1     1
##   3     0     0

Still not great clustering, but can see four out of five groups (with the same race, gender, smoking status covariates) are often clustered together. With 3000+ variables, variable picking would be a good idea.

Since most of the data is in the negative matrix- I will limit analysis to the positive ion data.

#LASSO

lasso <- function(set, fct) {
    x <- t(exprs(set))
    y <- phenoData(set)[[fct]]
    glmmod <- glmnet(x = x, y = y, alpha = 1, family = "binomial")
    plot(glmmod, xvar = "lambda")

    cv.glmmod <- cv.glmnet(x = x, y = y, alpha = 1, family = "binomial")
    plot(cv.glmmod)
    best_lambda <- cv.glmmod$lambda.min
    print(best_lambda)

    beta <- as.vector(t(coef(cv.glmmod, s = "lambda.min")))
    print(which(beta != 0))
}
layout(matrix(1:2, 1, 2, byrow = TRUE))
lassoGender <- lasso(ds$raw$pos, "gender")

plot of chunk unnamed-chunk-5

## [1] 0.1466
## [1]    1  678  820  992 1347 2496
str(lassoGender)
##  int [1:6] 1 678 820 992 1347 2496
lGenderVars <- lassoGender - 1
heatmap(as.matrix(pos[lGenderVars, ]))
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9
## Warning: font width unknown for character 0x9

plot of chunk unnamed-chunk-5



lassoRace <- lasso(ds$raw$pos, "race")

plot of chunk unnamed-chunk-5

## [1] 0.0142
##  [1]    1  205  292  317  408  729 1015 1023 1102 1318 1447 1621 2050 2203
## [15] 2592 2675 2807
lRaceVars <- lassoRace - 1
heatmap(as.matrix(pos[lRaceVars, ]), labRow = FALSE)

plot of chunk unnamed-chunk-5

`?`(heatmap)

lassoSmoker <- lasso(ds$raw$pos, "smoker")

plot of chunk unnamed-chunk-5

## [1] 0.00499
##  [1]    1  103  168  222  397  448  631  744  966 1236 1876 1895 2242 2304
## [15] 2347 2680 2768 2852
lSmokeVars <- lassoSmoker - 1
heatmap(as.matrix(pos[lSmokeVars, ]), labRow = FALSE)

plot of chunk unnamed-chunk-5


Discussion

Among the plethora of different techniques for this kind of data, it can be difficult to decide which ones are most appropriate. By trying a few techniques, especially while a beginner, one can convince themselves of the semi-reproducibility of significant features. In other words, when trying to determine the most discriminating variables. These were consistent with the variables previous analyses determined with ANOVA and BH correction.

This can be interpretted that these methods find the similar differentiating variables, thus somewhat validating this type of analysis. Further studies are needed to see if these features may be causal. We plan to analyze serum samples from colon cancer patients and controls. The samples were taken approximately 20 years before disease status was known, so if we find differentiating features in these samples, biomarkers for cancer screening are possible.