Cross-Validation

Lucas Schiffer
February 21, 2017

Data Analysis for the Life Sciences

Overview

  • Foundations of Cross-Validation
  • Rubber Ducks Example
  • Gene Expression / Tissue Example
  • Exercises

Foundations of Cross-Validation

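  • Machine learning technique for estimating how well a model predicts unseen data
  • Split the dataset into \( N \) folds
  • Train on \( N-1 \) folds and test on the held-out fold, \( N \) times
  • Repeat for each of \( M \) candidate values of a tuning parameter to find the best one
  • Requires \( N \times M \) train/test iterations in total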
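
The fold bookkeeping behind these bullets can be sketched in base R. This is a minimal illustration, not the method used later in the slides: make_folds() is a hypothetical helper, and the sizes (20 observations, 5 folds, 3 candidate values) are arbitrary assumptions chosen to keep the numbers small.

```r
# Hypothetical helper: randomly assign n observations to n_folds folds
# and return a list of held-out index vectors, one per fold
make_folds <- function(n, n_folds) {
  fold_id <- sample(rep_len(seq_len(n_folds), n))
  split(seq_len(n), fold_id)
}

set.seed(1)
n <- 20        # number of observations (illustrative)
n_folds <- 5   # N in the slides
folds <- make_folds(n, n_folds)

# Each iteration holds out one fold and trains on the other N - 1 folds
train_sizes <- sapply(folds, function(test_idx) {
  train_idx <- setdiff(seq_len(n), test_idx)
  length(train_idx)
})

# Across the N iterations, every observation is held out exactly once
held_out <- sort(unlist(folds, use.names = FALSE))

n_candidates <- 3                        # M in the slides
n_iterations <- n_folds * n_candidates   # N x M train/test iterations
```

Because each observation lands in exactly one held-out fold, every data point contributes to the error estimate once, and never as part of the training set for its own prediction.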

Rubber Ducks Example

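library(magrittr)  # provides %>% and %<>%
library(caret)     # provides createFolds()

# Six ducks, five of them yellow
ducks <- c("yellow", "yellow", "red", "yellow", "yellow", "yellow")
# Candidate colors (the values compared during cross-validation)
colors <- c("red", "green", "blue", "cyan", "magenta", "yellow")

# Replicate the sample 50 times (300 ducks in total)
ducks %<>%
  rep(50)

# N: 10 folds of held-out indices into ducks
N <-
  ducks %>%
  createFolds(10)

# M: the candidate values to compare
M <-
  colors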
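
One plausible way to carry the duck example through is to treat each candidate color as a constant guess and score it on every held-out fold, which gives the \( N \times M \) grid of results. The sketch below is an assumption about where the example is heading, not the presenter's code: it uses a base-R stand-in for createFolds() so that it runs without caret, and the names fold_id, folds, error, cv_error and best are all illustrative.

```r
ducks <- rep(c("yellow", "yellow", "red", "yellow", "yellow", "yellow"), 50)
colors <- c("red", "green", "blue", "cyan", "magenta", "yellow")

set.seed(1)
# Stand-in for createFolds(): 10 random folds of held-out indices
fold_id <- sample(rep_len(1:10, length(ducks)))
folds <- split(seq_along(ducks), fold_id)

# Rows are folds (N = 10), columns are candidate colors (M = 6);
# each cell is the misclassification rate of always guessing that color
error <- sapply(colors, function(guess) {
  sapply(folds, function(test_idx) {
    mean(ducks[test_idx] != guess)
  })
})

# Average over folds and pick the candidate with the lowest error
cv_error <- colMeans(error)
best <- names(which.min(cv_error))
```

A constant guess needs no training step, so the training folds go unused here; with a real model, each cell of the grid would involve fitting on the other nine folds first.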

Gene Expression / Tissue Example

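library(tissuesGeneExpression)  # provides e (expression matrix) and tissue (labels)
library(caret)                  # provides createFolds()

data(tissuesGeneExpression)
ind <- which(tissue != "placenta")  # drop the placenta samples
y <- tissue[ind]                    # tissue labels (the outcome)
X <- t(e[, ind])                    # transpose so that samples are in rows
set.seed(1)
idx <- createFolds(y, k = 10)       # 10 folds of held-out sample indices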

Gene Expression / Tissue Example

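library(class)  # provides knn()

# Xsmall is not defined earlier in these slides; assuming a two-dimensional
# MDS projection of X (the cmdscale() default) for this first run
Xsmall <- cmdscale(dist(X))
set.seed(1)
ks <- 1:12
# For each k, average the misclassification rate over the 10 folds
res <- sapply(ks, function(k) {
    res.k <- sapply(seq_along(idx), function(i) {
        pred <- knn(train = Xsmall[-idx[[i]], ],
                    test = Xsmall[idx[[i]], ],
                    cl = y[-idx[[i]]], k = k)
        mean(y[idx[[i]]] != pred)
    })
    mean(res.k)
})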

Gene Expression / Tissue Example

# Repeat the cross-validation on a five-dimensional MDS projection of X
Xsmall <- cmdscale(dist(X), k = 5)
set.seed(1)
ks <- 1:12
res <- sapply(ks, function(k) {
    res.k <- sapply(seq_along(idx), function(i) {
        pred <- knn(train = Xsmall[-idx[[i]], ],
                    test = Xsmall[idx[[i]], ],
                    cl = y[-idx[[i]]], k = k)
        mean(y[idx[[i]]] != pred)
    })
    mean(res.k)
})
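
Once the error for each k has been computed, the remaining step is to pick the k with the lowest cross-validated error. The sketch below shows that step on R's built-in iris data so that it runs without the tissuesGeneExpression package. The iris substitution, the base-R fold construction, and the names X_demo, y_demo, idx_demo, res_demo and best_k are all assumptions made for illustration; only knn() from the class package mirrors the slides.

```r
library(class)  # provides knn()

# Stand-in data: four numeric features and a three-level class label
X_demo <- as.matrix(iris[, 1:4])
y_demo <- iris$Species

set.seed(1)
# Base-R stand-in for createFolds(): 10 folds of held-out row indices
fold_id <- sample(rep_len(1:10, nrow(X_demo)))
idx_demo <- split(seq_len(nrow(X_demo)), fold_id)

ks <- 1:12
# For each k, average the misclassification rate over the 10 folds
res_demo <- sapply(ks, function(k) {
  res.k <- sapply(idx_demo, function(test_idx) {
    pred <- knn(train = X_demo[-test_idx, ],
                test = X_demo[test_idx, ],
                cl = y_demo[-test_idx], k = k)
    mean(y_demo[test_idx] != pred)
  })
  mean(res.k)
})

# The k with the lowest cross-validated error (first one in case of ties)
best_k <- ks[which.min(res_demo)]
```

Applied to the slides' own objects, the same last line would read ks[which.min(res)]. Plotting ks against the error vector is a common way to check whether the minimum is sharp or whether several values of k perform about equally well.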

On to the exercises