Supervised and Unsupervised Statistical Learning

PER BROBERG
27NOV2014

Introduction

  • In empirical sciences we try to learn from data
  • Statistics provides tools for doing so
  • We either know the type (e.g. healthy/diseased) of our samples and want to describe the types or classify new samples
    • supervised methods
  • Or, we do not know the types of our samples and want to discover structure
    • unsupervised methods

Learning outcomes

  • Define Multivariate Analysis
  • Recognize the value of Principal Components Analysis and Multidimensional Scaling
  • Be introduced to Linear Discriminant Analysis and Logistic Regression
  • See examples of how the free statistical software R might help
  • Do check this software out, and try the code enclosed

Preliminaries (1)

  • Here we will make use of continuous data, as opposed to discrete data.
  • Normal distribution

    • A commonly used distribution that is defined through its mean and variance (or \( STD = \sqrt{VAR} \))
    • Means of samples often roughly follow a normal distribution (the Central Limit Theorem)

[Figure: the normal distribution]
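A minimal simulation sketch of that second bullet:

### A sketch: means of skewed samples still look roughly normal ###
set.seed(1)
means <- replicate(1000, mean(rexp(30)))  # 1000 means of 30 exponential draws
hist(means, breaks = 30, main = "Sample means are roughly normal")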

Preliminaries (2)

  • (Pearson) Correlation
    • Measures degree of linear relation between variables (\( -1 \le \rho \le 1 \))
    • Measures similarity in a particular sense
    • May be turned into a dissimilarity through \( dis = 1-\rho^{2} \).

[Figure: scatterplots illustrating correlation]
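In R, a correlation and the derived dissimilarity might be computed as follows (a sketch with simulated data):

### A sketch: Pearson correlation and the dissimilarity 1 - rho^2 ###
set.seed(2)
x <- rnorm(100)
y <- x + rnorm(100)   # y linearly related to x, plus noise
rho <- cor(x, y)      # Pearson correlation
c(rho = rho, dis = 1 - rho^2)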

Information overload

  • Two quotations in The Elements of Statistical Learning
    • “We are drowning in information and starving for knowledge” (Rutherford D. Roger)
    • “The quiet statisticians have changed our world; not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions” (Ian Hacking)

Multivariate Analysis (MVA)

  • MVA: observation and analysis of more than one outcome variable
  • Enables us to work with dependencies between variables
  • Distinguish from multivariable, as in multivariable regression
    • one outcome, several explanatory variables

We will look at the Iris dataset

  • This famous iris data set gives the measurements in centimeters of the variables sepal length (SL) and width (SW) and petal length (PL) and width (PW), respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
  • The first few lines look like the following
   SL  SW  PL  PW Species
1 5.1 3.5 1.4 0.2  setosa
2 4.9 3.0 1.4 0.2  setosa
3 4.7 3.2 1.3 0.2  setosa
4 4.6 3.1 1.5 0.2  setosa
5 5.0 3.6 1.4 0.2  setosa
6 5.4 3.9 1.7 0.4  setosa

A picture is worth a thousand words

  • The Iris dataset exhibits some interesting multivariate patterns
  • With only four variables it is quite possible to analyse the individual variables in depth

[Figure: pairwise scatterplots of the Iris variables]
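A view of this kind might be produced with pairs (a sketch, not necessarily the original chunk):

### A sketch: all pairwise scatterplots, coloured by species ###
pairs(iris[, 1:4], col = c("red", "green3", "blue")[iris$Species],
      main = "Iris data")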

Let's look at a high-dimensional example

[Image: the paper by Golub et al. (1999) in Science]

A well-known paper from Science will serve as illustration. 38 leukemia samples, 11 of AML type and 27 of ALL type, have been analysed on a microarray platform.

Microarrays

[Image: a microarray chip]

Through gene expression profiling, the expression levels of thousands of genes are monitored simultaneously. Each so-called probe on the chip assays the level of a particular mRNA sequence.

What is R ?

  • R is a free software environment for statistical computing and graphics.
  • A Graphical User Interface called RStudio makes it easier (but not always easy) to use
  • Suggestion:
    • Install R
    • Install the latest version of RStudio

Now load the gene expression data

### load packages with data and tools that we need ###
library("SAGx")
library("multtest")
data(golub)
dim(golub) # 3051 rows (probes) and 38 columns (samples)
[1] 3051   38

More detail

head(golub[, 1:4], n = 5) ### show four columns and five rows ###
         [,1]     [,2]     [,3]     [,4]
[1,] -1.45769 -1.39420 -1.42779 -1.40715
[2,] -0.75161 -1.26278 -0.09052 -0.99596
[3,]  0.45695 -0.09654  0.90325 -0.07194
[4,]  3.13533  0.21415  2.08754  2.23467
[5,]  2.76569 -1.27045  1.60433  1.53182
table(golub.cl)  ### the vector golub.cl gives the class of samples ###
golub.cl
 0  1 
27 11 

What do we know ?

  • There are 3051 probes (variables)
  • The 38 samples fall into two groups of sizes 27 (ALL) and 11 (AML)
  • The article found a way to discriminate between ALL and AML
  • But the dataset is too large to be analysed manually. What can we do ?

Idea

  • Show the data in two dimensions with the samples colour coded: ALL = yellow, AML = blue
  • The original data has dimension 3051 (= number of probes)
  • How do we reduce the number of dimensions to two ?
  • We could select two probes, skip the rest, and plot the data
  • But that would waste a lot of information…
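As a sketch of that naive approach (not the original code), using the first two probes:

### A sketch: plot only the first two probes ###
mycolors <- c("blue", "gold")   # AML = blue, ALL = gold
plot(golub[1, ], golub[2, ], pch = 19, col = mycolors[(golub.cl == 0) + 1],
     xlab = "probe 1", ylab = "probe 2")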

Seeing a teapot (1)

[Image: a rendered teapot]

Borrowed from James X. Li on YouTube https://www.youtube.com/watch?v=BfTMmoDFXyE

Seeing a teapot (2)

[Image: the teapot from another angle]

Some angles give a better view than others

Seeing a teapot (3)

[Image: the teapot from another angle]

Basically, we want to catch a glimpse of distinctive features

Seeing a teapot (4)

[Image: the teapot from another angle]

Seeing a teapot (5)

[Image: the teapot from another angle]

Seeing a teapot (6)

[Image: the teapot from another angle]

Seeing a teapot (7)

[Image: the teapot from another angle]

Seeing a teapot (8)

[Image: the teapot from another angle]

Seeing a teapot (9)

[Image: the teapot from another angle]

Seeing a teapot (10)

[Image: the teapot from another angle]

Seeing a teapot (11)

[Image: the teapot from another angle]

Principal Components Analysis (PCA)

  • The basic idea is to combine the given dimensions (3051) into new ones such that the first new one picks up as much information as possible
  • The second new dimension will then optimally capture the remaining information, and at the same time be independent of (orthogonal to) the first
  • And so forth, to define principal components PC1, PC2, etc.

PCA (cont. I)

  • Actually, there cannot be more than \( \min(3051, 38) = 38 \) independent dimensions in the data (this follows from linear algebra, beyond the scope of this presentation)
  • Dimensions where there is no variation provide no information
  • Dimensions that change a lot provide information
  • The first principal component (PC) is found by drawing a line in space in the direction of the largest variation

PCA (cont. II)

  • PCA does not make use of class information
    • It is unsupervised
    • It allows us to discover patterns in data
    • If we want to model differences between healthy and diseased we need a supervised method

Preparation

  • Turn the dataset so that variables are in columns
  • Subtract mean and divide by standard deviation to make variables comparable
    • Centre and scale
    • For instance the measurement units (Fahrenheit or Celsius) should not change conclusions
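In R the scale function does this; a minimal sketch:

### A sketch: centre and scale columns with scale() ###
x <- matrix(rnorm(20, mean = 5, sd = 2), ncol = 2)
xs <- scale(x)              # subtract column means, divide by column SDs
round(colMeans(xs), 3)      # approximately 0
round(apply(xs, 2, sd), 3)  # exactly 1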

A first plot based on PC1 and PC2

pca.golub <- prcomp(t(golub), center = TRUE, scale. = TRUE)  # samples in rows
mycolors <- c("blue", "gold")        # AML = blue, ALL = gold
colors <- (golub.cl == 0) + 1        # 2 for ALL samples, 1 for AML
plot(pca.golub$x, pch = 25, col = mycolors[colors], main = "Figure 1")

[Figure 1: the samples plotted on PC1 and PC2]

Summary of PCA

Display summary information about the first three PCs

round(summary(pca.golub)$importance[,1:3], digits = 3)
                          PC1    PC2    PC3
Standard deviation     21.796 16.933 14.283
Proportion of Variance  0.156  0.094  0.067
Cumulative Proportion   0.156  0.250  0.317

How much information is captured in the first two PCs ?

We see that PC1 already picks up much more of the variance than PC2

### The variances of the PCs are related to the so-called eigenvalues ###
plot(pca.golub) 

[Figure: variances of the principal components]

PCA in Respiratory Research

  • Non-smokers (triangles), Healthy Smokers (circles), Chronic Obstructive Pulmonary Disease (squares)

[Image: PCA plot of the three groups]

  • Gene expression of 122 oxidative stress related genes

Your turn (1), First discuss with your neighbour

  • 1. Why MVA ?
    • a) Modern high-throughput technologies require special statistical methods
    • b) The signal-to-noise ratio increases after reduction of dimensions
    • c) Correlated data can only be assessed through MVA
  • 2. Which is true regarding PCA
    • a) PCA requires normal distribution
    • b) PCA tends to keep the gross features of data
    • c) PCA is a method for testing hypotheses

Your turn (2), First discuss with your neighbour

  • 1. What is the percentage of variance explained in Fig. 1 ?
    • a) 15%
    • b) 25%
    • c) 40%
  • 2. Which is true regarding PCA
    • a) PCA assumes that variation equals information
    • b) PCA can find as many dimensions as you like
    • c) The first dimension is the most important

Further uses of MVA

The mapping of observations to the plane could use

  • Correlations, or other measures of similarity
  • Dissimilarities. But the two are related
    • Example: Dissimilarity = 1 - Similarity
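For example, with the golub data loaded earlier (a sketch):

### A sketch: turn correlations into dissimilarities ###
sim <- cor(t(golub[1:5, ]))  # correlations between five probes
dis <- 1 - sim               # dissimilarity = 1 - similarity
round(dis, 2)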

Example of Dissimilarity

  • Distances
### Road distances in Europe (km) ###
### Create a distance matrix from a distance object ###
distances <- as.matrix(eurodist)
### Show the first four rows and columns ###
head(distances[,1:4], n = 4)
          Athens Barcelona Brussels Calais
Athens         0      3313     2963   3175
Barcelona   3313         0     1318   1326
Brussels    2963      1318        0    204
Calais      3175      1326      204      0

A matrix is a set of numbers laid out in rows and columns

Multidimensional Scaling (MDS)

We may visualise the eurodist data with MDS

The R function cmdscale performs classical MDS

euro.mds <- cmdscale(eurodist)
### Use the first two dimensions ###
Dim1 <- euro.mds[, 1]
Dim2 <- euro.mds[, 2]

MDS (continued)

plot(Dim1, Dim2, type = "n", xlab = "", ylab = "", main = "cmdscale(eurodist)")
segments(-1500, 0, 1500, 0, lty = "dotted")   # horizontal reference line
segments(0, -1500, 0, 1500, lty = "dotted")   # vertical reference line
text(Dim1, Dim2, rownames(euro.mds), cex = 0.8, col = "red")

[Figure: two-dimensional MDS map of the European cities]

Example of Biological Dissimilarities

  • Take two DNA sequences:
    • ATGCGTTAAATTGGCG-CCGAATAT
    • ATG-TTTCGATAGGCGTCTGAATAT

Dissimilarity between DNA sequences

  • Hamming distance = Number of differences
  • p-distance
  • Jukes-Cantor nucleotide distance

[Image: nucleotide distance formulas]
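A sketch of the first two distances in R, counting the alignment gaps as differences (an assumption):

### A sketch: Hamming and p-distance for the aligned sequences ###
s1 <- strsplit("ATGCGTTAAATTGGCG-CCGAATAT", "")[[1]]
s2 <- strsplit("ATG-TTTCGATAGGCGTCTGAATAT", "")[[1]]
sum(s1 != s2)   # Hamming distance: number of differing positions
mean(s1 != s2)  # p-distance: proportion of differing positions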

Linear Discriminant Analysis (LDA)

  • Supervised method that finds dimensions that separate given groups (e.g. Diseased - Healthy) as well as possible
  • Works best for normally distributed data
  • This gives a model to classify new observations
  • Diagnostic tool
  • Predicts a new object as belonging to the class which is closest in terms of a weighted (Mahalanobis) distance; see the sketch below
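A minimal sketch of LDA in R, using lda from the MASS package on the Iris data:

### A sketch: LDA on the Iris data ###
library(MASS)
fit.lda <- lda(Species ~ ., data = iris)      # fit on all four measurements
pred <- predict(fit.lda)$class                # predicted classes
table(predicted = pred, true = iris$Species)  # confusion table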

PCA: points projected to PC1

[Figure: points projected onto PC1]

LDA: points projected to discriminant

[Figure: points projected onto the discriminant direction]

PCA vs LDA

  • Look at these two plots
    • The percentage of between-group variance explained is given within brackets
    • Which method works best? Ref

[Figure: PCA and LDA projections compared]

Logistic Regression (LR)

  • In order to separate two groups, model the probability of belonging to one class
    • Model: \( \text{log odds} = \ln(P/(1-P)) = \mu + \beta_{1}x_{1} + \beta_{2}x_{2} + \ldots \)
  • Build a model distinguishing between Virginica and Versicolor using Petal Width and Length. Versicolor in blue. The decision boundary, where \( P = 0.5 \), is indicated; a sketch of the fit follows after the figure.

[Figure: Virginica vs Versicolor with the logistic decision boundary]
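Such a model might be fitted with glm (a sketch, not necessarily the original chunk):

### A sketch: logistic regression, Virginica vs Versicolor ###
iris2 <- droplevels(subset(iris, Species != "setosa"))
fit.lr <- glm(Species ~ Petal.Width + Petal.Length,
              data = iris2, family = binomial)
coef(fit.lr)  # mu, beta1 and beta2 on the log-odds scale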

LR performance

  • Benchmark classifier
  • Related to Neural Networks

Sensitivity and Specificity

  • Classify as Virginica if \( P > Cutoff \); a computational sketch follows after the figure.
    • Sensitivity: frequency of predicting a virginica (case) correctly
    • Specificity: frequency of predicting a versicolor (non-case) correctly
  • Ref

[Figure: sensitivity and specificity]
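Continuing the glm sketch above, sensitivity and specificity at cutoff 0.5 might be computed as:

### A sketch: sensitivity and specificity at cutoff 0.5 ###
p <- predict(fit.lr, type = "response")  # fitted P(virginica)
pred <- ifelse(p > 0.5, "virginica", "versicolor")
tab <- table(pred, true = iris2$Species)
c(sensitivity = tab["virginica", "virginica"] / sum(tab[, "virginica"]),
  specificity = tab["versicolor", "versicolor"] / sum(tab[, "versicolor"]))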

Caveat

  • Risk of overfitting
    • Results may fit our dataset quite well
    • But our data may not be representative
  • Model has to be validated in external data (new cohort, study, experiment)

Your turn (3a), First discuss with your neighbour

  • 1. What is similar between PCA and LDA ?
    • a) Both assume normal distribution
    • b) Both may reduce the number of dimensions
    • c) Neither
  • 2. What is fundamentally different in terms of purpose ?
    • a) PCA does not assume any particular distribution
    • b) LDA assumes normal distribution
    • c) LDA assumes knowledge of the category of each observation

Your turn (3b), First discuss with your neighbour

  • 3. What is the difference between LDA and LR ?
    • a) LDA assumes normal distribution
    • b) LR models probability while LDA uses distance
    • c) LR assumes two groups

Software

  • R
  • SPSS
  • SAS
  • Excel (some)
  • Matlab
  • Qlucore Omics Explorer

Summary

  • Methods fall into supervised and unsupervised
  • PCA offers a model-free overview based on correlation (unsupervised)
  • MDS translates general distances into a map (unsupervised)
  • LDA builds a predictor using known class labels, assuming normal distributions, and classifies by the distance to class centroids (supervised)
  • Logistic regression models the probability of belonging to one group rather than another by assuming a linear relationship between log-odds and predictive variables (supervised)

Piece of advice for Researchers

  • Talk to a Statistician before you start your research
  • See how others in your field have presented data
    • Put emphasis on high impact publications when you look for inspiration

Post scriptum (1)

  • Distance. Euclidean distance from \( (0,0) \) to \( (x, y) \) is \( \sqrt{x^{2}+y^{2}} \).
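For example:

### A quick check in R ###
sqrt(3^2 + 4^2)                # 5
dist(rbind(c(0, 0), c(3, 4)))  # the same, via dist()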

Post scriptum (2)

  • How to install Bioconductor
    • Follow instructions here
    • To install an individual package use the function biocLite in the BiocInstaller package, e.g. biocLite("SAGx")