Supervised and Unsupervised Statistical Learning

PER BROBERG
27NOV2014

Introduction

  • In empirical sciences we try to learn from data
  • Statistics provides tools for doing so
  • We either know the type (e.g. healthy/diseased) of our samples and want to describe the types or classify new samples
    • supervised methods
  • Or, we do not know the types of our samples and want to discover structure
    • unsupervised methods

Learning outcomes

  • Define Multivariate Analysis
  • Recognize the value of Principal Components Analysis and Multidimensional Scaling
  • Be introduced to Linear Discriminant Analysis and Logistic Regression
  • See examples of how the free statistical software R might help
  • Do check this software out, and try the code enclosed

Preliminaries (1)

  • Here we will make use of continuous data, as opposed to discrete data.
  • Normal distribution

    • A commonly used distribution that is defined through its mean and variance (or \( STD = \sqrt{VAR} \))
    • Means of samples often roughly follow a normal distribution (the Central Limit Theorem)

[Figure: the normal distribution]
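A minimal simulation sketch of that second bullet:

### A sketch: means of skewed samples still look roughly normal ###
set.seed(1)
means <- replicate(1000, mean(rexp(30)))  # 1000 means of 30 exponential draws
hist(means, breaks = 30, main = "Sample means are roughly normal")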

Preliminaries (2)

  • (Pearson) Correlation
    • Measures degree of linear relation between variables (\( -1 \le \rho \le 1 \))
    • Measures similarity in a particular sense
    • May be turned into a dissimilarity through \( dis = 1-\rho^{2} \).

[Figure: scatterplots illustrating correlation]
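In R, a correlation and the derived dissimilarity might be computed as follows (a sketch with simulated data):

### A sketch: Pearson correlation and the dissimilarity 1 - rho^2 ###
set.seed(2)
x <- rnorm(100)
y <- x + rnorm(100)   # y linearly related to x, plus noise
rho <- cor(x, y)      # Pearson correlation
c(rho = rho, dis = 1 - rho^2)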

Information overload

  • Two quotations in The Elements of Statistical Learning
    • “We are drowning in information and starving for knowledge” (Rutherford D. Roger)
    • “The quiet statisticians have changed our world; not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions” (Ian Hacking)

Multivariate Analysis (MVA)

  • MVA: observation and analysis of more than one outcome variable
  • Enables us to work with dependencies between variables
  • Distinguish from multivariable, as in multivariable regression
    • one outcome, several explanatory variables

We will look at the Iris dataset

  • This famous iris data set gives the measurements in centimeters of the variables sepal length (SL) and width (SW) and petal length (PL) and width (PW), respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
  • The first few lines look like the following
   SL  SW  PL  PW Species
1 5.1 3.5 1.4 0.2  setosa
2 4.9 3.0 1.4 0.2  setosa
3 4.7 3.2 1.3 0.2  setosa
4 4.6 3.1 1.5 0.2  setosa
5 5.0 3.6 1.4 0.2  setosa
6 5.4 3.9 1.7 0.4  setosa

A picture is worth a thousand words

  • The Iris dataset exhibits some interesting multivariate patterns
  • With only four variables it is quite possible to analyse the individual variables in depth

[Figure: pairwise scatterplots of the Iris variables]
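A view of this kind might be produced with pairs (a sketch, not necessarily the original chunk):

### A sketch: all pairwise scatterplots, coloured by species ###
pairs(iris[, 1:4], col = c("red", "green3", "blue")[iris$Species],
      main = "Iris data")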

Let's look at a high-dimensional example

[Image: the paper by Golub et al. (1999) in Science]

A well-known paper from Science will serve as illustration. 38 leukemia samples, 11 of AML type and 27 of ALL type, have been analysed on a microarray platform.

Microarrays

[Image: a microarray chip]

Through gene expression profiling, the expression levels of thousands of genes are monitored simultaneously. Each so-called probe on the chip assays the level of a particular mRNA sequence.

What is R ?

  • R is a free software environment for statistical computing and graphics.
  • A Graphical User Interface called RStudio makes it easier (but not always easy) to use
  • Suggestion:
    • Install R
    • Install the latest version of RStudio

Now load the gene expression data

### load packages with data and tools that we need ###
library("SAGx")
library("multtest")
data(golub)
dim(golub) # 3051 rows (probes) and 38 columns (samples)
[1] 3051   38

More detail

head(golub[, 1:4], n = 5) ### show four columns and five rows ###
         [,1]     [,2]     [,3]     [,4]
[1,] -1.45769 -1.39420 -1.42779 -1.40715
[2,] -0.75161 -1.26278 -0.09052 -0.99596
[3,]  0.45695 -0.09654  0.90325 -0.07194
[4,]  3.13533  0.21415  2.08754  2.23467
[5,]  2.76569 -1.27045  1.60433  1.53182
table(golub.cl)  ### the vector golub.cl gives the class of samples ###
golub.cl
 0  1 
27 11 

What do we know ?

  • There are 3051 probes (variables)
  • The 38 samples fall into two groups of sizes 27 (ALL) and 11 (AML)
  • The article found a way to discriminate between ALL and AML
  • But the dataset is too large to be analysed manually. What can we do ?

Idea

  • Show the data in two dimensions with the samples colour coded: ALL = yellow, AML = blue
  • The original data has dimension 3051 (= number of probes)
  • How do we reduce the number of dimensions to two ?
  • We could select two probes, skip the rest, and plot the data
  • But that would waste a lot of information…
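As a sketch of that naive approach (not the original code), using the first two probes:

### A sketch: plot only the first two probes ###
mycolors <- c("blue", "gold")   # AML = blue, ALL = gold
plot(golub[1, ], golub[2, ], pch = 19, col = mycolors[(golub.cl == 0) + 1],
     xlab = "probe 1", ylab = "probe 2")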

Seeing a teapot (1)

[Image: a rendered teapot]

Borrowed from James X. Li on YouTube https://www.youtube.com/watch?v=BfTMmoDFXyE

Seeing a teapot (2)

[Image: the teapot from another angle]

Some angles give a better view than others

Seeing a teapot (3)

[Image: the teapot from another angle]

Basically, we want to catch a glimpse of distinctive features

Seeing a teapot (4)

[Image: the teapot from another angle]

Seeing a teapot (5)

[Image: the teapot from another angle]

Seeing a teapot (6)

[Image: the teapot from another angle]

Seeing a teapot (7)

[Image: the teapot from another angle]

Seeing a teapot (8)

[Image: the teapot from another angle]

Seeing a teapot (9)

[Image: the teapot from another angle]

Seeing a teapot (10)

[Image: the teapot from another angle]

Seeing a teapot (11)

[Image: the teapot from another angle]

Principal Components Analysis (PCA)

  • The basic idea is to combine the given dimensions (3051) into new ones such that the first new one picks up as much information as possible
  • The second new dimension will then optimally capture the remaining information, and at the same time be independent of (orthogonal to) the first
  • And so forth, to define principal components PC1, PC2, etc.

PCA (cont. I)

  • Actually, there cannot be more than \( \min(3051, 38) = 38 \) independent dimensions in the data (this follows from linear algebra, beyond the scope of this presentation)
  • Dimensions where there is no variation provide no information
  • Dimensions that change a lot provide information
  • The first principal component (PC) is found by drawing a line in space in the direction of the largest variation

PCA (cont. II)

  • PCA does not make use of class information
    • It is unsupervised
    • It allows us to discover patterns in data
    • If we want to model differences between healthy and diseased we need a supervised method

Preparation

  • Turn the dataset so that variables are in columns
  • Subtract mean and divide by standard deviation to make variables comparable
    • Centre and scale
    • For instance the measurement units (Fahrenheit or Celsius) should not change conclusions
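In R the scale function does this; a minimal sketch:

### A sketch: centre and scale columns with scale() ###
x <- matrix(rnorm(20, mean = 5, sd = 2), ncol = 2)
xs <- scale(x)              # subtract column means, divide by column SDs
round(colMeans(xs), 3)      # approximately 0
round(apply(xs, 2, sd), 3)  # exactly 1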

A first plot based on PC1 and PC2

pca.golub <- prcomp(t(golub), center = TRUE, scale. = TRUE)  # samples in rows
mycolors <- c("blue", "gold")        # AML = blue, ALL = gold
colors <- (golub.cl == 0) + 1        # 2 for ALL samples, 1 for AML
plot(pca.golub$x, pch = 25, col = mycolors[colors], main = "Figure 1")

[Figure 1: the samples plotted on PC1 and PC2]

Summary of PCA

Display summary information about the first three PCs

round(summary(pca.golub)$importance[,1:3], digits = 3)
                          PC1    PC2    PC3
Standard deviation     21.796 16.933 14.283
Proportion of Variance  0.156  0.094  0.067
Cumulative Proportion   0.156  0.250  0.317

How much information is captured in the first two PCs ?

We see that PC1 already picks up much more of the variance than PC2

### The variances of the PCs are related to the so-called eigenvalues ###
plot(pca.golub) 

[Figure: variances of the principal components]

PCA in Respiratory Research

  • Non-smokers (triangles), Healthy Smokers (circles), Chronic Obstructive Pulmonary Disease (squares)

[Image: PCA plot of the three groups]

  • Gene expression of 122 oxidative stress related genes

Your turn (1), First discuss with your neighbour

  • 1. Why MVA ?
    • a) Modern high-throughput technologies require special statistical methods
    • b) The signal-to-noise ratio increases after reduction of dimensions
    • c) Correlated data can only be assessed through MVA
  • 2. Which is true regarding PCA
    • a) PCA requires normal distribution
    • b) PCA tends to keep the gross features of data
    • c) PCA is a method for testing hypotheses

Your turn (2), First discuss with your neighbour

  • 1. What is the percentage of variance explained in Fig. 1 ?
    • a) 15%
    • b) 25%
    • c) 40%
  • 2. Which is true regarding PCA
    • a) PCA assumes that variation equals information
    • b) PCA can find as many dimensions as you like
    • c) The first dimension is the most important

Further uses of MVA

The mapping of observations to the plane could use

  • Correlations, or other measures of similarity
  • Dissimilarities. But the two are related
    • Example: Dissimilarity = 1 - Similarity
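For example, with the golub data loaded earlier (a sketch):

### A sketch: turn correlations into dissimilarities ###
sim <- cor(t(golub[1:5, ]))  # correlations between five probes
dis <- 1 - sim               # dissimilarity = 1 - similarity
round(dis, 2)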

Example of Dissimilarity

  • Distances
### Road distances in Europe (km) ###
### Create a distance matrix from a distance object ###
distances <- as.matrix(eurodist)
### Show the first four rows and columns ###
head(distances[,1:4], n = 4)
          Athens Barcelona Brussels Calais
Athens         0      3313     2963   3175
Barcelona   3313         0     1318   1326
Brussels    2963      1318        0    204
Calais      3175      1326      204      0

A matrix is a set of numbers laid out in rows and columns

Multidimensional Scaling (MDS)

We may visualise the eurodist data with MDS

The R function cmdscale performs classical MDS

euro.mds <- cmdscale(eurodist)
### Use the first two dimensions ###
Dim1 <- euro.mds[, 1]
Dim2 <- euro.mds[, 2]

MDS (continued)

plot(Dim1, Dim2, type = "n", xlab = "", ylab = "", main = "cmdscale(eurodist)")
segments(-1500, 0, 1500, 0, lty = "dotted")   # horizontal reference line
segments(0, -1500, 0, 1500, lty = "dotted")   # vertical reference line
text(Dim1, Dim2, rownames(euro.mds), cex = 0.8, col = "red")

[Figure: two-dimensional MDS map of the European cities]

Example of Biological Dissimilarities

  • Take two DNA sequences:
    • ATGCGTTAAATTGGCG-CCGAATAT
    • ATG-TTTCGATAGGCGTCTGAATAT

Dissimilarity between DNA sequences

  • Hamming distance = Number of differences
  • p-distance
  • Jukes-Cantor nucleotide distance

[Image: nucleotide distance formulas]
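A sketch of the first two distances in R, counting the alignment gaps as differences (an assumption):

### A sketch: Hamming and p-distance for the aligned sequences ###
s1 <- strsplit("ATGCGTTAAATTGGCG-CCGAATAT", "")[[1]]
s2 <- strsplit("ATG-TTTCGATAGGCGTCTGAATAT", "")[[1]]
sum(s1 != s2)   # Hamming distance: number of differing positions
mean(s1 != s2)  # p-distance: proportion of differing positions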

Linear Discriminant Analysis (LDA)

  • Supervised method that finds dimensions that separate given groups (e.g. Diseased - Healthy) as well as possible
  • Works best for normally distributed data
  • This gives a model to classify new observations
  • Diagnostic tool
  • Predicts a new object as belonging to the class which is closest in terms of a weighted (Mahalanobis) distance; see the sketch below
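A minimal sketch of LDA in R, using lda from the MASS package on the Iris data:

### A sketch: LDA on the Iris data ###
library(MASS)
fit.lda <- lda(Species ~ ., data = iris)      # fit on all four measurements
pred <- predict(fit.lda)$class                # predicted classes
table(predicted = pred, true = iris$Species)  # confusion table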

PCA: points projected to PC1

[Figure: points projected onto PC1]

LDA: points projected to discriminant

[Figure: points projected onto the discriminant direction]

PCA vs LDA

  • Look at these two plots
    • The percentage of between-group variance explained is given within brackets
    • Which method works best? Ref

[Figure: PCA and LDA projections compared]

Logistic Regression (LR)

  • In order to separate two groups, model the probability of belonging to one class
    • Model: \( \text{log odds} = \ln(P/(1-P)) = \mu + \beta_{1}x_{1} + \beta_{2}x_{2} + \ldots \)
  • Build a model distinguishing between Virginica and Versicolor using Petal Width and Length. Versicolor in blue. The decision boundary, where \( P = 0.5 \), is indicated; a sketch of the fit follows after the figure.

[Figure: Virginica vs Versicolor with the logistic decision boundary]
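Such a model might be fitted with glm (a sketch, not necessarily the original chunk):

### A sketch: logistic regression, Virginica vs Versicolor ###
iris2 <- droplevels(subset(iris, Species != "setosa"))
fit.lr <- glm(Species ~ Petal.Width + Petal.Length,
              data = iris2, family = binomial)
coef(fit.lr)  # mu, beta1 and beta2 on the log-odds scale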

LR performance

  • Benchmark classifier
  • Related to Neural Networks

Sensitivity and Specificity

  • Classify as Virginica if \( P > Cutoff \); a computational sketch follows after the figure.
    • Sensitivity: frequency of predicting a virginica (case) correctly
    • Specificity: frequency of predicting a versicolor (non-case) correctly
  • Ref

[Figure: sensitivity and specificity]
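Continuing the glm sketch above, sensitivity and specificity at cutoff 0.5 might be computed as:

### A sketch: sensitivity and specificity at cutoff 0.5 ###
p <- predict(fit.lr, type = "response")  # fitted P(virginica)
pred <- ifelse(p > 0.5, "virginica", "versicolor")
tab <- table(pred, true = iris2$Species)
c(sensitivity = tab["virginica", "virginica"] / sum(tab[, "virginica"]),
  specificity = tab["versicolor", "versicolor"] / sum(tab[, "versicolor"]))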

Caveat

  • Risk of overfitting
    • Results may fit our dataset quite well
    • But our data may not be representative
  • Model has to be validated in external data (new cohort, study, experiment)

Your turn (3a), First discuss with your neighbour

  • 1. What is similar between PCA and LDA ?
    • a) Both assume normal distribution
    • b) Both may reduce the number of dimensions
    • c) Neither
  • 2. What is fundamentally different in terms of purpose ?
    • a) PCA does not assume any particular distribution
    • b) LDA assumes normal distribution
    • c) LDA assumes knowledge of the category of each observation

Your turn (3b), First discuss with your neighbour

  • 3. What is the difference between LDA and LR ?
    • a) LDA assumes normal distribution
    • b) LR models probability while LDA uses distance
    • c) LR assumes two groups

Software

  • R
  • SPSS
  • SAS
  • Excel (some)
  • Matlab
  • Qlucore Omics Explorer

Summary

  • Methods fall into supervised and unsupervised
  • PCA offers a model-free overview based on correlation (unsupervised)
  • MDS translates general distances into a map (unsupervised)
  • LDA builds a predictor using known class labels, assuming normal distributions, and classifies by the distance to class centroids (supervised)
  • Logistic regression models the probability of belonging to one group rather than another by assuming a linear relationship between log-odds and predictive variables (supervised)

Piece of advice for Researchers

  • Talk to a Statistician before you start your research
  • See how others in your field have presented data
    • Put emphasis on high impact publications when you look for inspiration

Post scriptum (1)

  • Distance. Euclidean distance from \( (0,0) \) to \( (x, y) \) is \( \sqrt{x^{2}+y^{2}} \).
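For example:

### A quick check in R ###
sqrt(3^2 + 4^2)                # 5
dist(rbind(c(0, 0), c(3, 4)))  # the same, via dist()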

Post scriptum (2)

  • How to install Bioconductor
    • Follow instructions here
    • To install an individual package use the function biocLite in the BiocInstaller package, e.g. biocLite("SAGx")