PER BROBERG
27NOV2014
Normal distribution
   SL  SW  PL  PW Species
1 5.1 3.5 1.4 0.2  setosa
2 4.9 3.0 1.4 0.2  setosa
3 4.7 3.2 1.3 0.2  setosa
4 4.6 3.1 1.5 0.2  setosa
5 5.0 3.6 1.4 0.2  setosa
6 5.4 3.9 1.7 0.4  setosa
A well-known paper from Science will serve as an illustration. In it, 38 leukemia samples, 11 of AML type and 27 of ALL type, were analysed on a microarray platform.
Gene expression profiling monitors the expression levels of thousands of genes simultaneously. Each so-called probe on the chip assays the level of a particular mRNA sequence.
### load packages with data and tools that we need ###
library("SAGx")
library("multtest")
data(golub)
dim(golub) # 3051 rows (probes) and 38 columns (samples)
[1] 3051 38
head(golub[, 1:4], n = 5) ### show four columns and five rows ###
         [,1]     [,2]     [,3]     [,4]
[1,] -1.45769 -1.39420 -1.42779 -1.40715
[2,] -0.75161 -1.26278 -0.09052 -0.99596
[3,]  0.45695 -0.09654  0.90325 -0.07194
[4,]  3.13533  0.21415  2.08754  2.23467
[5,]  2.76569 -1.27045  1.60433  1.53182
table(golub.cl) ### the vector golub.cl gives the class of each sample (0 = ALL, 1 = AML) ###
golub.cl
 0  1
27 11
Borrowed from James X. Li on YouTube: https://www.youtube.com/watch?v=BfTMmoDFXyE
Some angles give a better view than others.
Basically, we want to catch a glimpse of the distinctive features.
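### Samples as rows; centre and scale each probe before the PCA ###
pca.golub <- prcomp(t(golub), center = TRUE, scale. = TRUE)
mycolors <- c("blue", "gold") # Define plotting colors.
colors <- (golub.cl == 0) + 1 # 2 (gold) for ALL, 1 (blue) for AML
plot(pca.golub$x, pch = 25, col = mycolors[colors], main = "Figure 1") # PC1 vs PC2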
Display summary information about the first three PCs
round(summary(pca.golub)$importance[,1:3], digits = 3)
                          PC1    PC2    PC3
Standard deviation     21.796 16.933 14.283
Proportion of Variance  0.156  0.094  0.067
Cumulative Proportion   0.156  0.250  0.317
We see that PC1 already captures clearly more of the variance than PC2 (15.6% versus 9.4%).
### The variances of the PCs (the squared standard deviations) are the so-called eigenvalues of the correlation matrix ###
plot(pca.golub)
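The link between PC variances and eigenvalues can be checked numerically. The sketch below is only an illustration: it assumes the built-in iris measurements as a small stand-in for the golub data, so that it runs in base R without extra packages, and the names iris.num, pca.iris, pc.var and eig are made up for this example.

```r
### Assumption: iris stands in for golub so the sketch needs only base R ###
iris.num <- iris[, 1:4]

### PCA on centred and scaled variables, as for the golub data ###
pca.iris <- prcomp(iris.num, center = TRUE, scale. = TRUE)

### Variances of the PCs are the squared standard deviations ###
pc.var <- pca.iris$sdev^2

### Eigenvalues of the correlation matrix of the same variables ###
eig <- eigen(cor(iris.num))$values

### The two agree up to numerical precision ###
all.equal(pc.var, eig)

### Proportion of variance explained by each PC ###
prop.var <- pc.var / sum(pc.var)
round(prop.var, 3)
```

The correlation matrix appears here because the variables were scaled; without scaling, the covariance matrix would take its place.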
The mapping of observations to the plane could use
### Road distances in Europe (km) ###
### Create a distance matrix from a distance object ###
distances <- as.matrix(eurodist)
### Show the first four rows and columns ###
head(distances[,1:4], n = 4)
          Athens Barcelona Brussels Calais
Athens         0      3313     2963   3175
Barcelona   3313         0     1318   1326
Brussels    2963      1318        0    204
Calais      3175      1326      204      0
A matrix is a set of numbers laid out in rows and columns.
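A distance matrix has some extra structure, which a few base R calls can confirm. This is a minimal sketch that rebuilds the distances object from eurodist so it runs on its own.

```r
### Rebuild the distance matrix from the built-in eurodist object ###
distances <- as.matrix(eurodist)

### One row and one column per city ###
dim(distances)

### The distance from A to B equals the distance from B to A ###
isSymmetric(distances)

### Every city is at distance zero from itself ###
all(diag(distances) == 0)
```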
We may visualise the eurodist data with multidimensional scaling (MDS).
The R function cmdscale performs classical MDS.
euro.mds <- cmdscale(eurodist)
### Use the first two dimensions ###
Dim1 <- euro.mds[, 1]
Dim2 <- euro.mds[, 2]
plot(Dim1, Dim2, type = "n", xlab = "", ylab = "", main = "cmdscale(eurodist)")
segments(-1500, 0, 1500, 0, lty = "dotted")
segments(0, -1500, 0, 1500, lty="dotted")
text(Dim1, Dim2, rownames(euro.mds), cex=0.8, col="red")
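How faithfully the two-dimensional map reproduces the road distances can be checked roughly. The sketch below asks cmdscale for its goodness-of-fit values (eig = TRUE) and compares map distances with the original ones; the names euro.fit and map.dist are made up for this example, and the correlation is only one of several possible summaries.

```r
### Classical MDS in two dimensions; eig = TRUE also returns eigenvalues and GOF ###
euro.fit <- cmdscale(eurodist, k = 2, eig = TRUE)

### Goodness-of-fit values reported by cmdscale (closer to 1 is better) ###
euro.fit$GOF

### Straight-line distances between the cities on the 2-D map ###
map.dist <- dist(euro.fit$points)

### Correlation between map distances and the original road distances ###
cor(as.vector(eurodist), as.vector(map.dist))
```

Road distances are not straight-line distances, so a perfect fit in two dimensions should not be expected.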