Source file ⇒ Data_Analysis_for_Life_Sciences4.rmd

Week 1: Distance

http://genomicsclass.github.io/book/pages/distance.html

High Dimensional Data Analysis: Distance and Dimension Reduction

Introduction

The concept of distance is quite intuitive. For example, when we cluster animals into subgroups, we are implicitly defining a distance that permits us to say what animals are “close” to each other.

Clustering of animals

Many of the analyses we perform with high-dimensional data relate directly or indirectly to distance. Many clustering and machine learning techniques rely on being able to define distance, using features or predictors. For example, to create heatmaps, which are widely used in genomics and other high-throughput fields, a distance is computed explicitly.

Example of heatmap

Image Source: Heatmap, Gaeddal, 01.28.2007

In these plots the measurements, which are stored in a matrix, are represented with colors after the columns and rows have been clustered. (A side note: red and green, a common color theme for heatmaps, are two of the most difficult colors for many color-blind people to discern.) Here we will learn the necessary mathematics and computing skills to understand and create heatmaps. We start by reviewing the mathematical definition of distance.

Distance

Euclidean Distance

As a review, let’s define the distance between two points, \(A\) and \(B\), on a Cartesian plane.

The Euclidean distance between \(A\) and \(B\) is simply:

\[\sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2}\]
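
As a quick numeric check of this formula, here is a minimal sketch in R with two made-up points A and B:

A <- c(2, 5)          # hypothetical point (A_x, A_y)
B <- c(6, 2)          # hypothetical point (B_x, B_y)
sqrt(sum((A - B)^2))  # Euclidean distance; for these points it is 5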

Distance in High Dimensions

We introduce a dataset with gene expression measurements for 22,215 genes from 189 samples. The R objects can be downloaded like this:

#library(devtools)
#install_github("genomicsclass/tissuesGeneExpression")

The data represent RNA expression levels for seven tissues, each with several individuals.

library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##e contains the expression data
## [1] 22215   189
#write.csv(e,"tissuesGeneExpression.csv")
tissue
##   [1] "kidney"      "kidney"      "kidney"      "kidney"      "kidney"     
##   [6] "kidney"      "kidney"      "kidney"      "kidney"      "kidney"     
##  [11] "kidney"      "kidney"      "kidney"      "kidney"      "kidney"     
##  [16] "kidney"      "hippocampus" "hippocampus" "hippocampus" "hippocampus"
##  [21] "hippocampus" "hippocampus" "hippocampus" "hippocampus" "hippocampus"
##  [26] "hippocampus" "hippocampus" "hippocampus" "hippocampus" "hippocampus"
##  [31] "hippocampus" "hippocampus" "hippocampus" "hippocampus" "hippocampus"
##  [36] "hippocampus" "hippocampus" "hippocampus" "hippocampus" "hippocampus"
##  [41] "hippocampus" "hippocampus" "hippocampus" "hippocampus" "hippocampus"
##  [46] "hippocampus" "cerebellum"  "cerebellum"  "cerebellum"  "cerebellum" 
##  [51] "cerebellum"  "cerebellum"  "cerebellum"  "cerebellum"  "cerebellum" 
##  [56] "cerebellum"  "cerebellum"  "cerebellum"  "cerebellum"  "cerebellum" 
##  [61] "cerebellum"  "cerebellum"  "cerebellum"  "cerebellum"  "cerebellum" 
##  [66] "cerebellum"  "cerebellum"  "cerebellum"  "cerebellum"  "cerebellum" 
##  [71] "cerebellum"  "cerebellum"  "cerebellum"  "kidney"      "kidney"     
##  [76] "kidney"      "kidney"      "kidney"      "kidney"      "kidney"     
##  [81] "kidney"      "kidney"      "kidney"      "kidney"      "kidney"     
##  [86] "kidney"      "colon"       "colon"       "colon"       "colon"      
##  [91] "colon"       "colon"       "colon"       "colon"       "colon"      
##  [96] "colon"       "colon"       "colon"       "colon"       "colon"      
## [101] "colon"       "colon"       "colon"       "colon"       "colon"      
## [106] "colon"       "colon"       "colon"       "colon"       "colon"      
## [111] "colon"       "colon"       "colon"       "colon"       "colon"      
## [116] "colon"       "colon"       "colon"       "colon"       "colon"      
## [121] "kidney"      "kidney"      "kidney"      "liver"       "liver"      
## [126] "liver"       "kidney"      "kidney"      "kidney"      "liver"      
## [131] "liver"       "liver"       "kidney"      "kidney"      "cerebellum" 
## [136] "hippocampus" "liver"       "liver"       "liver"       "liver"      
## [141] "liver"       "liver"       "cerebellum"  "cerebellum"  "liver"      
## [146] "liver"       "kidney"      "kidney"      "endometrium" "endometrium"
## [151] "endometrium" "endometrium" "endometrium" "endometrium" "endometrium"
## [156] "endometrium" "endometrium" "endometrium" "endometrium" "endometrium"
## [161] "endometrium" "endometrium" "endometrium" "liver"       "liver"      
## [166] "liver"       "liver"       "liver"       "liver"       "liver"      
## [171] "liver"       "liver"       "liver"       "liver"       "liver"      
## [176] "cerebellum"  "cerebellum"  "cerebellum"  "cerebellum"  "cerebellum" 
## [181] "cerebellum"  "cerebellum"  "cerebellum"  "placenta"    "placenta"   
## [186] "placenta"    "placenta"    "placenta"    "placenta"
table(tissue) ##tissue[i] tells us what tissue is represented by e[,i]
## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

We are interested in describing distance between samples in the context of this dataset. We might also be interested in finding genes that behave similarly across samples.

To define distance, we need to know what the points are since mathematical distance is computed between points. With high dimensional data, points are no longer on the Cartesian plane. Instead they are in higher dimensions. For example, sample \(i\) is defined by a point in 22,215 dimensional space: \((Y_{1,i},\dots,Y_{22215,i})^\top\). Feature \(g\) is defined by a point in 189 dimensions: \((Y_{g,1},\dots,Y_{g,189})^\top\).

Once we define points, the Euclidean distance is defined in a very similar way as it is defined for two dimensions. For instance, the distance between two samples \(i\) and \(j\) is:

\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } \]

and the distance between two features \(h\) and \(g\) is:

\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]

Note: In practice, distances between features are typically computed after standardizing the data for each feature. This is essentially equivalent to using one minus the correlation. The standardization is done because differences in overall levels between features are often due to technical rather than biological effects. More details on this topic can be found in this presentation.
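
As a small sketch of this note (using two arbitrary features that also appear in the exercises below), the squared Euclidean distance between standardized features is a simple function of one minus their correlation:

f1 <- e["210486_at", ]           # two arbitrary features, for illustration only
f2 <- e["200805_at", ]
n  <- length(f1)
sum((scale(f1) - scale(f2))^2)   # squared distance after standardizing each feature
2 * (n - 1) * (1 - cor(f1, f2))  # the same quantity expressed via the correlation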

Distance with matrix algebra

The distance between samples \(i\) and \(j\) can be written as

\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]

with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) denoting columns \(i\) and \(j\). This result can be very convenient in practice as computations can be made much faster using matrix multiplication.

Examples

We can now use the formulas above to compute distance. Let’s compute the distance between samples 1 and 2, both kidney samples, and then between sample 1 and sample 87, a colon sample.

x <- e[,1]
y <- e[,2]
z <- e[,87]
tissue[c(1,2,87)]
## [1] "kidney" "kidney" "colon"
sqrt(sum((x-y)^2))
## [1] 85.8546
sqrt(sum((x-z)^2))
## [1] 122.8919

As expected, the kidneys are closer to each other. A faster way to compute this is using matrix algebra:

sqrt( crossprod(x-y) )
##         [,1]
## [1,] 85.8546
sqrt( crossprod(x-z) )
##          [,1]
## [1,] 122.8919

Now, to compute all the distances at once, we can use the function dist. Because it computes the distance between each pair of rows, and here we are interested in the distance between samples, which are the columns, we transpose the matrix:

#?dist
d <- dist(t(e))
class(d)
## [1] "dist"

Note that this produces an object of class dist and, to access the entries using row and column indexes, we need to coerce it into a matrix:

as.matrix(d)[1,2]
## [1] 85.8546
as.matrix(d)[1,87]
## [1] 122.8919
image(as.matrix(d))

The image function applied to as.matrix(d) gives a color image of all the distances, the full 189 by 189 matrix. The upper and lower triangles are mirror images because the distance between x and y is the same as the distance between y and x. The reddish squares appear because the samples are somewhat ordered by tissue, so samples that are close in the matrix also tend to be biologically similar; red corresponds to smaller distances and yellow to larger ones.

It is important to remember that if we run dist on e, it will compute all pairwise distances between genes. This would try to create a \(22215 \times 22215\) matrix that may crash your R session.
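
If you do need gene-to-gene distances, one sketch of a safer approach is to work with a random subset of genes (the subset size here is arbitrary):

set.seed(1)
idx <- sample(nrow(e), 100)  # 100 randomly chosen genes, for illustration only
dg  <- dist(e[idx, ])        # pairwise distances among just these genes
dim(as.matrix(dg))           # 100 x 100 instead of 22215 x 22215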

Distance exercises

http://genomicsclass.github.io/book/pages/distance_exercises.html

Distance Exercises #1

If you have not done so already, install the data package tissuesGeneExpression.

# library(devtools)
# install_github("genomicsclass/tissuesGeneExpression")

The data represent RNA expression levels for seven tissues, each with several biological replicates. We call samples that we consider to be from the same population, such as liver tissue from different individuals, biological replicates:

library(tissuesGeneExpression)
data(tissuesGeneExpression)
head(e)
##           GSM11805.CEL.gz GSM11814.CEL.gz GSM11823.CEL.gz GSM11830.CEL.gz
## 1007_s_at       10.191267       10.509167       10.272027       10.252952
## 1053_at          6.040463        6.696075        6.144663        6.575153
## 117_at           7.447409        7.775354        7.696235        8.478135
## 121_at          12.025042       12.007817       11.633279       11.075286
## 1255_g_at        5.269269        5.180389        5.301714        5.372235
## 1294_at          8.535176        8.587241        8.277414        8.603650
##           GSM12067.CEL.gz GSM12075.CEL.gz GSM12079.CEL.gz GSM12098.CEL.gz
## 1007_s_at       10.157605        9.966782        9.839348        9.945652
## 1053_at          6.606701        6.060069        6.186596        5.927861
## 117_at           8.116336        7.644452        8.009581        7.847192
## 121_at          10.832528       11.705062       11.706145       11.750370
## 1255_g_at        5.334905        5.253682        5.228794        5.155278
## 1294_at          8.303227        8.341625        8.947253        8.389109
##           GSM12105.CEL.gz GSM12268.CEL.gz GSM12270.CEL.gz GSM12283.CEL.gz
## 1007_s_at        9.913031       10.170344        9.457415       10.126887
## 1053_at          6.337478        6.045789        6.280457        5.882025
## 117_at           7.983850        7.544486        8.571040        7.677892
## 121_at          10.706184       11.760161       10.716936       11.664962
## 1255_g_at        5.236442        5.405336        5.343847        5.098063
## 1294_at          8.513929        8.227427        9.068964        7.987842
##           GSM12298.CEL.gz GSM12300.CEL.gz GSM12399.CEL.gz GSM12444.CEL.gz
## 1007_s_at       10.466426        9.969730       10.512177        9.954500
## 1053_at          6.203864        5.987046        6.449748        6.161354
## 117_at           7.704939        7.615484        8.474977        7.491579
## 121_at          11.657138       11.436698       10.961178       11.554792
## 1255_g_at        5.258476        5.381385        5.323794        5.238425
## 1294_at          8.335472        8.119172        8.522059        8.424942
##           GSM21203.cel.gz GSM21204.cel.gz GSM21205.cel.gz GSM21206.cel.gz
## 1007_s_at       11.166118       10.919300       11.067899       11.157507
## 1053_at          5.853414        5.661951        5.785463        5.831846
## 117_at           7.447074        7.386950        8.160359       11.767046
## 121_at          10.623943       10.986785       11.150964       10.598356
## 1255_g_at        5.268522        5.496033        5.334942        5.204861
## 1294_at          8.401319        7.931769        8.557533        7.605685
##           GSM21207.cel.gz GSM21208.cel.gz GSM21209.cel.gz GSM21210.cel.gz
## 1007_s_at       11.390987       10.889327       11.667431       11.222451
## 1053_at          5.924852        5.652696        5.785190        5.712906
## 117_at           7.956786        7.970167        7.666343        7.195045
## 121_at          11.111799       10.926900       11.010148       10.717852
## 1255_g_at        4.956953        5.233504        5.336258        5.445535
## 1294_at          8.439095        7.914012        8.323745        8.154334
##           GSM21212.cel.gz GSM21213.cel.gz GSM21214.cel.gz GSM21215.cel.gz
## 1007_s_at       11.898028       11.400538       10.955428       10.361340
## 1053_at          5.676397        5.676046        5.776781        5.663634
## 117_at           7.474182        7.927246        7.493743        7.328729
## 121_at          10.801148       11.029615       10.688563       10.498799
## 1255_g_at        5.260948        5.326644        5.243714        6.086277
## 1294_at          8.444386        8.250350        7.797140        7.648405
##           GSM21216.cel.gz GSM21217.cel.gz GSM21218.cel.gz GSM21219.cel.gz
## 1007_s_at       10.771302       10.272978       10.757734       11.371014
## 1053_at          5.791955        5.713266        5.984170        5.736768
## 117_at           7.364371        7.181879        8.525524        7.441310
## 121_at          10.519630       10.375178       10.546843       10.732846
## 1255_g_at        6.872226        7.220032        5.277552        5.532549
## 1294_at          7.790623        7.808684        7.655907        8.105180
##           GSM21220.cel.gz GSM21221.cel.gz GSM21222.cel.gz GSM21223.cel.gz
## 1007_s_at       10.914514       10.573316       11.226525       11.243415
## 1053_at          5.850332        5.844528        5.820929        5.768447
## 117_at           9.212210        7.257813        7.564929        7.552887
## 121_at          10.667456       10.701537       11.001622       10.855300
## 1255_g_at        5.933984        6.002629        5.455578        5.393086
## 1294_at          7.903249        7.840000        7.927451        8.206483
##           GSM21224.cel.gz GSM21225.cel.gz GSM21226.cel.gz GSM21227.cel.gz
## 1007_s_at       11.575293       11.439544       11.009120       10.642950
## 1053_at          5.874935        5.985295        5.750322        5.608566
## 117_at           7.517278        7.360212        7.508019        7.633826
## 121_at          10.740967       10.677039       11.123188       10.565005
## 1255_g_at        5.325502        5.586701        5.248449        5.257288
## 1294_at          8.086701        7.970317        8.006025        7.946024
##           GSM21228.cel.gz GSM21229.cel.gz GSM21230.cel.gz GSM21231.cel.gz
## 1007_s_at       11.296292       11.239662       11.496114       11.077318
## 1053_at          5.822449        6.067430        5.760156        5.957667
## 117_at           7.947366        7.521374        7.787561        7.908426
## 121_at          10.753435       10.868629       10.662749       10.783401
## 1255_g_at        5.530244        5.280227        5.230454        5.191462
## 1294_at          8.080371        7.847783        8.369287        7.818008
##           GSM21232.cel.gz GSM21233.cel.gz GSM87058.cel.gz GSM87064.cel.gz
## 1007_s_at       11.032670       10.733382        9.528979        9.674865
## 1053_at          5.688423        5.822977        5.930689        5.903833
## 117_at           8.080432        7.477119        7.423056        8.999608
## 121_at          11.064665       11.096579       11.051851       11.030058
## 1255_g_at        5.177035        5.135013        5.152332        5.122464
## 1294_at          8.354879        7.919090        7.658205        7.688351
##           GSM87065.cel.gz GSM87066.cel.gz GSM87071.cel.gz GSM87072.cel.gz
## 1007_s_at        9.575169        9.746007        9.804795        9.597073
## 1053_at          5.901084        5.886079        5.967831        5.760703
## 117_at           7.570569        7.459517        7.396987        7.632634
## 121_at          10.959841       10.811034       10.983345       11.160525
## 1255_g_at        5.155192        5.251132        5.114582        5.031275
## 1294_at          7.546350        7.682545        7.342646        7.583204
##           GSM87073.cel.gz GSM87074.cel.gz GSM87075.cel.gz GSM87085.cel.gz
## 1007_s_at        9.611386        9.859942       10.318618        9.864295
## 1053_at          5.911010        6.025677        5.856556        5.753874
## 117_at           7.255444        7.385739        7.048070        7.712646
## 121_at          10.930774       10.828727       10.898723       10.842528
## 1255_g_at        5.135653        5.026157        5.090865        5.264874
## 1294_at          7.547011        7.494077        7.523983        7.657002
##           GSM87086.cel.gz GSM87087.cel.gz GSM87088.cel.gz GSM87089.cel.gz
## 1007_s_at        9.798633        9.758049        9.590453       10.382722
## 1053_at          5.862426        5.904910        5.963214        6.084209
## 117_at           7.279199        7.382182        8.185911        7.330931
## 121_at          10.929617       10.822504       11.194947       10.630318
## 1255_g_at        5.087763        5.183853        5.334466        5.152213
## 1294_at          7.674958        7.513357        7.440385        7.547589
##           GSM87090.CEL.gz GSM87091.cel.gz GSM87092.cel.gz GSM87093.cel.gz
## 1007_s_at        9.744292        9.748162        9.801949        9.540319
## 1053_at          6.076781        6.150398        6.111117        6.107313
## 117_at           7.563120        7.356163        7.409661        7.244511
## 121_at          10.585539       10.899328       10.956315       10.623056
## 1255_g_at        5.192697        5.107863        5.115326        5.285647
## 1294_at          7.581032        7.521760        7.453925        7.441884
##           GSM87094.cel.gz GSM87095.cel.gz GSM87096.cel.gz GSM87097.cel.gz
## 1007_s_at        9.753415        9.616345       10.049081        9.725197
## 1053_at          6.169774        6.194195        6.003842        6.125768
## 117_at           7.338395        7.308199        7.349784        7.310697
## 121_at          10.672483       10.784685       11.023774       10.853290
## 1255_g_at        5.138855        5.150373        5.070798        5.085144
## 1294_at          7.545716        7.565719        7.821006        7.728316
##           GSM87098.cel.gz GSM87099.cel.gz GSM87100.cel.gz GSM87101.cel.gz
## 1007_s_at        9.994052        9.695028        9.654518        9.709508
## 1053_at          6.134039        6.214735        6.057829        6.128327
## 117_at           7.784562        7.447524        7.712780        7.421849
## 121_at          11.117425       10.945455       10.739740       10.924573
## 1255_g_at        5.062101        5.259323        5.022970        5.174359
## 1294_at          7.657950        7.770457        7.469018        7.659710
##           GSM87102.cel.gz GSM146778.CEL.gz GSM146779.CEL.gz
## 1007_s_at        9.578900         9.864295         9.949632
## 1053_at          6.181457         5.715836         6.582316
## 117_at           7.550353         7.420220         7.651495
## 121_at          10.761253        11.754427        11.518203
## 1255_g_at        5.167119         5.098613         5.091193
## 1294_at          7.558538         8.059662         8.773412
##           GSM146781.CEL.gz GSM146782.CEL.gz GSM146783.CEL.gz
## 1007_s_at        10.771730        10.381994        10.008225
## 1053_at           6.381538         5.810887         5.922002
## 117_at            7.795136         7.434689         7.720692
## 121_at           11.258240        11.952897        10.847877
## 1255_g_at         5.126175         5.167193         5.080195
## 1294_at           8.154487         7.986321         8.496441
##           GSM146785.CEL.gz GSM146787.CEL.gz GSM146788.CEL.gz
## 1007_s_at        10.067104        10.355680        10.318169
## 1053_at           6.186499         5.921369         6.210221
## 117_at            7.544709         7.730206         7.958453
## 121_at           10.941188        11.325783        11.502778
## 1255_g_at         5.061560         5.117684         5.103714
## 1294_at           8.034500         8.344711         8.174741
##           GSM146789.CEL.gz GSM146791.CEL.gz GSM146793.CEL.gz
## 1007_s_at        10.180608        10.330264         9.984370
## 1053_at           5.816531         6.186073         6.190721
## 117_at            7.293380         7.487062         7.423247
## 121_at           11.829316        11.462597        10.831227
## 1255_g_at         5.112861         5.141480         5.127723
## 1294_at           7.896776         7.820158         8.256471
##           GSM146795.CEL.gz GSM146797.CEL.gz GSM92240.CEL.gz
## 1007_s_at         9.903414        10.559245       10.658694
## 1053_at           6.229082         6.356300        6.415478
## 117_at            8.537829         8.169326        8.087378
## 121_at           11.312691        11.537642       10.605016
## 1255_g_at         5.068838         5.241162        5.082552
## 1294_at           8.000838         8.110253        8.273383
##           GSM92241.CEL.gz GSM92242.CEL.gz GSM92243.CEL.gz GSM92244.CEL.gz
## 1007_s_at       10.967854       10.373213       11.395477       10.822040
## 1053_at          6.439052        6.523120        6.280099        6.377340
## 117_at           7.892145        7.625365        7.829470        7.461025
## 121_at          10.416664       10.644904       10.669002       10.332522
## 1255_g_at        5.172805        5.168378        5.207066        5.152687
## 1294_at          8.296768        7.853151        7.552782        7.998216
##           GSM92245.CEL.gz GSM92247.CEL.gz GSM92248.CEL.gz GSM92249.CEL.gz
## 1007_s_at       10.077308       10.856773       11.185204       11.039855
## 1053_at          6.287809        6.209920        6.587097        6.239914
## 117_at           7.806562        7.663545        7.655907        7.939491
## 121_at          10.880915       10.643931       10.709260       10.844060
## 1255_g_at        4.929143        4.954293        5.074027        5.115084
## 1294_at          8.058602        8.162227        8.494932        7.652769
##           GSM92250.CEL.gz GSM92253.CEL.gz GSM92254.CEL.gz GSM92255.CEL.gz
## 1007_s_at       11.079038       11.345528       11.682068       10.899083
## 1053_at          6.662952        6.675912        6.274149        6.622457
## 117_at           7.298781        7.560152        7.505362        7.723532
## 121_at          10.725834       10.690677       10.722859       10.595043
## 1255_g_at        5.179113        5.115800        5.000593        5.049459
## 1294_at          8.275607        7.974583        7.912150        8.109445
##           GSM92256.CEL.gz GSM92257.CEL.gz GSM92258.CEL.gz GSM92259.CEL.gz
## 1007_s_at       10.979619       11.148514       11.156803       11.047648
## 1053_at          6.001221        6.147199        6.448577        6.695092
## 117_at           7.944837        8.288436        7.468884        7.595644
## 121_at          10.794864       11.005032       10.759466       10.729882
## 1255_g_at        5.065279        5.076212        4.986409        5.131755
## 1294_at          8.301471        8.424359        8.237984        8.699576
##           GSM92260.CEL.gz GSM92261.CEL.gz GSM92262.CEL.gz GSM92263.CEL.gz
## 1007_s_at       11.209538       11.120316       10.599085       10.742845
## 1053_at          6.660767        6.293271        5.891772        6.519291
## 117_at           7.361785        7.715139        8.029006        7.376248
## 121_at          10.723057       10.645030       11.081891       10.545488
## 1255_g_at        5.114728        5.159226        5.036999        5.154628
## 1294_at          8.112350        8.146838        8.332777        7.811800
##           GSM92264.CEL.gz GSM92265.CEL.gz GSM92266.CEL.gz GSM92267.CEL.gz
## 1007_s_at       11.266456       10.716785       10.828210       11.321320
## 1053_at          6.236967        6.233576        6.643156        5.946477
## 117_at           7.433968        7.760819        8.373691        8.265794
## 121_at          10.458710       10.785353       10.664612       11.039494
## 1255_g_at        5.150165        5.088988        5.157535        5.095064
## 1294_at          8.193679        7.780700        8.627630        8.301142
##           GSM92268.CEL.gz GSM92269.CEL.gz GSM92270.CEL.gz GSM92271.CEL.gz
## 1007_s_at       11.152870       10.646973       10.586756       10.652296
## 1053_at          6.729296        6.349878        6.417514        6.318650
## 117_at           7.569348        7.584153        7.492194        7.790239
## 121_at          10.763125       10.402084       10.861418       10.677872
## 1255_g_at        5.053854        5.262421        5.022686        5.251561
## 1294_at          7.988804        8.188457        8.666162        8.467416
##           GSM92272.CEL.gz GSM92273.CEL.gz GSM92274.CEL.gz GSM92275.CEL.gz
## 1007_s_at       10.725555       11.276537       10.817390       10.391968
## 1053_at          6.566363        6.651955        6.210328        5.923316
## 117_at           7.775408        8.088513        7.750246        8.042244
## 121_at          10.542663       11.072542       10.749497       10.591697
## 1255_g_at        4.978003        4.976790        5.138461        5.103688
## 1294_at          8.320906        7.986398        8.576420        8.403646
##           GSM92276.CEL.gz GSM35979.cel.gz GSM35980.cel.gz GSM35981.cel.gz
## 1007_s_at       11.433151       10.693948       10.883900       10.697330
## 1053_at          6.116081        5.796978        5.944847        5.825616
## 117_at           7.783664        7.943051        7.642011        8.596204
## 121_at          10.916419       12.513036       12.428700       12.507858
## 1255_g_at        5.148009        5.080563        5.037217        5.179477
## 1294_at          8.430276        8.272678        8.282937        8.252376
##           GSM35982.cel.gz GSM35983.cel.gz GSM35984.cel.gz GSM35991.cel.gz
## 1007_s_at        9.374745        9.453892        9.423175       10.921897
## 1053_at          6.110668        5.986093        5.881998        5.827053
## 117_at           8.829727        8.831872        8.911270        7.740220
## 121_at          10.963233       11.061160       10.995782       12.530718
## 1255_g_at        5.259811        5.215710        5.059702        5.248926
## 1294_at          8.218427        8.147624        8.072822        8.293092
##           GSM35992.cel.gz GSM35993.cel.gz GSM35994.cel.gz GSM35995.cel.gz
## 1007_s_at       10.940071       10.925507        9.138816        8.956655
## 1053_at          5.918342        5.921025        6.227796        6.457326
## 117_at           7.959814        7.583922        8.488102        9.176629
## 121_at          12.599032       12.547042       10.583243       10.491863
## 1255_g_at        5.362281        5.459251        5.053995        5.180982
## 1294_at          8.427539        8.409805        8.162649        8.443923
##           GSM35996.cel.gz GSM36003.cel.gz GSM44675.CEL.gz GSM44689.CEL.gz
## 1007_s_at        9.216172       10.599262       10.683256       10.609627
## 1053_at          6.269509        5.679601        6.225941        6.195943
## 117_at           7.763407        8.435550        7.517055        7.557968
## 121_at          10.294510       12.527552       12.198910       10.847165
## 1255_g_at        5.242656        5.050530        5.302377        5.060206
## 1294_at          8.230783        8.513598        7.920678        7.615708
##           GSM44697.CEL.gz GSM44702.CEL.gz GSM181429.CEL.gz
## 1007_s_at       10.804661        9.358658         8.082303
## 1053_at          5.944205        6.348221         6.757782
## 117_at           7.504960        8.783509         8.109752
## 121_at          10.780245       11.025208         9.909142
## 1255_g_at        5.415508        5.214369         5.360358
## 1294_at          7.587373        8.050341         8.488678
##           GSM181430.CEL.gz GSM181431.CEL.gz GSM181432.CEL.gz
## 1007_s_at         8.672800         8.420832         8.449043
## 1053_at           6.711571         6.529189         6.403990
## 117_at            8.454446         8.121711         7.889064
## 121_at           10.559288        10.153621        10.685172
## 1255_g_at         4.950746         5.267882         5.316058
## 1294_at           8.048616         8.693640         8.697370
##           GSM181433.CEL.gz GSM18917.CEL.gz GSM18918.CEL.gz GSM18953.CEL.gz
## 1007_s_at         8.391161       10.341440       10.327703        9.884981
## 1053_at           6.418597        5.888530        5.959998        5.941559
## 117_at            8.297173        8.745092        9.106117        8.905465
## 121_at           10.355128       11.755276       11.710110       11.532460
## 1255_g_at         5.195971        5.244010        5.189766        5.517002
## 1294_at           8.456322        8.510393        8.066057        8.754572
##           GSM18954.CEL.gz GSM18955.CEL.gz GSM18956.CEL.gz GSM296875.CEL.gz
## 1007_s_at        9.569800        9.753486       10.059182        10.611813
## 1053_at          5.974681        5.910128        5.884142         6.172908
## 117_at           8.834194        8.302051        8.150492         7.576738
## 121_at          11.273357       11.984180       12.119464        12.239785
## 1255_g_at        5.413542        5.218646        5.130751         5.230148
## 1294_at          8.616671        8.544840        8.585335         8.089192
##           GSM296876.CEL.gz GSM296878.CEL.gz GSM296879.CEL.gz
## 1007_s_at         9.743855        10.626825        10.186927
## 1053_at           6.068650         6.113875         6.108924
## 117_at            7.652066         7.404070         8.326484
## 121_at           10.687793        12.115210        11.788156
## 1255_g_at         5.140055         5.082787         5.087534
## 1294_at           8.575553         8.217176         8.107024
##           GSM296880.CEL.gz GSM296881.CEL.gz GSM296882.CEL.gz
## 1007_s_at        10.508280        10.159876        10.013988
## 1053_at           6.193704         6.303625         5.979658
## 117_at            7.643538         7.436318         7.449793
## 121_at           11.809707        11.272824        11.669962
## 1255_g_at         5.102386         5.103723         5.010524
## 1294_at           8.425690         8.054656         8.447353
##           GSM296883.CEL.gz GSM296886.CEL.gz GSM296887.CEL.gz
## 1007_s_at        10.343827         9.426150         9.732753
## 1053_at           6.115159         6.221048         5.887962
## 117_at            7.618541         7.752332         7.393900
## 121_at           12.159957        10.743438        10.669130
## 1255_g_at         5.200034         5.013313         5.077699
## 1294_at           8.181504         8.458356         8.497650
##           GSM296888.CEL.gz GSM296889.CEL.gz GSM296890.CEL.gz
## 1007_s_at         9.505552         9.489496         9.669966
## 1053_at           5.973776         6.045113         5.853426
## 117_at            7.664099         7.309241         7.512285
## 121_at           10.839761        10.444710        10.824855
## 1255_g_at         5.252535         5.023222         5.176021
## 1294_at           8.210043         8.481768         8.603354
##           GSM296891.CEL.gz GSM296892.CEL.gz GSM298747.CEL.gz
## 1007_s_at         9.477328         9.645158         8.430692
## 1053_at           6.112785         6.158746         5.845383
## 117_at            8.707780         7.512748         8.786429
## 121_at           10.855914        10.813215        10.407874
## 1255_g_at         5.131532         5.289147         5.242395
## 1294_at           8.631548         8.511193         8.548888
##           GSM298748.CEL.gz GSM298749.CEL.gz GSM298750.CEL.gz
## 1007_s_at         8.362492         8.746337         8.280933
## 1053_at           6.072837         6.004847         6.162781
## 117_at            8.886452         8.885333         7.441336
## 121_at           10.448253        10.474274        10.080939
## 1255_g_at         5.308802         5.266128         5.257030
## 1294_at           8.367323         8.530230         8.495641
##           GSM299110.CEL.gz GSM299111.CEL.gz GSM299112.CEL.gz
## 1007_s_at         8.430692         8.746337         8.280933
## 1053_at           5.845383         6.004847         6.162781
## 117_at            8.786429         8.885333         7.441336
## 121_at           10.407874        10.474274        10.080939
## 1255_g_at         5.242395         5.266128         5.257030
## 1294_at           8.548888         8.530230         8.495641
##           GSM299113.CEL.gz GSM299244.CEL.gz GSM299245.CEL.gz
## 1007_s_at         8.362492         8.601285         8.714787
## 1053_at           6.072837         6.049479         6.214427
## 117_at            8.886452         8.650064         8.092632
## 121_at           10.448253        10.418271        10.264134
## 1255_g_at         5.308802         5.018479         5.308032
## 1294_at           8.367323         8.201954         8.150090
##           GSM299246.CEL.gz GSM299247.CEL.gz GSM322969.CEL.gz
## 1007_s_at         8.709645         8.600033        11.920317
## 1053_at           6.125196         6.098775         6.423409
## 117_at            8.292887         8.591519         7.465062
## 121_at           10.545228        10.251836        10.432677
## 1255_g_at         5.124959         5.278551         5.336957
## 1294_at           8.151717         8.652012         8.030188
##           GSM323054.CEL.gz GSM323523.CEL.gz GSM323524.CEL.gz
## 1007_s_at        11.919931        11.747558        11.805023
## 1053_at           6.057275         6.192436         6.361326
## 117_at            7.409525         7.454402         7.235285
## 121_at           10.366122        10.172419        10.351052
## 1255_g_at         5.342778         5.335844         5.359751
## 1294_at           7.708620         7.994306         8.124769
##           GSM323527.CEL.gz GSM323565.CEL.gz GSM323566.CEL.gz
## 1007_s_at        11.797743        10.040886        11.285002
## 1053_at           6.157979         6.224848         6.170956
## 117_at            7.727192         7.573437         7.323547
## 121_at           10.443504        10.566722        10.250737
## 1255_g_at         5.263736         5.048867         5.062005
## 1294_at           7.825729         7.537564         7.985414
##           GSM323567.CEL.gz GSM246492.CEL.gz GSM246493.CEL.gz
## 1007_s_at         9.888693         9.661127         9.803686
## 1053_at           6.211522         6.270153         6.058488
## 117_at            7.228568         7.333568         7.486711
## 121_at           10.191332         9.703713         9.914632
## 1255_g_at         5.186962         5.158631         5.312712
## 1294_at           7.430644         7.821741         7.421166
##           GSM246494.CEL.gz GSM307639.CEL.gz GSM307640.CEL.gz
## 1007_s_at        10.509541         9.984502         9.937738
## 1053_at           6.345526         6.715984         6.836179
## 117_at            7.468406         7.120793         7.125811
## 121_at            9.909784         9.409933         9.587782
## 1255_g_at         5.341193         4.896124         5.296695
## 1294_at           7.900080         8.119396         8.005432
##           GSM307641.CEL.gz
## 1007_s_at        10.306781
## 1053_at           7.025547
## 117_at            7.407624
## 121_at            9.792904
## 1255_g_at         5.206251
## 1294_at           8.283758
head(tissue)
## [1] "kidney" "kidney" "kidney" "kidney" "kidney" "kidney"

Q: How many biological replicates are there for hippocampus?

sum(tissue=="hippocampus")
## [1] 31
##to answer this question for all tissues look at
table(tissue)
## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6

A: 31

Distance Exercises #2

What is the distance between samples 3 and 45?

d <- dist(t(e))
class(d)
## [1] "dist"
as.matrix(d)[3,45]
## [1] 152.5662
#or

sqrt( crossprod(e[,3]-e[,45]) )
##          [,1]
## [1,] 152.5662
## or
sqrt( sum((e[,3]-e[,45])^2 ))
## [1] 152.5662

A: 152.5662

Distance Exercises #3

What is the distance between genes 210486_at and 200805_at?

x<-e["210486_at",]
y<-e["200805_at",]
sqrt( crossprod(x-y) )
##          [,1]
## [1,] 41.01153
#or
sqrt( crossprod(e["210486_at",]-e["200805_at",]) )
##          [,1]
## [1,] 41.01153
## or
sqrt( sum((e["210486_at",]-e["200805_at",])^2 ))
## [1] 41.01153

A: 41.01153

Distance Exercises #4

If I run the command (don’t run it!) d = as.matrix(dist(e)), how many cells (number of rows times number of columns) would this matrix have?

22215*22215
## [1] 493506225
##every pair of rows has an entry:
nrow(e)^2
## [1] 493506225

A: 493506225

Distance Exercises #5

Compute the distance between all pairs of samples. Q: How many distances are stored in d? (Hint: what is the length of d?)

d = dist(t(e))
length(d)
## [1] 17766

A: 17766

Distance Exercises #6

Q: Why is the answer to the previous exercise not ncol(e)^2 or nrow(e)^2?

ncol(e)^2 #Incorrect: counts every ordered pair of samples, including each sample with itself
## [1] 35721
nrow(e)^2 #Incorrect: this counts pairs of genes, not pairs of samples
## [1] 493506225
ncol(e)*(ncol(e)-1)/2 #Correct: dist takes advantage of symmetry and stores only the lower triangle
## [1] 17766

Explanation: Note that the distance between samples i and j is the same as the distance between samples j and i. Also, the distance between a sample and itself is 0. The object returned by dist avoids storing all those redundant values.

Dimension Reduction Motivation

http://genomicsclass.github.io/book/pages/pca_motivation.html

Visualizing data is one of the most important steps, if not the most important step, in the analysis of high-throughput data. The right visualization method may reveal problems with the experimental data that can render the results from a standard analysis, although typically appropriate, completely useless.

We have shown methods for visualizing global properties of the columns or rows, but plots that reveal relationships between columns or between rows are more complicated due to the high dimensionality of data. For example, to compare each of the 189 samples to each other, we would have to create, for example, 17,766 MA-plots. Creating one single scatterplot of the data is impossible since points are very high dimensional.

We will describe powerful techniques for exploratory data analysis based on dimension reduction. The general idea is to reduce the dataset to have fewer dimensions, yet approximately preserve important properties, such as the distance between samples. If we are able to reduce down to, say, two dimensions, we can then easily make plots. The technique behind it all, the singular value decomposition (SVD), is also useful in other contexts. Before introducing the rather complicated mathematics behind the SVD, we will motivate the ideas behind it with a simple example.

Example: Reducing two dimensions to one

We consider an example with twin heights. Here we simulate 100 two dimensional points that represent the number of standard deviations each individual is from the mean height. Each point is a pair of twins:
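
The simulation code is not shown in this rendering. A sketch consistent with the SVD section later in this document (which draws correlated bivariate normal pairs with MASS::mvrnorm) is given below; because the seed and details may differ from the original, the exact numbers shown further down will not be reproduced exactly:

library(rafalib)
library(MASS)
set.seed(1)
n <- 100
y <- t(mvrnorm(n, c(0, 0), matrix(c(1, 0.95, 0.95, 1), 2, 2)))  # 2 x 100: row 1 is twin 1, row 2 is twin 2
mypar()
plot(y[1, ], y[2, ], xlab="Twin 1 (standardized height)", ylab="Twin 2 (standardized height)")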

Simulated twin pair heights.

To help with the illustration, think of this as high-throughput gene expression data with the twin pairs representing the \(N\) samples and the two heights representing gene expression from two genes.

We are interested in the distance between any two samples. We can compute this using dist. For example, here is the distance between the two orange points in the figure above:

d=dist(t(y))
as.matrix(d)[1,2]
## [1] 1.140897

What if making two-dimensional plots was too complex and we were only able to make one-dimensional plots? Can we, for example, reduce the data to a one-dimensional matrix that preserves distances between points?

If we look back at the plot, and visualize a line between any pair of points, the length of this line is the distance between the two points. These lines tend to go along the direction of the diagonal. We have seen before that we can “rotate” the plot so that the diagonal is on the x-axis by making an MA-plot instead:

z1 = (y[1,]+y[2,])/2 #the average
z2 = (y[1,]-y[2,])   #the difference

z = rbind( z1, z2) #matrix now same dimensions as y

thelim <- c(-3,3)
mypar(1,2)

plot(y[1,],y[2,],xlab="Twin 1 (standardized height)",ylab="Twin 2 (standardized height)",xlim=thelim,ylim=thelim)
points(y[1,1:2],y[2,1:2],col=2,pch=16)

plot(z[1,],z[2,],xlim=thelim,ylim=thelim,xlab="Average height",ylab="Difference in height")
points(z[1,1:2],z[2,1:2],col=2,pch=16)
Twin height scatterplot (left) and MA-plot (right).

Later, we will start using linear algebra to represent transformations of the data such as this one. Here we can see that to get z we multiplied y by the matrix:

\[ A = \, \begin{pmatrix} 1/2&1/2\\ 1&-1\\ \end{pmatrix} \implies z = A y \]

Remember that we can transform back by simply multiplying by \(A^{-1}\) as follows:

\[ A^{-1} = \, \begin{pmatrix} 1&1/2\\ 1&-1/2\\ \end{pmatrix} \implies y = A^{-1} z \]
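
A quick numeric check of this pair of transformations (a sketch using the simulated y above):

A <- matrix(c(1/2, 1, 1/2, -1), 2, 2)  # the matrix A above, filled by column
z <- A %*% y                           # same z as computed from z1 and z2 above
round(solve(A), 2)                     # matches the A^{-1} shown above
max(abs(y - solve(A) %*% z))           # transforming back recovers y, up to rounding error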

Rotations

In the plot above, the distance between the two orange points remains roughly the same, relative to the distance between other points. This is true for all pairs of points. A simple re-scaling of the transformation we performed above will actually make the distances exactly the same. What we will do is multiply by a scalar so that the standard deviations of the two dimensions are preserved. If you think of the two coordinates \(Y_1\) and \(Y_2\) as independent random variables with standard deviation \(\sigma\), then note that the standard deviations of \(Z_1\) and \(Z_2\) are:

\[ \mbox{sd}[ Z_1 ] = \mbox{sd}[ (Y_1 + Y_2) / 2 ] = \frac{1}{\sqrt{2}} \sigma \mbox{ and } \mbox{sd}[ Z_2] = \mbox{sd}[ Y_1 - Y_2 ] = {\sqrt{2}} \sigma \]

This implies that if we change the transformation above to:

\[ A = \frac{1}{\sqrt{2}} \begin{pmatrix} 1&1\\ 1&-1\\ \end{pmatrix} \]

then the SDs of the rows of \(Z\) are the same as the SDs of the rows of \(Y\). Also, notice that \(A^\top A=I\) (for this \(A\), \(A^{-1}=A^\top\)). We call matrices with this property orthogonal, and it guarantees the distance-preserving behavior described above. The distances are now exactly preserved:

A <- 1/sqrt(2)*matrix(c(1,1,1,-1),2,2)
z <- A%*%y
d <- dist(t(y))
d2 <- dist(t(z))
mypar(1,1)
plot(as.numeric(d),as.numeric(d2)) #as.numeric turns the distances into a long vector
abline(0,1,col=2)
Distance computed from original data and after rotation is the same.

We call this particular transformation a rotation of y.

mypar(1,2)

thelim <- c(-3,3)
plot(y[1,],y[2,],xlab="Twin 1 (standardized height)",ylab="Twin 2 (standardized height)",xlim=thelim,ylim=thelim)
points(y[1,1:2],y[2,1:2],col=2,pch=16)

plot(z[1,],z[2,],xlim=thelim,ylim=thelim,xlab="Average height",ylab="Difference in height")
points(z[1,1:2],z[2,1:2],col=2,pch=16)
Twin height scatterplot (left) and after rotation (right).

The reason we applied this transformation in the first place was because we noticed that to compute the distances between points, we followed a direction along the diagonal in the original plot, which after the rotation falls on the horizontal, or the first dimension of z. So this rotation actually achieves what we originally wanted: we can preserve the distances between points with just one dimension. Let’s remove the second dimension of z and recompute distances:

d3 = dist(z[1,]) ##distance computed using just first dimension
mypar(1,1)
plot(as.numeric(d),as.numeric(d3)) 
abline(0,1)
Distance computed with just one dimension after rotation versus actual distance.

The distance computed with just one dimension provides a very good approximation to the actual distance and a very useful dimension reduction: from 2 dimensions to 1. This first dimension of the transformed data is actually the first principal component. This idea motivates the use of principal component analysis (PCA) and the singular value decomposition (SVD) to achieve dimension reduction more generally.
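
As a sketch of the connection to R’s built-in PCA (prcomp expects samples in rows, so we transpose), the first row of z should be essentially perfectly correlated, up to sign, with the first principal component:

pc <- prcomp(t(y))
abs(cor(z[1, ], pc$x[, 1]))  # should be very close to 1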

Important note on a difference to other explanations

If you search the web for descriptions of PCA, you will notice a difference in notation to how we describe it here. This mainly stems from the fact that it is more common to have rows represent units. Hence, in the example shown here, \(Y\) would be transposed to be an \(N \times 2\) matrix. In statistics this is also the most common way to represent the data: individuals in the rows. However, for practical reasons, in genomics it is more common to represent units in the columns. For example, genes are rows and samples are columns. For this reason, in this book we explain PCA and all the math that goes with it in a slightly different way than it is usually done. As a result, many of the explanations you find for PCA start out with the sample covariance matrix usually denoted with \(\mathbf{X}^\top\mathbf{X}\) and having cells representing covariance between two units. Yet for this to be the case, we need the rows of \(\mathbf{X}\) to represent units. So in our notation above, you would have to compute, after scaling, \(\mathbf{Y}\mathbf{Y}^\top\) instead.

Basically, if you want our explanations to match others you have to transpose the matrices we show here.

Week 2: Dimension Reduction

Projections

http://genomicsclass.github.io/book/pages/projections.html

Now that we have described the concept of dimension reduction and some of the applications of SVD and principal component analysis, we focus on more details related to the mathematics behind these. We start with projections. A projection is a linear algebra concept that helps us understand many of the mathematical operations we perform on high-dimensional data. For more details, you can review projections in a linear algebra book. Here we provide a quick review and then give some data analysis related examples.

As a review, remember that projections minimize the distance between points and subspace.

Illustration of projection

In the figure above, the arrow on top is pointing to a point in space. In this particular cartoon, the space is two dimensional, but we should be thinking abstractly. The space is represented by the Cartesian plane and the line on which the little person stands is a subspace of points. The projection to this subspace is the place that is closest to the original point. Geometry tells us that we can find this closest point by dropping a perpendicular line (dotted line) from the point to the space. The little person is standing on the projection. The amount this person had to walk from the origin to the new projected point is referred to as the coordinate.

For the explanation of projections, we will use the standard matrix algebra notation for points: \(\vec{y} \in \mathbb{R}^N\) is a point in \(N\)-dimensional space and \(L \subset \mathbb{R}^N\) is a smaller subspace.

Simple example with N=2

Let \(Y = \begin{pmatrix} 2 \\ 3\end{pmatrix}\). We can plot it like this:

mypar (1,1)
plot(c(0,4),c(0,4),xlab="Dimension 1",ylab="Dimension 2",type="n")
arrows(0,0,2,3,lwd=3)
text(2,3," Y",pos=4,cex=3)
Geometric representation of Y.

We can immediately define a coordinate system by projecting this vector to the space defined by: \(\begin{pmatrix} 1\\ 0\end{pmatrix}\) (the x-axis) and \(\begin{pmatrix} 0\\ 1\end{pmatrix}\) (the y-axis). The projections of \(Y\) to the subspace defined by these points are 2 and 3 respectively:

\[ \begin{align*} Y &= \begin{pmatrix} 2 \\ 3\end{pmatrix} \\ &=2 \begin{pmatrix} 1\\ 0\end{pmatrix} + 3 \begin{pmatrix} 0\\ 1\end{pmatrix} \end{align*}\]

We say that \(2\) and \(3\) are the coordinates and that \(\begin{pmatrix} 1\\ 0\end{pmatrix} \mbox{and} \begin{pmatrix} 0\\1 \end{pmatrix}\) are the bases.

Now let’s define a new subspace. The red line in the plot below is subset \(L\) defined by points satisfying \(c \vec{v}\) with \(\vec{v}=\begin{pmatrix} 2& 1\end{pmatrix}^\top\). The projection of \(\vec{y}\) onto \(L\) is the closest point on \(L\) to \(\vec{y}\). So we need to find the \(c\) that minimizes the distance between \(\vec{y}\) and \(c\vec{v}=(2c,c)\). In linear algebra, we learn that the difference between these points is orthogonal to the space so:

\[ (\vec{y}-\hat{c}\vec{v}) \cdot \vec{v} = 0 \]

this implies that:

\[ \vec{y}\cdot\vec{v} - \hat{c}\vec{v}\cdot\vec{v} = 0 \]

and:

\[\hat{c} = \frac{\vec{y}\cdot\vec{v}} {\vec{v}\cdot\vec{v}} \]

Here the dot \(\cdot\) represents the dot product: \(\,\, \vec{x} \cdot \vec{y} = x_1 y_1+\dots x_n y_n\).

The following R code confirms this equation works:

mypar(1,1)
plot(c(0,4),c(0,4),xlab="Dimension 1",ylab="Dimension 2",type="n")
arrows(0,0,2,3,lwd=3)
abline(0,0.5,col="red",lwd=3) #if x=2c and y=c then slope is 0.5 (y=0.5x)
text(2,3," Y",pos=4,cex=3)
y=c(2,3)
x=c(2,1)
cc = crossprod(x,y)/crossprod(x)
segments(x[1]*cc,x[2]*cc,y[1],y[2],lty=2)
text(x[1]*cc,x[2]*cc,expression(hat(Y)),pos=4,cex=3)
Projection of Y onto new subspace.

Note that if \(\vec{v}\) was such that \(\vec{v}\cdot \vec{v}=1\), then \(\hat{c}\) is simply \(\vec{y} \cdot \vec{v}\) and the space \(L\) does not change. This simplification is one reason we like orthogonal matrices.
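
A small sketch checking this remark with the x = (2,1) and y = (2,3) used above: rescaling \(\vec{v}\) to have unit norm changes the coefficient (it becomes just the dot product) but not the projected point:

v_unit <- x / sqrt(sum(x^2))              # unit-norm version of v
c_unit <- crossprod(v_unit, y)            # with a unit vector, c-hat is simply the dot product
max(abs(c_unit[1] * v_unit - cc[1] * x))  # the projected point is unchanged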

Example: The sample mean is a projection

Let \(\vec{y} \in \mathbb{R}^N\) and \(L \subset \mathbb{R}^N\) is the space spanned by:

\[\vec{v}=\begin{pmatrix} 1\\ \vdots \\ 1\end{pmatrix}; L = \{ c \vec{v}; c \in \mathbb{R}\} \]

In this space, all components of the vectors are the same number, so we can think of this space as representing the constants: in the projection each dimension will be the same value. So what \(c\) minimizes the distance between \(c\vec{v}\) and \(\vec{y}\) ?

When talking about problems like this, we sometimes use two dimensional figures such as the one above. We simply abstract and think of \(\vec{y}\) as a point in \(N\) dimensions and \(L\) as a subspace defined by a smaller number of values, in this case just one: \(c\).

Getting back to our question, we know that the projection is:

\[\hat{c} = \frac{\vec{y}\cdot\vec{v}} {\vec{v}\cdot\vec{v}} \]

which in this case is the average:

\[ \hat{c} = \frac{\vec{y}\cdot\vec{v}} {\vec{v}\cdot\vec{v}} = \frac{\sum_{i=1}^N Y_i}{\sum_{i=1}^N 1} = \bar{Y} \]

Here, it also would have been just as easy to use calculus:

\[\frac{\partial}{\partial c}\sum_{i=1}^N (Y_i - c)^2 = 0 \implies - 2 \sum_{i=1}^N (Y_i - \hat{c}) = 0 \implies\]

\[ N c = \sum_{i=1}^N Y_i \implies \hat{c}=\bar{Y}\]
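
A numeric check of this result (a sketch with arbitrary made-up data):

set.seed(1)
yvec <- rnorm(10)                           # a hypothetical y in R^10
v <- rep(1, 10)                             # the all-ones vector spanning L
c_hat <- crossprod(v, yvec) / crossprod(v)  # projection coefficient
c(c_hat, mean(yvec))                        # the two numbers agree: the projection is the sample mean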

Example: Regression is also a projection

Let us give a slightly more complicated example. Simple linear regression can also be explained with projections. Our data \(\mathbf{Y}\) (we are no longer going to use the \(\vec{y}\) notation) is again an \(N\)-dimensional vector and our model predicts \(Y_i\) with a line \(\beta_0 + \beta_1 X_i\). We want to find the \(\beta_0\) and \(\beta_1\) that minimize the distance between \(Y\) and the space defined by:

\[ L = \{ \beta_0 \vec{v}_0 + \beta_1 \vec{v}_1 ; \vec{\beta}=(\beta_0,\beta_1) \in \mathbb{R}^2 \}\]

with:

\[ \vec{v}_0= \begin{pmatrix} 1\\ 1\\ \vdots \\ 1\\ \end{pmatrix} \mbox{ and } \vec{v}_1= \begin{pmatrix} X_{1}\\ X_{2}\\ \vdots \\ X_{N}\\ \end{pmatrix} \]

Our \(N\times 2\) matrix \(\mathbf{X}\) is \([ \vec{v}_0 \,\, \vec{v}_1]\) and any point in \(L\) can be written as \(X\vec{\beta}\).

The equation for the multidimensional version of orthogonal projection is:

\[X^\top (\vec{y}-X\vec{\beta}) = 0\]

which we have seen before and gives us:

\[X^\top X \hat{\beta}= X^\top \vec{y} \]

\[\hat{\beta}= (X^\top X)^{-1}X^\top \vec{y}\]

And the projection to \(L\) is therefore:

\[X (X^\top X)^{-1}X^\top \vec{y}\]
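
A sketch verifying that this projection formula reproduces the fitted values from lm() (with made-up data; the variable names are illustrative):

set.seed(1)
N <- 20
xcov <- rnorm(N)
yvec <- 2 + 3 * xcov + rnorm(N)           # simulated outcome
X <- cbind(1, xcov)                       # design matrix [v0 v1]
proj <- X %*% solve(crossprod(X)) %*% crossprod(X, yvec)
max(abs(proj - fitted(lm(yvec ~ xcov))))  # essentially zero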

Rotations

http://genomicsclass.github.io/book/pages/rotations.html

One of the most useful applications of projections relates to coordinate rotations. In data analysis, simple rotations can make data easier to visualize and interpret. We will describe the mathematics behind rotations and give some data analysis examples.

In our previous section, we used the following example:

\[ Y = \begin{pmatrix} 2 \\ 3 \end{pmatrix} = 2 \begin{pmatrix} 1\\ 0 \end{pmatrix} + 3 \begin{pmatrix} 0\\ 1 \end{pmatrix} \]

and noted that \(2\) and \(3\) are the coordinates

library(rafalib)
mypar()
plot(c(-2,4),c(-2,4),xlab="Dimension 1",ylab="Dimension 2",type="n",xaxt="n",yaxt="n",bty="n")
text(rep(0,6),c(c(-2,-1),c(1:4)),as.character(c(c(-2,-1),c(1:4))),pos=2)
text(c(c(-2,-1),c(1:4)),rep(0,6),as.character(c(c(-2,-1),c(1:4))),pos=1)
abline(v=0,h=0)
arrows(0,0,2,3,lwd=3)
segments(2,0,2,3,lty=2)
segments(0,3,2,3,lty=2)
text(2,3," Y",pos=4,cex=3)

However, mathematically we can represent the point \((2,3)\) with other linear combinations:

\[ \begin{align*} Y &= \begin{pmatrix} 2 \\ 3\end{pmatrix} \\ &= 2.5 \begin{pmatrix} 1\\ 1\end{pmatrix} + (-1) \begin{pmatrix} \phantom{-}0.5\\ -0.5\end{pmatrix} \end{align*}\]

The new coordinates are:

\[Z = \begin{pmatrix} 2.5 \\ -1 \end{pmatrix} \]

Graphically, we can see that the coordinates are the projections to the spaces defined by the new basis:

library(rafalib)
mypar()
plot(c(-2,4),c(-2,4),xlab="Dimension 1",ylab="Dimension 2",type="n",xaxt="n",yaxt="n",bty="n")
text(rep(0,6),c(c(-2,-1),c(1:4)),as.character(c(c(-2,-1),c(1:4))),pos=2)
text(c(c(-2,-1),c(1:4)),rep(0,6),as.character(c(c(-2,-1),c(1:4))),pos=1)
abline(v=0,h=0)
abline(0,1,col="red")
abline(0,-1,col="red")
arrows(0,0,2,3,lwd=3)
y=c(2,3)
x1=c(1,1)##new basis
x2=c(0.5,-0.5)##new basis
c1 = crossprod(x1,y)/crossprod(x1)
c2 = crossprod(x2,y)/crossprod(x2)
segments(x1[1]*c1,x1[2]*c1,y[1],y[2],lty=2)
segments(x2[1]*c2,x2[2]*c2,y[1],y[2],lty=2)
text(2,3," Y",pos=4,cex=3)

We can go back and forth between these two representations of \((2,3)\) using matrix multiplication.

\[ Y = AZ\\ \]

\[ A^{-1} Y = Z\\ \]

\[ A= \begin{pmatrix} 1& \phantom{-}0.5\\ 1 & -0.5\end{pmatrix} \implies A^{-1}= \begin{pmatrix} 0.5& 0.5 \\ 1 &-1\end{pmatrix} \]

\(Z\) and \(Y\) carry the same information, but in a different coordinate system.
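
A quick numeric check of this change of coordinates (a minimal sketch):

A <- matrix(c(1, 1, 0.5, -0.5), 2, 2)  # columns are the new basis vectors
Y <- c(2, 3)
Z <- solve(A) %*% Y                    # gives the new coordinates (2.5, -1)
A %*% Z                                # recovers the original (2, 3)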

Example: Twin heights

Here are 100 two-dimensional points \(Y\):

Here are the rotations: \[Z = A^{-1} Y\]

What we have done here is rotate the data so that the first coordinate of \(Z\) is the average height, while the second is the difference between twin heights.

We have used the singular value decomposition to find principal components. It is sometimes useful to think of the SVD as a rotation, for example \(\mathbf{U}^\top \mathbf{Y}\), that gives us a new coordinate system \(\mathbf{DV}^\top\) in which the dimensions are ordered by how much variance they explain.

Projections Exercises

We will continue to use this dataset:

library(Biobase)
library(GSE5859Subset)
data(GSE5859Subset)

Projections Exercises #1

Suppose you want to make an MA plot of the first two samples y = geneExpression[,1:2]. Which of the following projections of y gives us new coordinates such that column 2 versus column 1 is an MA plot?


Singular Value Decomposition (SVD)

http://genomicsclass.github.io/book/pages/svd.html

In the previous section, we motivated dimension reduction and showed a transformation that permitted us to approximate the distance between two dimensional points with just one dimension. The singular value decomposition (SVD) is a generalization of the algorithm we used in the motivational section. As in the example, the SVD provides a transformation of the original data. This transformation has some very useful properties.

The main result SVD provides is that we can write an \(m \times n\) matrix \(\mathbf{Y}\) as

\[\mathbf{U}^\top\mathbf{Y} = \mathbf{DV}^\top\]

With:

  • \(\mathbf{U}\) is an \(m \times p\) orthogonal matrix
  • \(\mathbf{V}\) is an \(n \times p\) orthogonal matrix
  • \(\mathbf{D}\) is a \(p \times p\) diagonal matrix

with \(p=\mbox{min}(m,n)\). \(\mathbf{U}^\top\) provides a rotation of our data \(\mathbf{Y}\) that turns out to be very useful because the variability (sum of squares to be precise) of the rows of \(\mathbf{U}^\top \mathbf{Y}=\mathbf{DV}^\top\) is decreasing. Because \(\mathbf{U}\) is orthogonal, we can write the SVD like this:

\[\mathbf{Y} = \mathbf{UDV}^\top\]

In fact, this formula is much more commonly used. We can also write the transformation like this:

\[\mathbf{YV} = \mathbf{UD}\]

This transformation of \(Y\) also results in a matrix with columns that have decreasing sums of squares.

Applying the SVD to the motivating example we have:

library(rafalib)
library(MASS)
n <- 100
y <- t(mvrnorm(n,c(0,0), matrix(c(1,0.95,0.95,1),2,2)))

mypar()
LIM<-c(-3.5,3.5)
plot(y[1,],y[2,],xlim = LIM,ylim = LIM)

s <- svd(y)

We can immediately see that applying the SVD results in a transformation very similar to the one we used in the motivating example:

round(sqrt(2) * s$u , 3)
##        [,1]   [,2]
## [1,] -0.982 -1.017
## [2,] -1.017  0.982

The plot we showed after the rotation was showing what we call the principal components: the second plotted against the first. To obtain the principal components from the SVD, we simply need the rows of the rotation \(\mathbf{U}^\top\mathbf{Y} = \mathbf{DV}^\top\):

PC1 = s$d[1]*s$v[,1]
PC2 = s$d[2]*s$v[,2]
plot(PC1,PC2,xlim=c(-3,3),ylim=c(-3,3))
Second PC plotted against first PC for the twins height data

Second PC plotted against first PC for the twins height data

How is this useful?

It is not immediately obvious how incredibly useful the SVD can be, so let’s consider some examples. In this example, we will greatly reduce the dimension of \(V\) and still be able to reconstruct \(Y\).

Let’s compute the SVD on the gene expression table we have been working with. We will take a subset of 500 genes so that computations are faster.

library(tissuesGeneExpression)
data(tissuesGeneExpression)
set.seed(1)
ind <- sample(nrow(e),500) 
Y <- t(apply(e[ind,],1,scale)) #standardize data for illustration

The svd command returns the three matrices (only the diagonal entries are returned for \(D\))

s <- svd(Y)
U <- s$u
V <- s$v
D <- diag(s$d) ##turn it into a matrix

First note that we can in fact reconstruct y:

Yhat <- U %*% D %*% t(V)
resid <- Y - Yhat
max(abs(resid))
## [1] 3.552714e-14
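
Before moving on, we can also verify the ordering property mentioned earlier: the columns of \(\mathbf{YV}=\mathbf{UD}\) have sums of squares equal to \(d_i^2\), which are non-increasing (a small check using the objects just created):

ss <- colSums((Y %*% V)^2) ##sums of squares of the columns of YV = UD
head(round(ss - s$d^2, 6)) ##they equal the squared singular values
all(diff(s$d) <= 0) ##and the d_i are returned in non-increasing order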

If we look at the sum of squares of \(\mathbf{UD}\), we see that the last few are quite close to 0 (perhaps we have some replicated columns).

plot(s$d)
Entries of the diagonal of D for gene expression data.

Entries of the diagonal of D for gene expression data.

This implies that the last columns of \(V\) have a very small effect on the reconstruction of Y. To see this, consider the extreme example in which the last entry of \(\mathbf{D}\) is 0. In this case the last column of \(V\) is not needed at all. Because of the way the SVD is created, the columns of \(V\) have less and less influence on the reconstruction of \(Y\). You commonly see this described as “explaining less variance”. This implies that for a large matrix, by the time you get to the last columns, it is possible that there is not much left to “explain”. As an example, we will look at what happens if we remove the last four columns:

k <- ncol(U)-4
Yhat <- U[,1:k] %*% D[1:k,1:k] %*% t(V[,1:k])
resid <- Y - Yhat 
max(abs(resid))
## [1] 3.552714e-14

The largest residual is practically 0, meaning that Yhat is practically the same as Y, yet we need 4 fewer dimensions to transmit the information.

By looking at \(d\), we can see that, in this particular dataset, we can obtain a good approximation keeping only 94 columns. The following plots are useful for seeing how much of the variability is explained by each column:

plot(s$d^2/sum(s$d^2)*100,ylab="Percent variability explained")
Percent variance explained by each principal component of gene expression data.

Percent variance explained by each principal component of gene expression data.

We can also make a cumulative plot:

plot(cumsum(s$d^2)/sum(s$d^2)*100,ylab="Percent variability explained",ylim=c(0,100),type="l")
Cumulative variance explained by principal components of gene expression data.

Cumulative variance explained by principal components of gene expression data.

Although we start with 189 dimensions, we can approximate \(Y\) with just 95:

k <- 95 ##out of a possible 189
Yhat <- U[,1:k] %*% D[1:k,1:k] %*% t(V[,1:k])
resid <- Y - Yhat
boxplot(resid,ylim=quantile(Y,c(0.01,0.99)),range=0)
Residuals from comparing a reconstructed gene expression table using 95 PCs to the original data with 189 dimensions.

Residuals from comparing a reconstructed gene expression table using 95 PCs to the original data with 189 dimensions.

Therefore, by using only half as many dimensions, we retain most of the variability in our data:

var(as.vector(resid))/var(as.vector(Y))
## [1] 0.04076899

We say that we explain 96% of the variability.

Note that we can compute this proportion from \(D\):

1-sum(s$d[1:k]^2)/sum(s$d^2)
## [1] 0.04076899

The entries of \(D\) therefore tell us how much each PC contributes in terms of variability explained.
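
For instance, we can use the entries of \(d\) to find the smallest number of PCs needed to reach a given threshold (a small sketch, using 95% as an arbitrary cutoff):

cumvar <- cumsum(s$d^2)/sum(s$d^2) ##cumulative proportion of variability explained
min(which(cumvar >= 0.95)) ##smallest number of PCs explaining at least 95%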

Highly correlated data

To help understand how the SVD works, we construct a dataset with two highly correlated columns.

For example:

m <- 100
n <- 2
x <- rnorm(m)
e <- rnorm(n*m,0,0.01)
Y <- cbind(x,x)+e
cor(Y)
##           x         x
## x 1.0000000 0.9998873
## x 0.9998873 1.0000000

In this case, the second column adds very little “information” since all the entries of Y[,1]-Y[,2] are close to 0. Reporting rowMeans(Y) is even more efficient since Y[,1]-rowMeans(Y) and Y[,2]-rowMeans(Y) are even closer to 0. rowMeans(Y) turns out to be the information represented in the first column of \(U\). The SVD helps us notice that we explain almost all the variability with just this first column:
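
We can check this relationship directly (a minimal sketch; recall that the sign of the columns of \(U\) is arbitrary, so the correlation can be close to 1 or -1):

cor(svd(Y)$u[,1], rowMeans(Y)) ##first column of U carries, up to sign and scale, the row means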

d <- svd(Y)$d
d[1]^2/sum(d^2)
## [1] 0.9999441

In cases with many correlated columns, we can achieve great dimension reduction:

m <- 100
n <- 25
x <- rnorm(m)
e <- rnorm(n*m,0,0.01)
Y <- replicate(n,x)+e
d <- svd(Y)$d
d[1]^2/sum(d^2)
## [1] 0.9999047

SVD Exercises

For the following questions use the data loaded with:

library(tissuesGeneExpression)
data(tissuesGeneExpression)


s = svd(e)
signflips = sample(c(-1,1),ncol(e),replace=TRUE)
signflips


newu = sweep(s$u,2,signflips,FUN="*")
newv = sweep(s$v,2,signflips,FUN="*")
all.equal(s$u %*% diag(s$d) %*% t(s$v), newu %*% diag(s$d) %*% t(newv))

This is important to know because different implementations of the SVD algorithm may give different signs, which can lead to the same code resulting in different answers when run in different computer systems.

Compute the SVD of e

library(tissuesGeneExpression)
data(tissuesGeneExpression)
s = svd(e)

Now compute the mean of each row:

m = rowMeans(e)

Q:What is the correlation between the first column of U and m?

cor(s$u[,1],m)
## [1] -0.9999998

A:-0.9999998

SVD Exercises #2

In the above question we saw how the first column of \(U\) relates to the mean of the rows of e. Note that if we change these means, the distances between columns do not change. Here is some R code showing how changing the means does not change the distances:

newmeans = rnorm(nrow(e)) ##random values we will add to create new means
newe = e+newmeans ##we change the means
sqrt(crossprod(e[,3]-e[,45]))
##          [,1]
## [1,] 152.5662
sqrt(crossprod(newe[,3]-newe[,45]))
##          [,1]
## [1,] 152.5662

So we might as well make the mean of each row 0 since it does not help us approximate the column distances. We will define y as the detrended e and recompute the SVD:

y = e - rowMeans(e)
s = svd(y)

resid = y - s$u %*% diag(s$d) %*% t(s$v)
max(abs(resid))
## [1] 1.188383e-12

x=matrix(rep(c(1,2),each=5),5,2)
x
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    2
## [3,]    1    2
## [4,]    1    2
## [5,]    1    2
x*c(1:5)
##      [,1] [,2]
## [1,]    1    2
## [2,]    2    4
## [3,]    3    6
## [4,]    4    8
## [5,]    5   10

Note that the above code is actually equivalent to:

sweep(x,1,1:5,"*")
##      [,1] [,2]
## [1,]    1    2
## [2,]    2    4
## [3,]    3    6
## [4,]    4    8
## [5,]    5   10

head(diag(s$d)%*%t(s$v))
##            [,1]        [,2]        [,3]       [,4]       [,5]       [,6]
## [1,]  33.367637  41.4831033  41.3231142  40.054572  36.146505  40.214522
## [2,] -27.601399 -36.9912840   3.8035407  -9.754504 -19.703781   5.935035
## [3,] -50.926062 -36.4568742 -36.1291140 -32.010383 -44.331295 -29.773335
## [4,]   4.524792  -1.0473508   0.5466589   4.963027   6.833245   2.071492
## [5,] -21.690513 -10.5313431 -48.1445359 -37.266098 -27.495327 -51.230683
## [6,]  -5.230301  -0.7337037   0.6569955  11.035631   6.149118   1.222308
##           [,7]       [,8]       [,9]       [,10]      [,11]      [,12]
## [1,]  32.59423  42.583783  32.465606  42.5402783  39.373302  39.520781
## [2,] -11.28229   2.647384 -27.595578   1.5585022 -24.119388   8.198632
## [3,] -14.29959 -37.908111 -46.801081 -37.0916036 -44.650695 -26.351326
## [4,]  25.35232   2.140486   4.794053   0.6245207  -4.025119   3.115839
## [5,] -34.41662 -47.687154  -9.856736 -44.5472121 -20.512852 -51.098702
## [6,]  19.98217  -1.410014   8.796039   0.9599773   2.024163   2.140379
##            [,13]       [,14]       [,15]      [,16]      [,17]      [,18]
## [1,]  41.3492974  43.0444694  36.0590381  40.925094 -60.457238 -75.211616
## [2,] -19.9228038   3.8530855 -19.4372820  -5.384144  10.227393  12.114447
## [3,] -41.7314839 -37.8437578 -37.3469638 -44.321504  -5.132005  -4.193173
## [4,]   5.2151606  -0.6371225   0.8709797  -1.360416  54.828739  49.993499
## [5,] -26.8829708 -46.7057159 -29.8633453 -41.059262   6.457505   7.554507
## [6,]   0.3630771  -0.4588111   9.7093570  -5.468645   3.352128   1.254389
##          [,19]      [,20]     [,21]      [,22]       [,23]      [,24]
## [1,] -44.15597 -71.769026 -44.25359 -67.307760 -63.4041693 -78.677496
## [2,]  28.95854   3.312265  22.16562  15.080366  16.6166982  14.365586
## [3,]  26.46644 -26.649087  27.54664  -2.610385  -0.3471538 -13.716782
## [4,]  65.59601  30.222790  60.37822  48.986309  60.4644378  45.680290
## [5,] -20.94215  24.672160 -21.24658   1.839369   1.3876087  13.308978
## [6,]  20.24829  -4.854346  16.41612   4.421392   4.3140449  -1.580885
##           [,25]      [,26]      [,27]      [,28]      [,29]      [,30]
## [1,] -57.572295 -51.236362 -86.192069 -86.246537 -87.373816 -87.971240
## [2,]  11.940728  14.762366  14.729293  10.426813  12.136567  13.584687
## [3,]  -5.635827   2.572384  -6.621586 -14.990118 -17.676922 -12.221112
## [4,]  59.217052  60.766336  37.811851  33.009076  33.628887  46.917434
## [5,]   1.715368  -2.752669   9.481619  20.761441  20.158266  16.485205
## [6,]   2.364528   7.975162  -1.748872  -2.887229  -4.610575  -1.292198
##           [,31]      [,32]      [,33]     [,34]     [,35]      [,36]
## [1,] -94.173237 -68.975716 -77.813134 -84.70253 -75.71193 -77.406183
## [2,]   9.491519   9.127942  10.946127  12.77127  10.39398  12.863036
## [3,] -20.277492 -14.364144 -18.886451 -15.24421 -19.89104 -21.105707
## [4,]  20.373444  50.883197  39.870979  29.58295  37.50556  46.345993
## [5,]  25.763909  14.622713  19.649893  17.91162  16.61512  17.165529
## [6,]  -8.174586  -2.391924  -2.672529  -3.71899  -3.76827  -2.156575
##           [,37]      [,38]       [,39]      [,40]      [,41]      [,42]
## [1,] -75.042931 -75.684512 -64.2192103 -72.507041 -71.516921 -77.897176
## [2,]   4.948577   7.612535  10.9887405   9.769956   8.310735   7.312644
## [3,] -35.528578 -22.388048  -5.3956002 -18.198432 -24.233036 -23.139147
## [4,]  35.975543  38.603160  45.9184759  46.751950  44.010880  29.999380
## [5,]  27.147774  23.570991   5.7929946  18.145949  19.392003  23.704514
## [6,]  -8.121232  -6.048199   0.9269622  -1.784518  -4.475445  -6.759738
##           [,43]       [,44]      [,45]       [,46]       [,47]      [,48]
## [1,] -58.639533 -65.2141149 -70.035000 -73.7757676 -81.0525402 -78.723452
## [2,]   9.326129  11.0749525  15.221963  14.1575678  13.0177523  14.770936
## [3,] -14.214822 -15.9047645  -8.333618  -4.4617641  26.1952147  24.385261
## [4,]  47.612046  36.9834476  49.849995  41.1116866 -42.9035266 -32.929194
## [5,]  10.627387  13.6933536   6.595174   6.3655263 -15.9663464 -16.032088
## [6,]   1.913975  -0.5173279   2.113035   0.2809776  -0.1751373   2.609717
##           [,49]      [,50]       [,51]       [,52]      [,53]       [,54]
## [1,] -92.584069 -94.284715 -83.9090802 -85.5587565 -92.664422 -81.2806187
## [2,]  10.513449  11.331870  12.1360409  13.2063731  10.197738  12.1522335
## [3,]  16.387548  11.252822  22.3108034  20.0190004  19.328898  18.7769469
## [4,] -49.475184 -42.223194 -40.7993584 -45.4622202 -50.411798 -37.5384548
## [5,]  -8.306340  -5.658941 -13.2573022 -11.9229408  -8.582672 -11.2250079
## [6,]  -3.452116  -4.048821  -0.9851206  -0.8062555  -3.158114  -0.2423327
##           [,55]      [,56]      [,57]      [,58]       [,59]      [,60]
## [1,] -90.056835 -81.579261 -86.904904 -89.447061 -83.1749740 -92.143550
## [2,]  11.694871  17.823307  13.281555   9.397437  11.5131214  13.053308
## [3,]   7.615741  14.528072  14.490858  10.528267  26.8460968   9.109598
## [4,] -32.738793 -20.833561 -37.929329 -39.344576 -35.7292839 -28.578189
## [5,]  -3.133610 -12.390794  -7.951112  -5.321080 -17.1183938  -1.917909
## [6,]  -3.848836   2.016966  -2.073287  -4.180562   0.4460951  -3.096856
##           [,61]      [,62]      [,63]      [,64]       [,65]      [,66]
## [1,] -91.456078 -88.887046 -90.233124 -96.744910 -86.6261752 -91.997996
## [2,]  11.462692  11.013860  12.539940   9.674577  14.2762227  12.096475
## [3,]  15.343854  17.705234  23.991978  11.858034  20.4639947  17.439556
## [4,] -38.619974 -46.991836 -45.646917 -50.055940 -35.8553309 -50.872410
## [5,]  -7.713613  -6.639571 -10.831373  -2.144098 -10.9904606  -3.666205
## [6,]  -3.252670  -2.838944  -2.043479  -5.495929  -0.1032774  -3.078567
##            [,67]      [,68]      [,69]      [,70]       [,71]      [,72]
## [1,] -79.4161180 -83.114347 -80.276192 -93.332010 -85.7186866 -90.666677
## [2,]  15.1517286  11.726002  15.941913  10.517398  14.5372751  12.347679
## [3,]  14.4623966  12.004229  24.305786  16.263302  23.9922845  15.236269
## [4,] -29.6245948 -36.001964 -36.183648 -50.590173 -35.5550203 -42.342573
## [5,] -11.9008018  -7.563350 -13.249556  -5.051157 -11.2256703  -6.278001
## [6,]   0.4639882  -2.243848   2.983838  -2.677211   0.8018736  -3.734171
##           [,73]      [,74]     [,75]      [,76]      [,77]      [,78]
## [1,] -87.632737  33.314366  34.28248  32.714225  31.789337  31.369118
## [2,]  11.704497  -8.306140 -37.40025 -36.313390  -5.398339 -32.592026
## [3,]  11.799841 -48.553364 -60.10559 -51.237283 -34.228947 -54.631349
## [4,] -41.688734  -6.546652 -13.81186 -13.245673  -9.158657 -14.825575
## [5,]  -6.613230 -40.432225 -10.08654 -14.575518 -44.787446 -15.059038
## [6,]  -3.659209  -8.391221 -12.53913  -6.311445  -4.592069  -2.853735
##           [,79]      [,80]      [,81]      [,82]     [,83]      [,84]
## [1,]  33.749103  28.381207  41.432926  33.662836  26.60776  26.631732
## [2,] -34.709635 -35.444291 -24.036766   1.336311 -34.24782 -37.927647
## [3,] -64.470246 -66.507667 -40.951957 -37.044633 -62.86029 -68.176073
## [4,] -17.040738 -22.850756  -7.273004  -3.124432 -13.83541  -5.751024
## [5,] -13.387400 -12.456951 -19.145823 -46.788572 -10.09864  -9.124225
## [6,]  -8.389626  -7.064372  -4.283177  -5.738806  -8.34524  -5.714543
##           [,85]      [,86]      [,87]      [,88]      [,89]      [,90]
## [1,]  40.834528  40.514163  46.907726  40.231520  40.105127  40.782618
## [2,] -27.513912 -24.728950 -46.055433 -47.008404 -50.325528 -40.268213
## [3,] -45.459729 -21.357504  35.515171  34.943135  29.844441  37.013588
## [4,] -10.242214  -3.836492   4.260908   5.781422   1.787436   2.807525
## [5,] -10.057771  -8.391107  14.035649  13.189788  15.623377  14.045943
## [6,]  -3.090973   1.515807 -19.394340 -16.115001 -19.773190 -13.337995
##           [,91]      [,92]      [,93]      [,94]      [,95]      [,96]
## [1,]  43.406114  37.085924  36.760032  34.484704  31.432023  35.359492
## [2,] -49.716762 -40.529108 -49.519598 -46.839334 -40.090512 -51.937781
## [3,]  25.559645  28.985812  36.552190  48.857606  57.579230  35.029694
## [4,]  -1.903568  12.375724   2.680657   3.776674  10.933010  -1.627285
## [5,]  14.580099   3.103755   9.966710  10.162979   5.373957  17.336728
## [6,] -20.096101 -14.849499 -22.431974 -22.285589 -15.004971 -21.961573
##           [,97]      [,98]      [,99]     [,100]    [,101]     [,102]
## [1,]  40.646748  39.057892  35.283735  34.842447  37.31653  34.005723
## [2,] -48.975159 -44.751827 -48.952934 -32.405206 -36.24102 -46.997263
## [3,]  46.328410  36.178407  48.040356  56.189465  45.46078  33.546115
## [4,]   1.559785   5.638875   5.452021  11.702125  13.80448   4.231480
## [5,]  17.536384  10.211294  14.866807  -8.154364  -4.43691   7.816503
## [6,] -20.474456 -13.970778 -23.345771 -17.155983 -13.86496 -21.497009
##           [,103]     [,104]     [,105]     [,106]      [,107]     [,108]
## [1,]  36.1632209  33.670727  34.681902  32.871491  35.1441322  31.541488
## [2,] -46.8574689 -47.674143 -48.934565 -25.880847 -53.7877641 -48.703807
## [3,]  43.9073132  40.436640  45.837381  58.396256  36.6482569  33.993728
## [4,]  -0.7209663  10.024784   5.237659  21.756277   0.8467065  12.644877
## [5,]  13.0948583   6.304818   9.561762 -16.204317  15.8793871   6.152819
## [6,] -21.0891764 -22.173333 -26.820179  -8.213545 -25.6033224 -20.161207
##          [,109]     [,110]    [,111]     [,112]     [,113]     [,114]
## [1,]  37.466705  38.977120  35.74300  36.373310  42.655368  44.015622
## [2,] -40.235347 -49.599028 -29.30029 -44.047932 -46.655314 -44.594376
## [3,]  42.260781  30.699038  42.49971  43.396432  36.540274  41.327025
## [4,]   7.536322   4.897427  10.91812   4.636752  -1.412658   3.311874
## [5,]   9.002062  19.361876 -10.55998   9.162079  14.847425  10.063732
## [6,] -15.995086 -20.770830 -12.45712 -18.175875 -17.236909 -15.900752
##          [,115]     [,116]     [,117]     [,118]     [,119]     [,120]
## [1,]  35.426895  41.629626  37.834989  33.764086  35.671487  34.347328
## [2,] -44.598570 -46.008158 -41.892543 -38.008929 -38.842025 -36.300604
## [3,]  48.853194  27.580561  44.965919  41.173175  36.832561  45.044267
## [4,]   7.913213   5.247728   6.599312  13.829884  11.948720  11.692827
## [5,]   8.343913   8.463389   8.738752  -2.001485  -2.790566  -1.172338
## [6,] -24.497354 -16.958048 -16.254917 -16.159725 -15.013996 -11.046586
##           [,121]      [,122]     [,123]    [,124]    [,125]     [,126]
## [1,]  28.4388490  28.8657695  30.322130 68.435864 68.972434 68.2083968
## [2,] -16.5643725 -16.5168836  -1.131207 65.203459 67.924514 65.1255031
## [3,] -17.4005896 -15.7104024  13.283234 11.657900 17.288736 13.8415857
## [4,]  -2.7888013  -3.8353698  10.726409 -1.502951  2.185966  0.5892529
## [5,] -32.0634122 -33.0245579 -47.862244  6.337016  1.643290  4.9071939
## [6,]   0.6584198   0.8709017  14.786558 -6.964477 -3.204043 -5.9392508
##          [,127]     [,128]     [,129]    [,130]    [,131]    [,132]
## [1,]  27.941744  28.725684  27.521909  72.30697 78.263842  73.34954
## [2,] -25.123973 -23.393290 -26.929069  55.65595 67.595255  55.16477
## [3,] -33.413997 -30.169395 -37.623886 -19.19935 -1.470898 -22.28375
## [4,] -11.766005 -10.537170 -14.253977 -15.85260 -6.545949 -18.40324
## [5,] -23.952314 -23.813260 -22.101393  31.33529 10.944371  34.26888
## [6,]  -5.062072  -2.963673  -6.178614 -22.04901 -6.874094 -23.23595
##          [,133]      [,134]      [,135]      [,136]    [,137]    [,138]
## [1,]  29.354389  40.1114167 -64.9830154 -71.3409020 68.006301  68.20969
## [2,] -10.127824  -0.5657104  22.7268600  12.7558427 68.436793  65.78385
## [3,]  -3.354383  -7.1413694  29.3723204   0.6453371 13.456476 -20.11513
## [4,]   2.478723   6.2103351  -0.9485795  20.7933695 -1.689245 -24.04656
## [5,] -38.194417 -34.3383585 -21.1974608   0.8135168  7.978905  42.11802
## [6,]   5.534724   7.1165564  13.7670883   3.1881075 -8.278970 -16.89485
##           [,139]    [,140]    [,141]     [,142]    [,143]    [,144]
## [1,]  70.3674459  72.27429 69.447949 78.7482696 -28.82868 -26.26544
## [2,]  75.2676823  61.84340 77.721061 79.0253794  41.23766  39.50850
## [3,]  -0.4037094 -21.77508 14.702300  0.5757421  65.48720  66.01022
## [4,] -14.5276786 -21.65319 -1.693676 -9.4852261  14.81416  17.42570
## [5,]  26.7780028  36.64010  7.757327 20.7927812 -54.01261 -55.32920
## [6,]  -8.6661990 -15.44135  2.421507  0.1064769  37.77934  37.70582
##         [,145]    [,146]    [,147]    [,148]     [,149]     [,150]
## [1,]  60.17189  64.03324  41.15777  41.47376  30.101725  11.779245
## [2,]  79.64614  83.65719  40.59086  35.43083 -33.310918 -25.136022
## [3,]  63.64943  62.82029  51.22174  36.90235  -6.745666 -21.975715
## [4,]  29.99845  26.69236  34.22890  28.89347  -5.787455 -13.354616
## [5,] -44.44604 -43.83300 -78.30757 -72.37405   2.259030   9.298516
## [6,]  33.12262  34.80677  40.01084  33.67397  14.178404   9.623824
##          [,151]     [,152]     [,153]     [,154]     [,155]        [,156]
## [1,]  29.460157  28.070405  30.926110  20.055650  26.339439  29.059920268
## [2,] -39.394487 -34.995285 -35.976172 -41.091805 -30.446289 -30.358328639
## [3,] -10.180139   2.656477  -7.479663 -15.485143   4.283788  -0.008798855
## [4,] -11.335027  -4.039199 -12.669814  -7.580626  -4.402318  -7.290759338
## [5,]  10.834789   1.275933   9.835466   8.212026   7.278833  -2.980116158
## [6,]   8.511958   9.938846  10.082679   7.123183  10.132298  11.299425693
##          [,157]     [,158]     [,159]     [,160]     [,161]     [,162]
## [1,]  15.456020  15.680297  14.607636   7.486508  12.608877  27.910549
## [2,] -35.024631 -31.902263 -23.354183 -35.326185 -26.395047 -31.837675
## [3,] -35.416566 -32.715015 -14.094522 -39.911163 -34.039330 -28.351016
## [4,]  -6.962547  -2.090082   1.691843  -5.639383  -4.526072  -2.860365
## [5,]  10.067893   5.851227  -5.564587  11.617660   5.496256  12.290105
## [6,]   7.335580  12.067746  11.829491   7.991133   6.636579   6.922362
##          [,163]      [,164]    [,165]    [,166]    [,167]      [,168]
## [1,]  24.703197 85.00936643 81.224927 80.541713  83.31780 85.00936643
## [2,] -35.046177 83.19134243 88.304562 78.209180  83.78608 83.19134243
## [3,] -32.814228  0.06082154 -1.280238  1.463048 -14.93206  0.06082154
## [4,]  -5.950835 -2.58632020 -6.366312 -3.378867 -14.96246 -2.58632020
## [5,]   9.297420 20.22444755 22.001401 18.966904  32.37009 20.22444755
## [6,]   6.972852 -6.66505693 -8.463999 -5.419126 -13.19450 -6.66505693
##         [,169]    [,170]    [,171]     [,172]     [,173]    [,174]
## [1,] 80.541713  83.31780 81.224927  78.120610  80.591847 78.733247
## [2,] 78.209180  83.78608 88.304562  85.483368  87.517638 95.458838
## [3,]  1.463048 -14.93206 -1.280238  -6.650727  -5.974910  8.931424
## [4,] -3.378867 -14.96246 -6.366312 -10.897758 -13.303653 -3.712412
## [5,] 18.966904  32.37009 22.001401  21.332763  25.249790 11.860348
## [6,] -5.419126 -13.19450 -8.463999  -9.813258  -9.396266 -4.729341
##          [,175]      [,176]     [,177]     [,178]     [,179]      [,180]
## [1,]  83.268211 -24.4119520 -53.368473 -49.901942 -40.819934 -37.1520758
## [2,]  91.883614  -7.9618118  -5.207480 -13.717189 -14.424506 -10.9835337
## [3,]  -0.806764 -30.6437611 -33.272994 -50.925211 -45.942245 -38.4919417
## [4,] -12.998304   3.0841163  -4.833968  -4.968734  -5.143056   0.6631418
## [5,]  20.569930  24.5018735  26.602936  38.158191  31.630936  24.3114681
## [6,]  -7.205963  -0.3620428  -5.450260 -10.981845  -6.737838  -4.6758068
##          [,181]     [,182]     [,183]    [,184]     [,185]    [,186]
## [1,] -89.061675 -79.608354 -99.826173  30.35769  24.929772  39.67329
## [2,]  10.379737  -1.401949   6.971805 -58.47980 -51.726650 -44.62706
## [3,]   1.412092 -25.064365  -6.961008 -12.30664  -6.589041  10.25816
## [4,] -41.059393 -21.325554 -49.302884 -25.87458 -20.229914 -18.87999
## [5,]   3.968298  14.846073  12.289889  66.68990  57.532339  49.29995
## [6,]  -2.558079 -12.710234  -6.324220  86.66681  80.026494  91.76388
##         [,187]    [,188]    [,189]
## [1,]  43.96331  45.48406  51.89815
## [2,] -36.27300 -24.41609 -22.01634
## [3,]  11.02132  21.33785  27.14429
## [4,] -17.87021 -15.08956 -10.94304
## [5,]  48.88683  36.52675  34.96074
## [6,]  99.18317 101.97758 108.37173
#s$d %*% t(s$v) # incorrect
head(s$d * t(s$v))
## (output identical to the diag(s$d) %*% t(s$v) result shown above)
#t(s$d * s$v) # incorrect
#s$v * s$d  # incorrect

SVD Exercises #3


A:tcrossprod(s$u,vd)

Explanation: Order of operations in R goes left to right, and s$u %*% s$d is multiplying non-conformable matrices.

SVD Exercises #4


z = s$d * t(s$v)

sqrt(crossprod(e[,3]-e[,45]))
##          [,1]
## [1,] 152.5662
sqrt(crossprod(y[,3]-y[,45]))
##          [,1]
## [1,] 152.5662
sqrt(crossprod(z[,3]-z[,45]))
##          [,1]
## [1,] 152.5662

Note that the columns of z have 189 entries, compared to 22,215 for e.

What is the difference (in absolute value) between the actual distance sqrt(crossprod(e[,3]-e[,45])) and the approximation using only two dimensions of z?

realdistance = sqrt(crossprod(e[,3]-e[,45]))
approxdistance = sqrt(crossprod(z[1:2,3]-z[1:2,45]))
abs(realdistance - approxdistance)
##          [,1]
## [1,] 40.62416

A:40.62416

SVD Exercises #5

What is the minimum number of dimensions we need to use for the approximation in SVD Exercises #4 to be within 10% or less of the actual distance?

ks = 1:189
realdistance = sqrt(crossprod(e[,3]-e[,45]))
approxdistances = sapply(ks,function(k){
    sqrt(crossprod(z[1:k,3,drop=FALSE]-z[1:k,45,drop=FALSE] )) 
  })
percentdiff = 100*abs(approxdistances - realdistance)/realdistance
plot(ks,percentdiff) ##take a look

min(ks[which(percentdiff < 10)])
## [1] 7

A:7

SVD Exercises #6

Compute distances between sample 3 and all other samples:

distances = sqrt(apply(e[,-3]-e[,3],2,crossprod))

Recompute this distance using the 2 dimensional approximation.

What is the Spearman correlation between this approximate distance and the actual distance?

approxdistances = sqrt(apply(z[1:2,-3]-z[1:2,3],2,crossprod))
plot(distances,approxdistances) ##take a look

cor(distances,approxdistances,method="spearman")
## [1] 0.8598592
#Note that this shows how just two dimensions can be useful to get an idea about the actual distances.

A:0.8598592

Multi-Dimensional Scaling Plots (MDS)

http://genomicsclass.github.io/book/pages/mds.html

We will motivate multi-dimensional scaling (MDS) plots with a gene expression example. To simplify the illustration we will only consider three tissues:

library(rafalib)
library(tissuesGeneExpression)
data(tissuesGeneExpression)
colind <- tissue%in%c("kidney","colon","liver")
mat <- e[,colind]
group <- factor(tissue[colind])
dim(mat)
## [1] 22215    99

As an exploratory step, we wish to know if gene expression profiles stored in the columns of mat show more similarity between tissues than across tissues. Unfortunately, as mentioned above, we can’t plot multi-dimensional points. In general, we prefer two-dimensional plots, but making plots for every pair of genes or every pair of samples is not practical. MDS plots become a powerful tool in this situation.

The math behind MDS

Now that we know about SVD and matrix algebra, understanding MDS is relatively straightforward. For illustrative purposes let’s consider the SVD decomposition:

\[\mathbf{Y} = \mathbf{UDV}^\top\]

and assume that the sum of squares of the first two rows of \(\mathbf{U^\top Y=DV^\top}\) is much larger than the sum of squares of all the other rows. This can be written as \(d_1+ d_2 \gg d_3 + \dots + d_n\), with \(d_i\) the \(i\)-th entry of the \(\mathbf{D}\) matrix. When this happens, we then have:

\[\mathbf{Y}\approx [\mathbf{U}_1 \mathbf{U}_2] \begin{pmatrix} d_{1}&0\\ 0&d_{2}\\ \end{pmatrix} [\mathbf{V}_1 \mathbf{V}_2]^\top \]

This implies that column \(i\) is approximately:

\[ \mathbf{Y}_i \approx [\mathbf{U}_1 \mathbf{U}_2] \begin{pmatrix} d_{1}&0\\ 0&d_{2}\\ \end{pmatrix} \begin{pmatrix} v_{i,1}\\ v_{i,2}\\ \end{pmatrix} = [\mathbf{U}_1 \mathbf{U}_2] \begin{pmatrix} d_{1} v_{i,1}\\ d_{2} v_{i,2} \end{pmatrix} \]

If we define the following two dimensional vector…

\[\mathbf{Z}_i=\begin{pmatrix} d_{1} v_{i,1}\\ d_{2} v_{i,2} \end{pmatrix} \]

… then

\[ \begin{align*} (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) &\approx \left\{ [\mathbf{U}_1 \mathbf{U}_2] (\mathbf{Z}_i-\mathbf{Z}_j) \right\}^\top \left\{[\mathbf{U}_1 \mathbf{U}_2] (\mathbf{Z}_i-\mathbf{Z}_j)\right\}\\ &= (\mathbf{Z}_i-\mathbf{Z}_j)^\top [\mathbf{U}_1 \mathbf{U}_2]^\top [\mathbf{U}_1 \mathbf{U}_2] (\mathbf{Z}_i-\mathbf{Z}_j) \\ &=(\mathbf{Z}_i-\mathbf{Z}_j)^\top(\mathbf{Z}_i-\mathbf{Z}_j)\\ &=(Z_{i,1}-Z_{j,1})^2 + (Z_{i,2}-Z_{j,2})^2 \end{align*} \]

This derivation tells us that the distance between samples \(i\) and \(j\) is approximated by the distance between two dimensional points.

\[ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) \approx (Z_{i,1}-Z_{j,1})^2 + (Z_{i,2}-Z_{j,2})^2 \]

Because \(Z\) is a two dimensional vector, we can visualize the distances between each sample by plotting \(\mathbf{Z}_1\) versus \(\mathbf{Z}_2\) and visually inspect the distance between points. Here is this plot for our example dataset:

s <- svd(mat-rowMeans(mat))

z<-diag(s$d[1:2]) %*% t(s$v[,1:2])
dim(z)
## [1]  2 99
z<-t(z)
plot(z)
Multi-dimensional scaling (MDS) plot for tissue gene expression data.

Multi-dimensional scaling (MDS) plot for tissue gene expression data.

library(rafalib)
mypar(1,1)
plot(z[,1],z[,2],pch=21,bg=as.numeric(group))
legend("bottomright",levels(group),col=seq(along=levels(group)),pch=15,cex=1.5)
Multi-dimensional scaling (MDS) plot for tissue gene expression data.

Multi-dimensional scaling (MDS) plot for tissue gene expression data.

PC1 <- s$d[1]*s$v[,1]
PC2 <- s$d[2]*s$v[,2]
mypar(1,1)
plot(PC1,PC2,pch=21,bg=as.numeric(group))
legend("bottomright",levels(group),col=seq(along=levels(group)),pch=15,cex=1.5)
Multi-dimensional scaling (MDS) plot for tissue gene expression data.

Multi-dimensional scaling (MDS) plot for tissue gene expression data.

Note that the points separate by tissue type as expected. Now the accuracy of the approximation above depends on the proportion of variance explained by the first two principal components. As we showed above, we can quickly see this by plotting the variance explained plot:

plot(s$d^2/sum(s$d^2))
Variance explained for each principal component.

Variance explained for each principal component.

Although the first two PCs explain over 50% of the variability, there is plenty of information that this plot does not show. However, it is an incredibly useful plot for obtaining, via visualization, a general idea of the distance between points. Also, notice that we can plot other dimensions as well to search for patterns. Here are the 3rd and 4th PCs:

PC3 <- s$d[3]*s$v[,3]
PC4 <- s$d[4]*s$v[,4]
mypar(1,1)
plot(PC3,PC4,pch=21,bg=as.numeric(group))
legend("bottomright",levels(group),col=seq(along=levels(group)),pch=15,cex=1.5)
Third and fourth principal components.

Third and fourth principal components.

Note that the 4th PC shows a strong separation within the kidney samples. Later we will learn about batch effects, which might explain this finding.

cmdscale

Although we used the svd function above, there is a special function, cmdscale, that is made specifically for MDS plots. It takes a distance object as an argument and then uses principal component analysis to provide the best approximation to the distances that can be obtained with \(k\) dimensions. This function is more efficient because one does not have to perform the full SVD, which can be time consuming. By default it returns a two-dimensional approximation, but we can change that through the argument k.

d <- dist(t(mat))
mds <- cmdscale(d)

mypar()
plot(mds[,1],mds[,2],bg=as.numeric(group),pch=21,
     xlab="First dimension",ylab="Second dimension")
legend("bottomleft",levels(group),col=seq(along=levels(group)),pch=15)

MDS computed with cmdscale function. These two approaches are equivalent up to an arbitrary sign change.

mypar(1,2)
for(i in 1:2){
  plot(mds[,i],s$d[i]*s$v[,i],main=paste("PC",i))
  b = ifelse( cor(mds[,i],s$v[,i]) > 0, 1, -1)
  abline(0,b) ##b is 1 or -1 depending on the arbitrary sign "flip"
}
Comparison of MDS first two PCs to SVD first two PCs.

Comparison of MDS first two PCs to SVD first two PCs.

Why the arbitrary sign?

The SVD is not unique because we can multiply any column of \(\mathbf{V}\) by -1 as long as we multiply the same column of \(\mathbf{U}\) by -1. We can see this immediately by noting that:

\[ (-\mathbf{U})\mathbf{D}(-\mathbf{V})^\top = \mathbf{UDV}^\top \]

Why we subtract the mean

In all calculations above we subtract the row means before we compute the singular value decomposition. If what we are trying to do is approximate the distance between columns, the distance between \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) is the same as the distance between \(\mathbf{Y}_i- \mathbf{\mu}\) and \(\mathbf{Y}_j - \mathbf{\mu}\) since the \(\mu\) cancels out when computing said distance:

\[ \left\{ ( \mathbf{Y}_i- \mathbf{\mu} ) - ( \mathbf{Y}_j - \mathbf{\mu} ) \right\}^\top \left\{ (\mathbf{Y}_i- \mathbf{\mu}) - (\mathbf{Y}_j - \mathbf{\mu} ) \right\} = \left\{ \mathbf{Y}_i- \mathbf{Y}_j \right\}^\top \left\{ \mathbf{Y}_i - \mathbf{Y}_j \right\} \]

Because removing the row averages reduces the total variation, it can only make the SVD approximation better.

MDS Exercises

http://genomicsclass.github.io/book/pages/mds_exercises.html

For the following questions use the data loaded with:

library(tissuesGeneExpression)
data(tissuesGeneExpression)

MDS Exercises #1

In these exercises we will demonstrate the relationship between the SVD and the output of cmdscale, the function in R that performs MDS.

Using the z we computed in SVD Exercises #4

y = e - rowMeans(e)
s = svd(y)
z = s$d * t(s$v)

we can make an mds plot

library(rafalib)
ftissue = factor(tissue)
mypar(1,1)
plot(z[1,],z[2,],col=as.numeric(ftissue))
legend("topleft",levels(ftissue),col=seq_along(ftissue),pch=1)

Now run the function cmdscale on the original data

d = dist(t(e))
mds = cmdscale(d)

Q:What is the correlation between the first row of z and the first column in mds?

cor(z[1,],mds[,1])
## [1] -1

A:-1

MDS Exercises #2

What is the correlation between the second row of z and the second column of mds?

cor(z[2,],mds[,2])
## [1] -1

A:-1

MDS Exercises #3

Note that the mds plot is not the same:

library(rafalib)
ftissue = factor(tissue)
mypar(1,2)
plot(z[1,],z[2,],col=as.numeric(ftissue))
legend("topleft",levels(ftissue),col=seq_along(ftissue),pch=1)
plot(mds[,1],mds[,2],col=as.numeric(ftissue))

Given the answer to MDS Exercises #1, what do we have to do to z[1,] and z[2,] to get a practically identical plot?

A:multiply z[1,] and z[2,] by -1

Explanation:From the answer to the previous question we see that we have a flipped sign in both dimensions.

MDS Exercises #4

Load the following dataset

library(GSE5859Subset)
data(GSE5859Subset)
dim(geneExpression)
## [1] 8793   24
write.csv(geneExpression,"geneExpression.csv")

Compute the svd and compute z

s = svd(geneExpression-rowMeans(geneExpression))
z = s$d * t(s$v)

Q:Which dimension of z most correlates with the outcome sampleInfo$group?

which.max(cor(sampleInfo$g,t(z)))
## [1] 1

A:1

MDS Exercises #5

Load the following dataset

library(GSE5859Subset)
data(GSE5859Subset)

Compute the svd and compute z

s = svd(geneExpression-rowMeans(geneExpression))
z = s$d * t(s$v)

Q:What is this max correlation?

max(cor(sampleInfo$g,t(z)))
## [1] 0.6236858

A:0.6236858

MDS Exercises #6

Load the following dataset

library(GSE5859Subset)
data(GSE5859Subset)

Compute the svd and compute z

s = svd(geneExpression-rowMeans(geneExpression))
z = s$d * t(s$v)

Q:Which dimension of z has the second highest correlation with the outcome sampleInfo$group?

which.max(cor(sampleInfo$g,t(z))[-1]) + 1 #We add 1 because we took out the first.
## [1] 6

A:6

MDS Exercises #7

Note these measurements were made during two months:

sampleInfo$date
##  [1] "2005-06-23" "2005-06-27" "2005-06-27" "2005-10-28" "2005-10-07"
##  [6] "2005-10-07" "2005-10-28" "2005-10-07" "2005-10-07" "2005-10-07"
## [11] "2005-10-07" "2005-10-07" "2005-06-27" "2005-06-23" "2005-06-10"
## [16] "2005-06-23" "2005-06-23" "2005-06-27" "2005-06-23" "2005-06-27"
## [21] "2005-06-23" "2005-10-07" "2005-10-07" "2005-10-07"

We can extract the month this way:

month = format( sampleInfo$date, "%m")
month = factor( month)
month
##  [1] 06 06 06 10 10 10 10 10 10 10 10 10 06 06 06 06 06 06 06 06 06 10 10
## [24] 10
## Levels: 06 10

Q:Which dimension of z has the highest correlation with the outcome month?

which.max(cor( as.numeric(month), t(z)))
## [1] 1

A:1

What is this correlation?

max(cor( as.numeric(month), t(z)))
## [1] 0.8297915

A: 0.8297915

Note that the same dimension is correlated with both the group and the date. Note also that these are correlated:

table(sampleInfo$g,month)
##    month
##     06 10
##   0  9  3
##   1  3  9

So is this first dimension related directly to group or is it related only through the month? Note that the correlation with month is higher. This is related to batch effects which we will learn about later.

MDS Exercises #8 (ADVANCED)

Note: this is an advanced question. Please feel free to discuss on the forum.

In MDS Exercises #7 we saw that one of the dimensions was highly correlated to the sampleInfo$group. Now take the 6th column of U and stratify by the gene chromosome. Remove chrUn and make a boxplot of the values of U6 stratified by chromosome.

Which chromosome looks different from the rest? Copy and paste the name as it appears in geneAnnotation

result = split(s$u[,6],geneAnnotation$CHR)
result = result[ which(names(result)!="chrUn") ]
boxplot(result,range=0)

boxplot(result,range=0,ylim=c(-0.025,0.025))

medians = sapply(result,median)
names(result)[ which.max(abs(medians)) ]
## [1] "chrY"

A:chrY

Given the answer to this question, any guesses as to what sampleInfo$group represents?

Principal Component Analysis

http://genomicsclass.github.io/book/pages/PCA.html

We have already mentioned principal component analysis (PCA) above and noted its relation to the SVD. Here we provide further mathematical details.

Example: Twin heights

We started the motivation for dimension reduction with a simulated example and showed a rotation that is very much related to PCA.

Twin heights scatter plot.

Twin heights scatter plot.

Here we explain specifically what are the principal components (PCs).

Let \(\mathbf{Y}\) be a \(2 \times N\) matrix representing our data. The analogy is that we measure expression from 2 genes and each column is a sample. Suppose we are given the task of finding a \(2 \times 1\) vector \(\mathbf{u}_1\) such that \(\mathbf{u}_1^\top \mathbf{u}_1 = 1\) and it maximizes \[(\mathbf{u}_1^\top\mathbf{Y})^\top (\mathbf{u}_1^\top\mathbf{Y})\]. This can be viewed as a projection of each sample or column of \(\mathbf{Y}\) into the subspace spanned by \(\mathbf{u}_1\). So we are looking for a transformation in which the coordinates show high variability.

Let’s try \(\mathbf{u}=(1,0)^\top\). This projection simply gives us the height of twin 1 shown in orange below. The sum of squares is shown in the title.
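
The plotting code below uses the simulated twin-height matrix Y and the axis limits thelim from the earlier motivating example, whose generating chunk is not echoed here. A minimal sketch to recreate comparable objects (the seed, parameters, and limits are assumptions for illustration):

library(rafalib)
library(MASS)
set.seed(1)
n <- 100
Y <- t(mvrnorm(n, c(0,0), matrix(c(1,0.95,0.95,1), 2, 2))) ##2 x N simulated twin heights
thelim <- c(-3,3) ##common axis limits for the plots below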

mypar(1,1)
plot(t(Y), xlim=thelim, ylim=thelim,
     main=paste("Sum of squares :",round(crossprod(Y[1,]),1)))
abline(h=0)
apply(Y,2,function(y) segments(y[1],0,y[1],y[2],lty=2))
## NULL
points(Y[1,],rep(0,ncol(Y)),col=2,pch=16,cex=0.75)

Can we find a direction with higher variability? How about:

\(\mathbf{u} =\begin{pmatrix}1\\-1\end{pmatrix}\) ? This does not satisfy \(\mathbf{u}^\top\mathbf{u}= 1\) so let’s instead try \(\mathbf{u} =\begin{pmatrix}1/\sqrt{2}\\-1/\sqrt{2}\end{pmatrix}\)

u <- matrix(c(1,-1)/sqrt(2),ncol=1)
w=t(u)%*%Y
mypar(1,1)
plot(t(Y),
     main=paste("Sum of squares:",round(tcrossprod(w),1)),xlim=thelim,ylim=thelim)
abline(h=0,lty=2)
abline(v=0,lty=2)
abline(0,-1,col=2)
Z = u%*%w
for(i in seq(along=w))
  segments(Z[1,i],Z[2,i],Y[1,i],Y[2,i],lty=2)
points(t(Z), col=2, pch=16, cex=0.5)
Data projected onto space spanned by (1 -1).

Data projected onto space spanned by (1 -1).

This relates to the difference between twins, which we know is small. The sum of squares confirms this.

Finally, let’s try:

\(\mathbf{u} =\begin{pmatrix}1/\sqrt{2}\\1/\sqrt{2}\end{pmatrix}\)

u <- matrix(c(1,1)/sqrt(2),ncol=1)
w=t(u)%*%Y
mypar()
plot(t(Y), main=paste("Sum of squares:",round(tcrossprod(w),1)),
     xlim=thelim, ylim=thelim)
abline(h=0,lty=2)
abline(v=0,lty=2)
abline(0,1,col=2)
points(u%*%w, col=2, pch=16, cex=1)
Z = u%*%w
for(i in seq(along=w))
  segments(Z[1,i], Z[2,i], Y[1,i], Y[2,i], lty=2)
points(t(Z),col=2,pch=16,cex=0.5)
Data projected onto space spanned by first PC.

Data projected onto space spanned by first PC.

This is a re-scaled average height, which has a higher sum of squares. There is a mathematical procedure for determining which \(\mathbf{u}\) maximizes the sum of squares, and the SVD provides it for us.
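
As a quick numerical sketch of this, the first column of \(\mathbf{U}\) returned by svd applied to Y is, up to a sign flip, approximately \((1/\sqrt{2}, 1/\sqrt{2})^\top\), and its projection has the largest possible sum of squares:

s <- svd(Y)
u1 <- s$u[ ,1,drop=FALSE] ##direction that maximizes the sum of squares
u1
w <- t(u1) %*% Y
tcrossprod(w) ##sum of squares of this projection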

The principal components

The unit vector \(\mathbf{u}_1\) that maximizes the sum of squares:

\[(\mathbf{u}_1^\top\mathbf{Y})^\top(\mathbf{u}_1^\top\mathbf{Y})\]

\(\mathbf{u}_1^\top\mathbf{Y}\) is referred to as the first PC. The weights \(\mathbf{u}_1\) used to obtain this PC are referred to as the loadings. Using the language of rotations, \(\mathbf{u}_1\) is also referred to as the direction of the first PC, and the projected values \(\mathbf{u}_1^\top\mathbf{Y}\) are the new coordinates.

To obtain the second PC, we repeat the exercise above, but for the residuals:

\[\mathbf{r} = \mathbf{Y} - \mathbf{u}_1 \mathbf{u}_1^\top \mathbf{Y} \]

The second PC is defined by the vector \(\mathbf{u}_2\) with the following properties:

\[ \mathbf{u}_2^\top \mathbf{u}_2=1\]

\[ \mathbf{u}_2^\top \mathbf{u}_1=0\]

and maximizing \((\mathbf{u}_2^\top\mathbf{r})^\top \mathbf{u}_2^\top\mathbf{r}\).

When \(\mathbf{Y}\) has \(m\) rows rather than 2, we repeat the procedure to find the 3rd, 4th, …, \(m\)-th PCs.
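
As a sketch, we can check this numerically: removing the projection onto \(\mathbf{u}_1\) from Y and applying the SVD to the residuals recovers, up to a sign flip, the second column of \(\mathbf{U}\):

s <- svd(Y)
u1 <- s$u[ ,1,drop=FALSE]
r <- Y - u1 %*% (t(u1) %*% Y) ##residuals after removing the first PC
u2 <- svd(r)$u[ ,1,drop=FALSE] ##direction maximizing the residual sum of squares
cbind(u2, s$u[,2]) ##same direction as the second column of U, up to sign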

prcomp

We have shown how to obtain PCs using the SVD. However, R has a function specifically designed to find the principal components. In this case, the data is centered by default. The following function:

pc <- prcomp( t(Y) )

produces the same results as the SVD up to arbitrary sign flips:

s <- svd( Y - rowMeans(Y) )
mypar(1,2)
for(i in 1:nrow(Y) ){
  plot(pc$x[,i], s$d[i]*s$v[,i])
}
Plot showing SVD and prcomp give same results.

Plot showing SVD and prcomp give same results.

The loadings can be found this way:

pc$rotation
##            PC1        PC2
## [1,] 0.7072304  0.7069831
## [2,] 0.7069831 -0.7072304

which are equivalent (up to a sign flip) to:

s$u
##            [,1]       [,2]
## [1,] -0.7072304 -0.7069831
## [2,] -0.7069831  0.7072304

The equivalent of the variance explained is included in the pc$sdev component:

pc$sdev
## [1] 1.2542672 0.2141882

We take the transpose of Y because prcomp assumes the previously discussed ordering: units/samples in rows and features in columns.
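
As a small sketch, the proportion of variance explained by each PC can be computed from either object; pc$sdev and s$d differ only by a factor of \(\sqrt{N-1}\), with \(N\) the number of samples (columns of Y):

pc$sdev^2 / sum(pc$sdev^2) ##proportion of variance explained by each PC
s$d^2 / sum(s$d^2) ##same proportions from the SVD of the centered data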

Week 3: Basic Machine Learning: Clustering

http://genomicsclass.github.io/book/pages/clustering_and_heatmaps.html

**Basic Machine Learning**

Machine learning is a very broad topic and a highly active research area. In the life sciences, much of what is described as “precision medicine” is an application of machine learning to biomedical data. The general idea is to predict or discover outcomes from measured predictors. Can we discover new types of cancer from gene expression profiles? Can we predict drug response from a series of genotypes? Here we give a very brief introduction to two major machine learning components: clustering and class prediction. There are many good resources to learn more about machine learning, for example this book.

Clustering

We will demonstrate the concepts and code needed to perform clustering analysis with the tissue gene expression data:

library(tissuesGeneExpression)
data(tissuesGeneExpression)

To illustrate the main application of clustering in the life sciences, let’s pretend that we don’t know these are different tissues and are interested in clustering. The first step is to compute the distance between each sample:

d <- dist( t(e) )
head(d)
## [1]  85.85460  84.38089 101.09763  90.28968  83.28734 104.89770

Hierarchical clustering

There’s two types of hierarchical clustering, top down and bottom up.

  • In top down, we start dividing things into groups.
  • In bottom up, we join things into groups.

With the distance between each pair of samples computed, we need clustering algorithms to join them into groups. Hierarchical clustering is one of the many clustering algorithms available to do this. Each sample is assigned to its own group and then the algorithm continues iteratively, joining the two most similar clusters at each step, and continuing until there is just one group. While we have defined distances between samples, we have not yet defined distances between groups. There are various ways this can be done and they all rely on the individual pairwise distances. The helpfile for hclust includes detailed information.
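
For example, the group-to-group distance (linkage) is controlled by the method argument of hclust; "complete" is the default, and a quick sketch using average linkage instead would be:

hc_avg <- hclust(d, method="average") ##same pairwise distances, different linkage
hc_avg$method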

We can perform hierarchical clustering based on the distances defined above using the hclust function. This function returns an hclust object that describes the groupings that were created using the algorithm described above. The plot method represents these relationships with a tree or dendrogram:

library(rafalib)
mypar()
hc <- hclust(d)
hc
## 
## Call:
## hclust(d = d)
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 189
plot(hc,labels=tissue,cex=0.5)
Dendrogram showing hierarchical clustering of tissue gene expression data.

Dendrogram showing hierarchical clustering of tissue gene expression data.

Does this technique “discover” the clusters defined by the different tissues? In this plot, it is not easy to see the different tissues, so we add colors by using the myplclust function from the rafalib package.

myplclust(hc, labels=tissue, lab.col=as.fumeric(tissue), cex=0.5)
Dendrogram showing hierarchical clustering of tissue gene expression data with colors denoting tissues.

Dendrogram showing hierarchical clustering of tissue gene expression data with colors denoting tissues.

Visually, it does seem as if the clustering technique has discovered the tissues. However, hierarchical clustering does not define specific clusters, but rather defines the dendrogram above. From the dendrogram we can decipher the distance between any two groups by looking at the height at which the two groups split into two. To define clusters, we need to “cut the tree” at some distance and group all samples that are within that distance into clusters. To visualize this, we draw a horizontal line at the height at which we wish to cut; this line defines the clusters. We use 120 as an example:

myplclust(hc, labels=tissue, lab.col=as.fumeric(tissue),cex=0.5)
abline(h=120)
Dendrogram showing hierarchical clustering of tissue gene expression data with colors denoting tissues. Horizontal line defines actual clusters.

Dendrogram showing hierarchical clustering of tissue gene expression data with colors denoting tissues. Horizontal line defines actual clusters.

If we use the line above to cut the tree into clusters, we can examine how the clusters overlap with the actual tissues:

hclusters <- cutree(hc, h=120)
table(true=tissue, cluster=hclusters)
##              cluster
## true           1  2  3  4  5  6  7  8  9 10 11 12 13 14
##   cerebellum   0  0  0  0 31  0  0  0  2  0  0  5  0  0
##   colon        0  0  0  0  0  0 34  0  0  0  0  0  0  0
##   endometrium  0  0  0  0  0  0  0  0  0  0 15  0  0  0
##   hippocampus  0  0 12 19  0  0  0  0  0  0  0  0  0  0
##   kidney       9 18  0  0  0 10  0  0  2  0  0  0  0  0
##   liver        0  0  0  0  0  0  0 24  0  2  0  0  0  0
##   placenta     0  0  0  0  0  0  0  0  0  0  0  0  2  4

We can also ask cutree to give us back a given number of clusters. The function then automatically finds the height that results in the requested number of clusters:

hclusters <- cutree(hc, k=8)
table(true=tissue, cluster=hclusters)
##              cluster
## true           1  2  3  4  5  6  7  8
##   cerebellum   0  0 31  0  0  2  5  0
##   colon        0  0  0 34  0  0  0  0
##   endometrium 15  0  0  0  0  0  0  0
##   hippocampus  0 12 19  0  0  0  0  0
##   kidney      37  0  0  0  0  2  0  0
##   liver        0  0  0  0 24  2  0  0
##   placenta     0  0  0  0  0  0  0  6

In both cases we do see that, with some exceptions, each tissue is uniquely represented by one of the clusters. In some instances, a single tissue is spread across two clusters, which is due to selecting too many clusters. Selecting the number of clusters is generally a challenging step in practice and an active area of research.

Hierarchical Clustering Exercises

Hierarchical Clustering Exercises #1

Create a random matrix with no correlation in the following way:

set.seed(1)
m = 10000
n = 24
x = matrix(rnorm(m*n),m,n)
colnames(x)=1:n
head(x)
##               1          2          3          4          5          6
## [1,] -0.6264538 -0.8043316  0.2353485  0.6179223 -0.2212571  0.5258908
## [2,]  0.1836433 -1.0565257  0.2448250  0.8935057  0.3517935 -0.4875444
## [3,] -0.8356286 -1.0353958 -0.6421869 -0.4277562  0.1606019  1.1382508
## [4,]  1.5952808 -1.1855604 -1.9348085 -0.2999012 -0.1240523  1.2151344
## [5,]  0.3295078 -0.5004395  1.0386957 -0.5319833  0.6598739 -0.4248307
## [6,] -0.8204684 -0.5249887 -0.2835501  1.7059816 -0.5038493 -1.4508403
##               7           8           9          10         11
## [1,]  0.3413341 -1.00203611 -1.55915937 -0.09504307  0.7914415
## [2,]  0.4136665  0.02590761  0.20166217 -0.38805939  0.3921679
## [3,]  0.1220357 -0.44814178  1.04017610  2.13657003 -0.4726670
## [4,] -1.5893806  0.84323332  0.07195772  0.55661945 -0.4579517
## [5,] -0.7874385 -0.21846310 -0.01526544 -0.59094164 -0.1681319
## [6,] -1.5920640  0.47678629  0.33938598  1.52014345  0.5856737
##                 12         13         14           15         16
## [1,]  0.0145724495  0.8663644 -0.1604426  0.249371112 -1.2520152
## [2,]  1.7854043337 -0.9476952 -0.9241849 -0.335346796 -0.3351313
## [3,]  0.0002997544  0.4522428 -1.5561751  0.004405287  0.1080678
## [4,]  0.4356948690  0.2782408  0.8812202  0.986768348 -0.4717051
## [5,] -1.4076452475  1.4175945 -0.5263595  0.543575705  2.5070607
## [6,] -0.6929698698  0.6329981 -0.4627372 -0.626142823 -1.2451344
##              17         18          19          20         21          22
## [1,]  2.3150930  0.4414410  0.82485558  1.37086468 -0.2240021  0.07267180
## [2,]  1.0603800  0.4130862  0.74402087  0.54569610 -0.8039004 -0.75752112
## [3,] -0.3970672  0.8660777 -0.69009734 -1.62446330  1.1393495  0.76525702
## [4,]  0.4840034 -2.2615708  1.76900681 -0.06247283  0.9129255 -0.04403916
## [5,] -1.3584146 -0.1787018 -0.55215640  0.57021255  0.3448395  0.76030091
## [6,]  0.6370574 -0.4713582 -0.03257056 -0.31350574  0.7719104  0.40134573
##              23         24
## [1,]  1.3641735  0.4431154
## [2,]  0.2159377  0.9536824
## [3,]  1.0729733 -0.8278467
## [4,]  1.2573367  0.2431909
## [5,] -0.7143611  0.6492437
## [6,] -0.1038661 -1.4455379

Run hierarchical clustering on this data with the hclust function with default parameters to cluster the columns. Create a dendrogram.

From the dendrogram which pairs of samples are the furthest away from each other?

hc = hclust( dist( t(x)))
plot(hc)

A:19 & 14

Hierarchical Clustering Exercises #2

Set the seed at 1, set.seed(1), and replicate the creation of this matrix 100 times, each time performing hierarchical clustering as in the solution to the question above and finding the number of clusters you get if you use cutree at height 143. Note that this number is a random variable.

Based on the Monte Carlo simulation, what is the standard error of this random variable?

set.seed(1)
m = 10000
n = 24
nc = replicate(100,{
x = matrix(rnorm(m*n),m,n)
hc = hclust( dist( t(x)))
length(unique(cutree(hc,h=143)))
})
plot(table(nc)) ## look at the distribution

popsd(nc)
## [1] 0.8986657

A:0.8986657

K-means

We can also cluster with the kmeans function to perform k-means clustering. As an example, let’s run k-means on the samples in the space of the first two genes:

set.seed(1)
km <- kmeans(t(e[1:2,]), centers=7)
names(km)
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
table(tissue,clusters=km$cluster)
##              clusters
## tissue         1  2  3  4  5  6  7
##   cerebellum   0  1  8  0  6  0 23
##   colon        2 11  2 15  4  0  0
##   endometrium  0  3  4  0  0  0  8
##   hippocampus 19  0  2  0 10  0  0
##   kidney       7  8 20  0  0  0  4
##   liver        0  0  0  0  0 18  8
##   placenta     0  4  0  0  0  0  2
mypar(1,2)
plot(e[1,], e[2,], col=as.fumeric(tissue), pch=16)
plot(e[1,], e[2,], col=km$cluster, pch=16)
Plot of gene expression for first two genes (order of appearance in data) with color representing tissue (left) and clusters found with kmeans (right).

Plot of gene expression for first two genes (order of appearance in data) with color representing tissue (left) and clusters found with kmeans (right).

In the first plot, color represents the actual tissues, while in the second, color represents the clusters that were defined by kmeans. We can see from tabulating the results that this particular clustering exercise did not perform well:

table(true=tissue,cluster=km$cluster)
##              cluster
## true           1  2  3  4  5  6  7
##   cerebellum   0  1  8  0  6  0 23
##   colon        2 11  2 15  4  0  0
##   endometrium  0  3  4  0  0  0  8
##   hippocampus 19  0  2  0 10  0  0
##   kidney       7  8 20  0  0  0  4
##   liver        0  0  0  0  0 18  8
##   placenta     0  4  0  0  0  0  2

This is very likely due to the fact that the first two genes are not informative regarding tissue type. We can see this in the first plot above. If we instead perform k-means clustering using all of the genes, we obtain a much improved result. To visualize this, we can use an MDS plot:

km <- kmeans(t(e), centers=7)
d<-dist(t(e))
mds <- cmdscale(d)

mypar(1,2)
plot(mds[,1], mds[,2]) 
plot(mds[,1], mds[,2], col=km$cluster, pch=16)
Plot of gene expression for first two PCs with color representing tissues (left) and clusters found using all genes (right).

Plot of gene expression for first two PCs with color representing tissues (left) and clusters found using all genes (right).

By tabulating the results, we see that we obtain a similar answer to that obtained with hierarchical clustering.

table(true=tissue,cluster=km$cluster)
##              cluster
## true           1  2  3  4  5  6  7
##   cerebellum   0  0  5  0 31  2  0
##   colon        0 34  0  0  0  0  0
##   endometrium  0 15  0  0  0  0  0
##   hippocampus  0  0 31  0  0  0  0
##   kidney       0 37  0  0  0  2  0
##   liver        2  0  0  0  0  0 24
##   placenta     0  0  0  6  0  0  0

K-means Exercises

K-means Exercises #1

Run kmeans with 5 centers for the blood RNA data:

library(GSE5859Subset)
data(GSE5859Subset)

Set the seed to 10, set.seed(10) right before running kmeans with 5 centers.

Explore the relationship of clusters and information in sampleInfo. Which of the following best describes what you find?

mds=cmdscale(dist(t(geneExpression)))
set.seed(10)
result=kmeans(t(geneExpression),5)
mypar(1,1)
plot(mds,bg=result$cl,pch=21)

table(sampleInfo$group,result$cluster)
##    
##     1 2 3 4 5
##   0 4 1 2 1 4
##   1 1 6 3 0 2
table(sampleInfo$date,result$cluster)
##             
##              1 2 3 4 5
##   2005-06-10 0 0 0 1 0
##   2005-06-23 5 0 0 0 1
##   2005-06-27 0 0 0 0 5
##   2005-10-07 0 5 5 0 0
##   2005-10-28 0 2 0 0 0
##looks better if we re-order:
table(sampleInfo$date,result$cluster)[,c(4,1,5,3,2)]
##             
##              4 1 5 3 2
##   2005-06-10 1 0 0 0 0
##   2005-06-23 0 5 1 0 0
##   2005-06-27 0 0 5 0 0
##   2005-10-07 0 0 0 5 5
##   2005-10-28 0 0 0 0 2

A:Date is driving the clusters

Heatmaps in R

Heatmaps are ubiquitous in the genomics literature. They are very useful plots for visualizing the measurements for a subset of rows over all the samples. Dendrograms created with hierarchical clustering are added on top and on the side. We will demonstrate how to create heatmaps from within R.

library(tissuesGeneExpression)
data(tissuesGeneExpression)
image(e[1:100,])

Let’s begin by defining a color palette:

library(RColorBrewer) 
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)

Now, pick the genes with the top variance over all samples:

library(Biobase)
library(genefilter)
library(parathyroidSE)
rv <- rowVars(e)
idx <- order(-rv)[1:40]

While a heatmap function is included in R, we recommend the heatmap.2 function from the gplots package on CRAN because it is a bit more customized. For example, it stretches to fill the window. Here we add colors to indicate the tissue on the top:

library(gplots) ##Available from CRAN
library(rafalib)
cols <- palette(brewer.pal(8, "Dark2"))[as.fumeric(tissue)]
head(cbind(colnames(e),cols))
##                        cols     
## [1,] "GSM11805.CEL.gz" "#1B9E77"
## [2,] "GSM11814.CEL.gz" "#1B9E77"
## [3,] "GSM11823.CEL.gz" "#1B9E77"
## [4,] "GSM11830.CEL.gz" "#1B9E77"
## [5,] "GSM12067.CEL.gz" "#1B9E77"
## [6,] "GSM12075.CEL.gz" "#1B9E77"
heatmap.2(e[idx,], labCol=tissue,
          trace="none", 
          ColSideColors=cols, 
          col=hmcol)
Heatmap created using the 40 most variable genes and the function heatmap.2.

Heatmap created using the 40 most variable genes and the function heatmap.2.

We did not use tissue information to create this heatmap, and we can quickly see, with just 40 genes, good separation across tissues.

Heat Maps Exercises

Heat Maps Exercises #1

Load the data:

library(GSE5859Subset)
data(GSE5859Subset)

Pick the 25 genes with the highest across-sample variance. This function might help:

#install.packages("matrixStats")
library(matrixStats)
## matrixStats v0.51.0 (2016-10-08) successfully loaded. See ?matrixStats for help.
## 
## Attaching package: 'matrixStats'
## The following object is masked from 'package:SummarizedExperiment':
## 
##     rowRanges
## The following objects are masked from 'package:genefilter':
## 
##     rowSds, rowVars
## The following objects are masked from 'package:Biobase':
## 
##     anyMissing, rowMedians
## The following objects are masked from 'package:mosaic':
## 
##     count, iqr
## The following object is masked from 'package:dplyr':
## 
##     count
#?rowMads ##we use mads due to an outlier sample

While a heatmap function is included in R, we recommend the heatmap.2 function from the gplots package on CRAN because it is a bit more customized. For example, it stretches to fill the window.

library(gplots)

Use heatmap.2 to make a heatmap showing the sampleInfo$group with color, the date as labels, the rows labelled with chromosome, and scaling the rows.

What do we learn from this heatmap?

##load libraries
library(rafalib)
library(gplots)
library(matrixStats)
library(RColorBrewer)
##make colors
cols = colorRampPalette(rev(brewer.pal(11,"RdBu")))(25)
gcol=brewer.pal(3,"Dark2")
gcol=gcol[sampleInfo$g+1]

##make labels: remove 2005 since it's common to all
labcol= gsub("2005-","",sampleInfo$date)  

##pick highly variable genes:
sds =rowMads(geneExpression)
ind = order(sds,decreasing=TRUE)[1:25]

## make heatmap
heatmap.2(geneExpression[ind,],
          col=cols,
          trace="none",
          scale="row",
          labRow=geneAnnotation$CHR[ind],
          labCol=labcol,
          ColSideColors=gcol,
          key=FALSE)

A:A group of chrY genes are higher in group 0 and appear to drive the clustering. Within those clusters there appears to be clustering by month.

Heat Maps Exercises #2

Create a large data set of random data that is completely independent of sampleInfo$group like this:

set.seed(17)
m = nrow(geneExpression)
n = ncol(geneExpression)
x = matrix(rnorm(m*n),m,n)
g = factor(sampleInfo$g )

Create two heatmaps with these data. Show the group g either with labels or colors.

  1. Taking the 50 genes with smallest p-values obtained with rowttests
  2. Taking the 50 genes with largest standard deviations.

Which of the following statements is true:

library(gplots)
library(matrixStats)
library(genefilter)
library(RColorBrewer)
cols = colorRampPalette(rev(brewer.pal(11,"RdBu")))(25)

ttest = rowttests(x,g)
sds = rowSds(x)
Indexes = list(t=order(ttest$p.value)[1:50], s=order(-sds)[1:50])
for(ind in Indexes){
 heatmap.2(x[ind,],
          col=cols,
          trace="none",
          scale="row",
          labCol=g,
          key=FALSE)
 }

A:There is no relationship between g and x but with 8,793 tests some will appear significant by chance. Selecting genes with the t-test gives us a deceiving result.


Week 3: Basic Machine Learning: Classification

Conditional Expectations

Conditional Probabilities and Expectations

Prediction problems can be divided into categorical and continuous outcomes. However, many of the algorithms can be applied to both due to the connection between conditional probabilities and conditional expectations.

For categorical data, for example binary outcomes, if we know the probability of \(Y\) being any of the possible outcomes \(k\) given a set of predictors \(X=(X_1,\dots,X_p)^\top\),

\[ f_k(x) = \mbox{Pr}(Y=k \mid X=x) \]

we can optimize our predictions. Specifically, for any \(x\) we predict the \(k\) that has the largest probability \(f_k(x)\).

To simplify the exposition below, we will consider the case of binary data. You can think of the probability \(\mbox{Pr}(Y=1 \mid X=x)\) as the proportion of 1s in the stratum of the population for which \(X=x\). Given that the expectation is the average of all \(Y\) values, in this case the expectation is equivalent to the probability: \(f(x) \equiv \mbox{E}(Y \mid X=x)=\mbox{Pr}(Y=1 \mid X=x)\). We therefore use only the expectation in the descriptions below as it is more general.

In general, the expected value has an attractive mathematical property: it minimizes the expected distance between the predictor \(\hat{Y}\) and \(Y\):

\[ \mbox{E}\{ (\hat{Y} - Y)^2 \mid X=x \} \]
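
As a small simulated sketch of this property, among constant guesses within a stratum, the mean gives the smallest average squared error:

set.seed(1)
yy <- rnorm(10000, mean=2, sd=1) ##pretend these are the Y values in one stratum
guesses <- seq(0, 4, by=0.1)
mse <- sapply(guesses, function(g) mean((g - yy)^2))
guesses[which.min(mse)] ##close to mean(yy), about 2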

Regression in the context of prediction

We use the son and father height example to illustrate how regression can be interpreted as a machine learning technique. In our example, we are trying to predict the son’s height \(Y\) based on the father’s \(X\). Here we have only one predictor. Now if we were asked to predict the height of a randomly selected son, we would go with the average height:

library(rafalib)
mypar(1,1)
data(father.son,package="UsingR")
x=round(father.son$fheight) ##round to nearest inch
y=round(father.son$sheight)
hist(y,breaks=seq(min(y),max(y)))
abline(v=mean(y),col="red",lwd=2)
Histogram of son heights.

Histogram of son heights.

In this case, we can also approximate the distribution of \(Y\) as normal, which implies the mean maximizes the probability density.

Let’s imagine that we are given more information. We are told that the father of this randomly selected son has a height of 71 inches (1.25 SDs taller than the average). What is our prediction now?

mypar(1,2)
plot(x,y,xlab="Father's height in inches",ylab="Son's height in inches",main=paste("correlation =",signif(cor(x,y),2)))
abline(v=c(-0.35,0.35)+71,col="red")
hist(y[x==71],xlab="Heights",nc=8,main="",xlim=range(y))
Son versus father height (left) with the red lines denoting the stratum defined by conditioning on fathers being 71 inches tall. Conditional distribution: son height distribution of stratum defined by 71 inch fathers.

Son versus father height (left) with the red lines denoting the stratum defined by conditioning on fathers being 71 inches tall. Conditional distribution: son height distribution of stratum defined by 71 inch fathers.

The best guess is still the expectation, but our stratum has changed from all the data to only those \(Y\) with \(X=71\). So we can stratify and take the average, which is the conditional expectation. Our prediction for any \(x\) is therefore:

\[ f(x) = E(Y \mid X=x) \]

It turns out that because this data is approximated by a bivariate normal distribution, using calculus, we can show that:

\[ f(x) = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X} (X-\mu_X) \]

and if we estimate these five parameters from the sample, we get the regression line:

mypar(1,2)
plot(x,y,xlab="Father's height in inches",ylab="Son's height in inches",main=paste("correlation =",signif(cor(x,y),2)))
abline(v=c(-0.35,0.35)+71,col="red")

fit <- lm(y~x)
abline(fit,col=1)

hist(y[x==71],xlab="Heights",nc=8,main="",xlim=range(y))
abline(v = fit$coef[1] + fit$coef[2]*71, col=1)
Son versus father height showing predicted heights based on regression line (left). Conditional distribution with vertical line representing regression prediction.

Son versus father height showing predicted heights based on regression line (left). Conditional distribution with vertical line representing regression prediction.
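
As a quick check of the formula above (a sketch using the x and y defined earlier), plugging the five sample estimates into the bivariate normal formula reproduces the lm coefficients:

rho <- cor(x,y)
slope <- rho * sd(y) / sd(x) ##rho * sigma_Y / sigma_X
intercept <- mean(y) - slope * mean(x)
c(intercept=intercept, slope=slope)
coef(lm(y~x)) ##the same values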

In this particular case, the regression line provides an optimal prediction function for \(Y\). But this is not generally true because, in typical machine learning problems, the optimal \(f(x)\) is rarely a simple line.

Conditional Expectation Exercises

CONDITIONAL EXPECTATIONS

Throughout this assessment it will be useful to remember that when our data are 0s and 1s, probabilities and expectations are the same thing. We can do the math, but here is an example in the form of R code:

n = 1000
y = rbinom(n,1,0.25)
##proportion of ones Pr(Y)
sum(y==1)/length(y)
## [1] 0.247
##expectaion of Y
mean(y)
## [1] 0.247

Conditional Expectation Exercises #1

Generate some random data to imitate heights for men (0) and women (1):

n = 10000
set.seed(1)
men = rnorm(n,176,7) #height in centimeters
women = rnorm(n,162,7) #height in centimeters
y = c(rep(0,n),rep(1,n))
x = round(c(men,women))
##mix it up
ind = sample(seq(along=y))
y = y[ind]
x = x[ind]

Using the data generated above, what is \(\mbox{E}(Y \mid X=176)\), i.e. the proportion of females among individuals who are 176 cm tall?

mean(y[x==176])
## [1] 0.1049475

A:0.1049475

Conditional Expectation Exercises #2

Now make a plot of \(\mbox{E}(Y \mid X=x)\) for x = 160, 161, ..., 178. Suppose for each height \(x\) we predict 1 (female) when \(\mbox{Pr}(Y=1 \mid X=x) > 0.5\) and 0 (male) otherwise. What is the largest height for which we predict female?

xs = seq(160,178)
pr =sapply(xs,function(x0) mean(y[x==x0]))
plot(xs,pr)
abline(h=0.5)
abline(v=168)

A:168

Class Prediction

Here we give a brief introduction to the main task of machine learning: class prediction. In fact, many refer to class prediction as machine learning and we sometimes use the two terms interchangeably. We give a very brief introduction to this vast topic, focusing on some specific examples.

Some of the examples we give here are motivated by those in the excellent textbook The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Trevor Hastie, Robert Tibshirani and Jerome Friedman. A free PDF of this book can be found at the following URL:

http://statweb.stanford.edu/~tibs/ElemStatLearn/

Similar to inference in the context of regression, Machine Learning (ML) studies the relationships between outcomes \(Y\) and covariates \(X\). In ML, we call \(X\) the predictors or features. The main difference between ML and inference is that, in ML, we are interested mainly in predicting \(Y\) using \(X\). Statistical models are used, but while in inference we estimate and interpret model parameters, in ML they are mainly a means to an end: predicting \(Y\).

Here we introduce the main concepts needed to understand ML, along with two specific algorithms: regression and k nearest neighbors (kNN). Keep in mind that there are dozens of popular algorithms that we do not cover here.

In a previous section, we covered the very simple one-predictor case. However, most of ML is concerned with cases with more than one predictor. For illustration purposes, we move to a case in which \(X\) is two dimensional and \(Y\) is binary. We simulate a situation with a non-linear relationship using an example from the Hastie, Tibshirani and Friedman book. In the plot below, we show the actual values of \(f(x_1,x_2)=E(Y \mid X_1=x_1,X_2=x_2)\) using colors. The following code is used to create a relatively complex conditional probability function. We create the test and train data we use later (code not shown). Here is the plot of \(f(x_1,x_2)\) with red representing values close to 1, blue representing values close to 0, and yellow values in between.

library(rafalib)
library(RColorBrewer)
hmcol <- colorRampPalette(rev(brewer.pal(11, "Spectral")))(100)
mycols=c(hmcol[1],hmcol[100])

set.seed(1)
##create covariates and outcomes
##outcomes are always 50 0s and 50 1s
s2=0.15

##pick means to create a non linear conditional expectation
library(MASS)
M0 <- mvrnorm(10,c(1,0),s2*diag(2)) ##generate 10 means
M1 <- rbind(mvrnorm(3,c(1,1),s2*diag(2)),
            mvrnorm(3,c(0,1),s2*diag(2)),
            mvrnorm(4,c(0,0),s2*diag(2)))

###function to generate random pairs
s<- sqrt(1/5)
N=200
makeX <- function(M,n=N,sigma=s*diag(2)){
  z <- sample(1:10,n,replace=TRUE) ##pick n at random from above 10
  m <- M[z,] ##these are the n vectors (2 components)
  return(t(apply(m,1,function(mu) mvrnorm(1,mu,sigma)))) ##the final values
}


###create the training set and the test set
x0 <- makeX(M0)##the final values for y=0 (green)
testx0 <- makeX(M0)
x1 <- makeX(M1)
testx1 <-makeX(M1)
x <- rbind(x0,x1) ##one matrix with everything
test <- rbind(testx0,testx1)
y <- c(rep(0,N),rep(1,N)) #the outcomes
ytest <- c(rep(0,N),rep(1,N))
cols <- mycols[c(rep(1,N),rep(2,N))]
colstest <- cols

##Create a grid so we can predict all of X,Y
GS <- 150 ##grid size is GS x GS
XLIM <- c(min(c(x[,1],test[,1])),max(c(x[,1],test[,1])))
tmpx <- seq(XLIM[1],XLIM[2],len=GS)
YLIM <- c(min(c(x[,2],test[,2])),max(c(x[,2],test[,2])))
tmpy <- seq(YLIM[1],YLIM[2],len=GS)
newx <- expand.grid(tmpx,tmpy) #grid used to show color contour of predictions

###Bayes rule: best possible answer
p <- function(x){ ##probability of Y given X
  p0 <- mean(dnorm(x[1],M0[,1],s)*dnorm(x[2],M0[,2],s))
  p1 <- mean(dnorm(x[1],M1[,1],s)*dnorm(x[2],M1[,2],s))
  p1/(p0+p1)
}

###Create the bayesrule prediction
bayesrule <- apply(newx,1,p)
colshat <- bayesrule

colshat <- hmcol[floor(bayesrule*100)+1]

mypar()
plot(x,type="n",xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)
points(newx,col=colshat,pch=16,cex=0.35)
Probability of Y=1 as a function of X1 and X2. Red is close to 1, yellow close to 0.5, and blue close to 0.

Probability of Y=1 as a function of X1 and X2. Red is close to 1, yellow close to 0.5, and blue close to 0.

If we show points for which \(E(Y \mid X=x)>0.5\) in red and the rest in blue, we see the boundary at which we switch from predicting 0 to 1.

mypar()

colshat[bayesrule>=0.5] <- mycols[2]
colshat[bayesrule<0.5] <- mycols[1]

plot(x,type="n",xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)
points(newx,col=colshat,pch=16,cex=0.35)
contour(tmpx,tmpy,matrix(round(bayesrule),GS,GS),levels=c(1,2),
        add=TRUE,drawlabels=FALSE)
Bayes rule. The line divides the part of the space for which the probability is larger than 0.5 (red) from the part where it is lower than 0.5 (blue).

Bayes rule. The line divides the part of the space for which the probability is larger than 0.5 (red) from the part where it is lower than 0.5 (blue).

The above plots relate to the “truth” that we do not get to see. Most ML methodology is concerned with estimating \(f(x)\). A typical first step is usually to consider a sample, referred to as the training set, to estimate \(f(x)\). We will review two specific ML techniques. First, we need to review the main concept we use to evaluate the performance of these methods.

Training and test sets

In the code (not shown) for the first plot in this chapter, we created a test and a training set. We plot them here:

#x, test, cols, and coltest were created in code that was not shown
#x is training x1 and x2, test is test x1 and x2
#cols (0=blue, 1=red) are training observations
#coltests are test observations
mypar(1,2)
plot(x,pch=21,bg=cols,xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)
plot(test,pch=21,bg=colstest,xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)
Training data (left) and test data (right)

Training data (left) and test data (right)

You will notice that the test and train sets have similar global properties, since they were generated by the same random variables (more blue towards the bottom right), but are, by construction, different. The reason we create test and training sets is to detect over-training by testing on different data than the data used to fit the models or train the algorithms. We will see how important this is below.

Predicting with regression

A first naive approach to this ML problem is to fit a two variable linear regression model:

##x and y were created in the code (not shown) for the first plot
#y is outcome for the training set
X1 <- x[,1] ##these are the covariates
X2 <- x[,2] 
fit1 <- lm(y~X1+X2)

Once we have the fitted values, we can estimate \(f(x_1,x_2)\) with \(\hat{f}(x_1,x_2)=\hat{\beta}_0 + \hat{\beta}_1x_1 +\hat{\beta}_2 x_2\). To provide an actual prediction, we simply predict 1 when \(\hat{f}(x_1,x_2)>0.5\). We now examine the error rates in the test and training sets and also plot the boundary region:

##prediction on train
yhat <- predict(fit1)
yhat <- as.numeric(yhat>0.5)
cat("Linear regression prediction error in train:",1-mean(yhat==y),"\n")
## Linear regression prediction error in train: 0.295

We can quickly obtain predicted values for any set of values using the predict function:

yhat <- predict(fit1,newdata=data.frame(X1=newx[,1],X2=newx[,2]))

Now we can create a plot showing where we predict 1s and where we predict 0s, as well as the boundary. We can also use the predict function to obtain predicted values for our test set. Note that nowhere do we fit the model on the test set:

colshat <- yhat
colshat[yhat>=0.5] <- mycols[2]
colshat[yhat<0.5] <- mycols[1]
m <- -fit1$coef[2]/fit1$coef[3] #boundary slope
b <- (0.5 - fit1$coef[1])/fit1$coef[3] #boundary intercept

##prediction on test
yhat <- predict(fit1,newdata=data.frame(X1=test[,1],X2=test[,2]))
yhat <- as.numeric(yhat>0.5)
cat("Linear regression prediction error in test:",1-mean(yhat==ytest),"\n")
## Linear regression prediction error in test: 0.3075
plot(test,type="n",xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)
abline(b,m)
points(newx,col=colshat,pch=16,cex=0.35)

##test was created in the code (not shown) for the first plot
points(test,bg=cols,pch=21)
We estimate the probability of 1 with a linear regression model with X1 and X2 as predictors. The resulting prediction map is divided into parts that are larger than 0.5 (red) and lower than 0.5 (blue).

We estimate the probability of 1 with a linear regression model with X1 and X2 as predictors. The resulting prediction map is divided into parts that are larger than 0.5 (red) and lower than 0.5 (blue).

The error rates in the test and train sets are quite similar. Thus, we do not seem to be over-training. This is not surprising as we are fitting a model with only three parameters to 400 data points. However, note that the boundary is a line. Because we are fitting a plane to the data, there is no other option here. The linear regression method is too rigid. The rigidity makes it stable and avoids over-training, but it also keeps the model from adapting to the non-linear relationship between \(Y\) and \(X\). We discuss this further in the smoothing section. The next ML technique we consider is similar to the smoothing techniques described there.

K-nearest neighbor

K-nearest neighbors (kNN) is similar to bin smoothing, but it is easier to adapt to multiple dimensions. Basically, for any point \(x\) for which we want an estimate, we look for the k nearest points and then take an average of these points. This gives us an estimate of \(f(x_1,x_2)\), just like the bin smoother gave us an estimate of a curve. We can now control flexibility through \(k\). Here we compare \(k=1\) and \(k=100\).

library(class)
mypar(2,2)
for(k in c(1,100)){
  ##predict on train
  yhat <- knn(x,x,y,k=k)
  cat("KNN prediction error in train:",1-mean((as.numeric(yhat)-1)==y),"\n")
  ##make plot
  yhat <- knn(x,test,y,k=k)
  cat("KNN prediction error in test:",1-mean((as.numeric(yhat)-1)==ytest),"\n")
}
## KNN prediction error in train: 0 
## KNN prediction error in test: 0.375 
## KNN prediction error in train: 0.2425 
## KNN prediction error in test: 0.2825

To visualize why we make no errors in the train set and many errors in the test set when \(k=1\), and obtain more stable results with \(k=100\), we show the prediction regions:

library(class)
mypar(2,2)
for(k in c(1,100)){
  ##predict on train
  yhat <- knn(x,x,y,k=k)
  ##make plot
  yhat <- knn(x,newx,y,k=k)
  colshat <- mycols[as.numeric(yhat)]
  plot(x,type="n",xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)
  points(newx,col=colshat,cex=0.35,pch=16)
  contour(tmpx,tmpy,matrix(as.numeric(yhat),GS,GS),levels=c(1,2),
          add=TRUE,drawlabels=FALSE)
  points(x,bg=cols,pch=21)
  title(paste("Train: KNN (",k,")",sep=""))
  
  plot(test,type="n",xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)
  points(newx,col=colshat,cex=0.35,pch=16)
  contour(tmpx,tmpy,matrix(as.numeric(yhat),GS,GS),levels=c(1,2),
          add=TRUE,drawlabels=FALSE)
  points(test,bg=cols,pch=21)
  title(paste("Test: KNN (",k,")",sep=""))
}
Prediction regions obtained with kNN for k=1 (top) and k=100 (bottom). We show both train (left) and test data (right).

Prediction regions obtained with kNN for k=1 (top) and k=100 (bottom). We show both train (left) and test data (right).

When \(k=1\), we make no mistakes in the training set since every point is its own closest neighbor and it is equal to itself. However, we see some islands of blue in the red area that, once we move to the test set, are more error prone. In the case \(k=100\), we do not have this problem and we also see that we improve the error rate over linear regression. We can also see that our estimate of \(f(x_1,x_2)\) is closer to the truth.

Bayes rule

Here we include a comparison of the test and train set errors for various values of \(k\). We also include the error rate that we would make if we actually knew \(\mbox{E}(Y \mid X_1=x_1,X_2=x_2)\), referred to as Bayes Rule.

We start by computing the error rates…

###Bayes Rule
yhat <- apply(test,1,p)
cat("Bayes rule prediction error in train",1-mean(round(yhat)==y),"\n")
## Bayes rule prediction error in train 0.2775
bayes.error=1-mean(round(yhat)==y)
train.error <- rep(0,16)
test.error <- rep(0,16)
for(k in seq(along=train.error)){
  ##predict on train
  yhat <- knn(x,x,y,k=2^(k/2))
  train.error[k] <- 1-mean((as.numeric(yhat)-1)==y)
  ##prediction on test    
  yhat <- knn(x,test,y,k=2^(k/2))
  test.error[k] <- 1-mean((as.numeric(yhat)-1)==y)
}

… and then plot the error rates against values of \(k\). We also show the Bayes rules error rate as a horizontal line.

ks <- 2^(seq(along=train.error)/2)
mypar()
plot(ks,train.error,type="n",xlab="K",ylab="Prediction Error",log="x",ylim=range(c(test.error,train.error)))
lines(ks,train.error,type="b",col=4,lty=2,lwd=2)
lines(ks,test.error,type="b",col=5,lty=3,lwd=2)
abline(h=bayes.error,col=6)
legend("bottomright",c("Train","Test","Bayes"),col=c(4,5,6),lty=c(2,3,1),box.lwd=0)
Prediction error in train (pink) and test (green) versus number of neighbors. The yellow line represents what one obtains with Bayes Rule.

Prediction error in train (pink) and test (green) versus number of neighbors. The yellow line represents what one obtains with Bayes Rule.

Note that these error rates are random variables and have standard errors. In the next section we describe cross-validation, which helps reduce some of this variability. However, even with this variability, the plot clearly shows the problem of over-fitting when using values lower than 20 and under-fitting when using values above 100.

Smoothing

Smoothing is a very powerful technique used all across data analysis. It is designed to estimate \(f(x)\) when the shape is unknown, but assumed to be smooth. The general idea is to group data points that are expected to have similar expectations and compute the average, or fit a simple parametric model. We illustrate two smoothing techniques using a gene expression example.

The following data are gene expression measurements from replicated RNA samples.

##Following three packages are available from Bioconductor
library(Biobase)
library(SpikeIn)
library(hgu95acdf)
## Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail'
## when loading 'hgu95acdf'
## Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head'
## when loading 'hgu95acdf'
data(SpikeIn95)

We consider the data used in an MA-plot comparing two replicated samples (\(Y\) = log ratios and \(X\) = averages) and down-sample in a way that balances the number of points for different strata of \(X\) (code not shown):

library(rafalib)
mypar()
plot(X,Y)
MAplot comparing gene expression from two arrays.

MAplot comparing gene expression from two arrays.

In the MA plot we see that \(Y\) depends on \(X\). This dependence must be a bias because these are based on replicates, which means \(Y\) should be 0 on average regardless of \(X\). We want to predict \(f(x)=\mbox{E}(Y \mid X=x)\) so that we can remove this bias. Linear regression does not capture the apparent curvature in \(f(x)\):

mypar()
plot(X,Y)
fit <- lm(Y~X)
points(X,Y,pch=21,bg=ifelse(Y>fit$fitted,1,3))
abline(fit,col=2,lwd=4,lty=2)
MAplot comparing gene expression from two arrays with fitted regression line. The two colors represent positive and negative residuals.

MAplot comparing gene expression from two arrays with fitted regression line. The two colors represent positive and negative residuals.

The points above the fitted line (green) and those below (purple) are not evenly distributed. We therefore need an alternative, more flexible approach.

Bin Smoothing

Instead of fitting a line, let’s go back to the idea of stratifying and computing the mean. This is referred to as bin smoothing. The general idea is that the underlying curve is “smooth” enough so that, in small bins, the curve is approximately constant. If we assume the curve is constant, then all the \(Y\) in that bin have the same expected value. For example, in the plot below, we highlight points in a bin centered at 8.6, as well as the points of a bin centered at 12.1, if we use bins of size 1. We also show the fitted mean values for the \(Y\) in those bins with dashed lines (code not shown):

MAplot comparing gene expression from two arrays with bin smoother fit shown for two points.

MAplot comparing gene expression from two arrays with bin smoother fit shown for two points.

By computing this mean for bins around every point, we form an estimate of the underlying curve \(f(x)\). Below we show the procedure happening as we move from the smallest value of \(x\) to the largest. We show 10 intermediate cases as well (code not shown):

Illustration of how bin smoothing estimates a curve, showing 12 steps of the process.

Illustration of how bin smoothing estimates a curve, showing 12 steps of the process.

The final result looks like this (code not shown):

MA-plot with curve obtained with bin smooth shown.

MA-plot with curve obtained with bin smooth shown.

There are several functions in R that implement bin smoothers. One example is ksmooth. However, in practice, we typically prefer methods that use slightly more complicated models than fitting a constant. The final result above, for example, is somewhat wiggly. Methods such as loess, which we explain next, improve on this.
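
As a sketch (reusing the X and Y from the MA-plot above), a box-kernel bin smoother via ksmooth looks like this; a bandwidth of 1 roughly matches the bin size of 1 used in the illustration:

mypar()
plot(X, Y, col="darkgrey", pch=16)
fit_bin <- ksmooth(X, Y, kernel="box", bandwidth=1) ##simple bin smoother
lines(fit_bin$x, fit_bin$y, col="black", lwd=3)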

Loess

Local weighted regression (loess) is similar to bin smoothing in principle. The main difference is that we approximate the local behavior with a line or a parabola. This permits us to expand the bin sizes, which stabilizes the estimates. Below we see lines fitted to two bins that are slightly larger than those we used for the bin smoother (code not shown). We can use larger bins because fitting lines provides slightly more flexibility.

MAplot comparing gene expression from two arrays with bin local regression fit shown for two points.

MAplot comparing gene expression from two arrays with bin local regression fit shown for two points.

As we did for the bin smoother, we show 12 steps of the process that leads to a loess fit (code not shown):

Illustration of how loess estimates a curve. Showing 12 steps of the process.

Illustration of how loess estimates a curve. Showing 12 steps of the process.

The final result is a smoother fit than the bin smoother since we use larger sample sizes to estimate our local parameters (code not shown):

MA-plot with curve obtained with loess.

MA-plot with curve obtained with loess.

The function loess performs this analysis for us:

fit <- loess(Y~X, degree=1, span=1/3)

newx <- seq(min(X),max(X),len=100) 
smooth <- predict(fit,newdata=data.frame(X=newx))

mypar()
plot(X,Y,col="darkgrey",pch=16)
lines(newx,smooth,col="black",lwd=3)
Loess fitted with the loess function.

Loess fitted with the loess function.

There are three other important differences between loess and the typical bin smoother. The first is that rather than keeping the bin size the same, loess keeps the number of points used in the local fit the same. This number is controlled via the span argument, which expects a proportion. For example, if N is the number of data points and span=0.5, then for a given \(x\), loess will use the 0.5*N closest points to \(x\) for the fit. The second difference is that, when fitting the parametric model to obtain \(f(x)\), loess uses weighted least squares, with higher weights for points that are closer to \(x\). The third difference is that loess has the option of fitting the local model robustly. An iterative algorithm is implemented in which, after fitting a model in one iteration, outliers are detected and downweighted for the next iteration. To use this option, we use the argument family="symmetric".
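
For instance, a sketch of the robust option applied to the same data (reusing the newx grid from above) is:

fit_robust <- loess(Y~X, degree=1, span=1/3, family="symmetric")
smooth_robust <- predict(fit_robust, newdata=data.frame(X=newx))
mypar()
plot(X, Y, col="darkgrey", pch=16)
lines(newx, smooth_robust, col="black", lwd=3)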

Smoothing Exercises

Smoothing Exercises #1

n = 10000
set.seed(1)
men = rnorm(n,176,7) #height in centimeters
women = rnorm(n,162,7) #height in centimeters
y = c(rep(0,n),rep(1,n))
x = round(c(men,women))
##mix it up
ind = sample(seq(along=y))
y = y[ind]
x = x[ind]

Set the seed at 5, set.seed(5) and take a random sample of 250 individuals from the population like this:

set.seed(5)
N = 250
ind = sample(length(y),N)
Y = y[ind]
X = x[ind]

Use loess to estimate \(f(x)=\mbox{E}(Y \mid X=x)\) with the default parameters. What do you predict for \(f(168)\)?

fit=loess(Y~X)
predict(fit,newdata=data.frame(X=168))
##         1 
## 0.5480233
##Here is a plot
xs = seq(160,178)
Pr =sapply(xs,function(x0) mean(Y[X==x0]))
plot(xs,Pr)
fitted=predict(fit,newdata=data.frame(X=xs))
lines(xs,fitted)

A:0.5480233

Smoothing Exercises #2

The loess estimate above is a random variable, so we should compute its standard error. Use Monte Carlo simulation to compute the standard error of your estimate of f(168).

Set the seed to 5, set.seed(5), and perform 1000 simulations of the computations performed in question 2.7.1. Report the SE of the loess-based estimate.

##plot plots are optional
set.seed(5)
B = 1000
N = 250
xs = seq(160,178)
plot(xs,xs,ylim=c(0,1),type="l")
res = replicate(B,{
  ind = sample(length(y),N)
  Y = y[ind]
  X = x[ind]
  fit=loess(Y~X)
  ##optional plots
  fitted=predict(fit,newdata=data.frame(X=xs))
  lines(xs,fitted)
  estimate = predict(fit,newdata=data.frame(X=168))
  return(estimate)
  })

library(rafalib)
popsd(res)
## [1] 0.05755689

A:0.05755689

Cross-validation

Here we describe cross-validation: one of the fundamental methods in machine learning for method assessment and picking parameters in a prediction or machine learning task. Suppose we have a set of observations with many features and each observation is associated with a label. We will call this set our training data. Our task is to predict the label of any new samples by learning patterns from the training data. For a concrete example, let’s consider gene expression values, where each gene acts as a feature. We will be given a new set of unlabeled data (the test data) with the task of predicting the tissue type of the new samples.

If we choose a machine learning algorithm with a tunable parameter, we have to come up with a strategy for picking an optimal value for this parameter. We could try some values, and then just choose the one which performs the best on our training data, in terms of the number of errors the algorithm would make if we apply it to the samples we have been given for training. However, we have seen how this leads to over-fitting.

Let’s start by loading the tissue gene expression dataset:

library(tissuesGeneExpression)
data(tissuesGeneExpression)

For illustration purposes, we will drop one of the tissues which doesn’t have many samples:

table(tissue)
## tissue
##  cerebellum       colon endometrium hippocampus      kidney       liver 
##          38          34          15          31          39          26 
##    placenta 
##           6
ind <- which(tissue != "placenta")
y <- tissue[ind]
X <- t( e[,ind] )
dim(X)
## [1]   183 22215
length(y)
## [1] 183

This tissue will not form part of our example.

Now let’s try out k-nearest neighbors for classification, using \(k=5\). What is our average error in predicting the tissue in the training set, when we’ve used the same data for training and for testing?

library(class)
pred <- knn(train =  X, test = X, cl=y, k=5)
mean(y != pred)
## [1] 0

We have no errors in prediction in the training set with \(k=5\). What if we use \(k=1\)?

pred <- knn(train=X, test=X, cl=y, k=1)
mean(y != pred)
## [1] 0

Trying to classify the same observations as we use to train the model can be very misleading. In fact, for k-nearest neighbors, using k=1 will always give 0 classification error in the training set, because we use the single observation to classify itself. The reliable way to get a sense of the performance of an algorithm is to make it give a prediction for a sample it has never seen. Similarly, if we want to know what the best value for a tunable parameter is, we need to see how different values of the parameter perform on samples which are not in the training data.

Cross-validation is a widely-used method in machine learning, which solves this training and test data problem, while still using all the data for testing the predictive accuracy. It accomplishes this by splitting the data into a number of folds. If we have \(N\) folds, then the first step of the algorithm is to train the algorithm using \((N-1)\) of the folds, and test the algorithm’s accuracy on the single left-out fold. This is then repeated \(N\) times until each fold has been used as the test set. If we have \(M\) parameter settings to try out, then this is accomplished in an outer loop, so we have to fit the algorithm a total of \(N \times M\) times.

We will use the createFolds function from the caret package to make 10 folds of our gene expression data, which are balanced over the tissues. Don’t be confused by the fact that the createFolds function uses the same letter ‘k’ as the ‘k’ in k-nearest neighbors. These ‘k’ are totally unrelated. The caret function createFolds is asking for how many folds to create, the \(N\) from above. The ‘k’ in the knn function is for how many closest observations to use in classifying a new sample. Here we will create 10 folds:

library(caret)
set.seed(1)
idx <- createFolds(y, k=10)
sapply(idx, length)
## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10 
##     18     19     17     17     18     20     19     19     20     16

The folds are returned as a list of numeric indices. The first fold of data is therefore:

y[idx[[1]]] ##the labels
##  [1] "kidney"      "kidney"      "hippocampus" "hippocampus" "hippocampus"
##  [6] "cerebellum"  "cerebellum"  "cerebellum"  "colon"       "colon"      
## [11] "colon"       "colon"       "kidney"      "kidney"      "endometrium"
## [16] "endometrium" "liver"       "liver"
head( X[idx[[1]], 1:3] ) ##the genes (only showing the first 3 genes...)
##                 1007_s_at  1053_at   117_at
## GSM12075.CEL.gz  9.966782 6.060069 7.644452
## GSM12098.CEL.gz  9.945652 5.927861 7.847192
## GSM21214.cel.gz 10.955428 5.776781 7.493743
## GSM21218.cel.gz 10.757734 5.984170 8.525524
## GSM21230.cel.gz 11.496114 5.760156 7.787561
## GSM87086.cel.gz  9.798633 5.862426 7.279199

We can see that, in fact, the tissues are fairly equally represented across the 10 folds:

sapply(idx, function(i) table(y[i]))
##             Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09
## cerebellum       3      4      4      4      4      4      4      4      4
## colon            4      3      3      3      4      4      3      3      4
## endometrium      2      2      1      1      1      2      1      2      2
## hippocampus      3      3      3      3      3      3      4      3      3
## kidney           4      4      3      4      4      4      4      4      4
## liver            2      3      3      2      2      3      3      3      3
##             Fold10
## cerebellum       3
## colon            3
## endometrium      1
## hippocampus      3
## kidney           4
## liver            2

Because tissues have very different gene expression profiles, predicting tissue with all genes will be very easy. For illustration purposes we will try to predict tissue type with just two dimensional data. We will reduce the dimension of our data using cmdscale:

library(rafalib)
mypar()
Xsmall <- cmdscale(dist(X))
plot(Xsmall,col=as.fumeric(y))
legend("topleft",levels(factor(y)),fill=seq_along(levels(factor(y))))
First two PCs of the tissue gene expression data with color representing tissue. We use these two PCs as our two predictors throughout.

First two PCs of the tissue gene expression data with color representing tissue. We use these two PCs as our two predictors throughout.

Now we can try out the k-nearest neighbors method on a single fold. We provide the knn function with all the samples in Xsmall except those which are in the first fold. We remove these samples using the code -idx[[1]] inside the square brackets. We then use those samples in the test set. The cl argument is for the true classifications or labels (here, tissue) of the training data. We use 5 observations to classify in our k-nearest neighbor algorithm:

pred <- knn(train=Xsmall[ -idx[[1]] , ], test=Xsmall[ idx[[1]], ], cl=y[ -idx[[1]] ], k=5)
table(true=y[ idx[[1]] ], pred)
##              pred
## true          cerebellum colon endometrium hippocampus kidney liver
##   cerebellum           2     0           0           1      0     0
##   colon                0     4           0           0      0     0
##   endometrium          0     0           1           0      1     0
##   hippocampus          1     0           0           2      0     0
##   kidney               0     0           0           0      4     0
##   liver                0     0           0           0      0     2
mean(y[ idx[[1]] ] != pred)
## [1] 0.1666667

Now we have some misclassifications. How well do we do for the rest of the folds?

for (i in 1:10) {
  pred <- knn(train=Xsmall[ -idx[[i]] , ], test=Xsmall[ idx[[i]], ], cl=y[ -idx[[i]] ], k=5)
  print(paste0(i,") error rate: ", round(mean(y[ idx[[i]] ] != pred),3)))
}
## [1] "1) error rate: 0.167"
## [1] "2) error rate: 0.105"
## [1] "3) error rate: 0.118"
## [1] "4) error rate: 0.118"
## [1] "5) error rate: 0.278"
## [1] "6) error rate: 0.05"
## [1] "7) error rate: 0.105"
## [1] "8) error rate: 0.211"
## [1] "9) error rate: 0.15"
## [1] "10) error rate: 0.312"

So we can see there is some variation for each fold, with error rates hovering around 0.1-0.3. But is k=5 the best setting for the k parameter? In order to explore the best setting for k, we need to create an outer loop, where we try different values for k, and then calculate the average test set error across all the folds.

We will try out each value of k from 1 to 12. Instead of using two for loops, we will use sapply:

set.seed(1)
ks <- 1:12
res <- sapply(ks, function(k) {
  ##try out each version of k from 1 to 12
  res.k <- sapply(seq_along(idx), function(i) {
    ##loop over each of the 10 cross-validation folds
    ##predict the held-out samples using k nearest neighbors
    pred <- knn(train=Xsmall[ -idx[[i]], ],
                test=Xsmall[ idx[[i]], ],
                cl=y[ -idx[[i]] ], k = k)
    ##the ratio of misclassified samples
    mean(y[ idx[[i]] ] != pred)
  })
  ##average over the 10 folds
  mean(res.k)
})

Now for each value of k, we have an associated test set error rate from the cross-validation procedure.

res
##  [1] 0.1978212 0.1703423 0.1882933 0.1750989 0.1613291 0.1500791 0.1552670
##  [8] 0.1884813 0.1822020 0.1763197 0.1761318 0.1813197

We can then plot the error rate for each value of k, which helps us to see in what region there might be a minimal error rate:

plot(ks, res, type="o",ylab="misclassification error")
Misclassification error versus number of neighbors.

Misclassification error versus number of neighbors.

Remember, because the training set is a random sample and because our fold-generation procedure involves random number generation, the “best” value of k we pick through this procedure is also a random variable. If we had new training data and if we recreated our folds, we might get a different value for the optimal k.

Finally, to show that gene expression can perfectly predict tissue, we use 5 dimensions instead of 2, which results in perfect prediction:

Xsmall <- cmdscale(dist(X),k=5)
set.seed(1)
ks <- 1:12
res <- sapply(ks, function(k) {
  res.k <- sapply(seq_along(idx), function(i) {
    pred <- knn(train=Xsmall[ -idx[[i]], ],
                test=Xsmall[ idx[[i]], ],
                cl=y[ -idx[[i]] ], k = k)
    mean(y[ idx[[i]] ] != pred)
  })
  mean(res.k)
})
plot(ks, res, type="o",ylim=c(0,0.20),ylab="misclassification error")
Misclassification error versus number of neighbors when we use five dimensions instead of two.

Important note: we applied cmdscale to the entire dataset to create a smaller one for illustration purposes. However, in a real machine learning application, this may result in an underestimation of test set error for small sample sizes, where dimension reduction using the unlabeled full dataset gives a boost in performance. A safer choice would have been to transform the data separately for each fold, by calculating a rotation and dimension reduction using the training set only and applying this to the test set.
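
A minimal sketch of that safer approach, assuming X, y and idx are as defined above for the tissue example: we use prcomp instead of cmdscale, because prcomp returns a rotation that can be applied to held-out samples (for Euclidean distances, classical MDS and principal component scores give essentially the same coordinates, up to sign):

res_safe <- sapply(seq_along(idx), function(i) {
  train <- X[ -idx[[i]], ]
  test  <- X[  idx[[i]], ]
  pc <- prcomp(train)                        ##center and rotation from training samples only
  train5 <- pc$x[, 1:5]                      ##first 5 components of the training set
  test5  <- predict(pc, newdata=test)[, 1:5] ##project held-out samples with the same rotation
  pred <- knn(train=train5, test=test5, cl=y[ -idx[[i]] ], k=5)
  mean(y[ idx[[i]] ] != pred)                ##misclassification rate for this fold
})
mean(res_safe) ##cross-validated error without using held-out samples in the dimension reduction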

kNN and Cross Validation Exercises

Load the following dataset:

library(GSE5859Subset)
data(GSE5859Subset)
#And define the outcome and predictors. To make the problem more difficult we will only consider autosomal genes:

y = factor(sampleInfo$group)
X = t(geneExpression)
out = which(geneAnnotation$CHR%in%c("chrX","chrY"))
X = X[,-out]

#Note, you will also need to load the following:
library(caret)

kNN and Cross Validation Exercises #1

Set the seed to 1 with set.seed(1), then use the createFolds function in the caret package to create 10 folds of y.

What is the 2nd entry in fold 3?

library(caret)
set.seed(1)
idx = createFolds(y, k=10)
idx[[3]][2]
## [1] 15
sapply(idx,function(ind) table(y[ind])) ##make sure every fold has 0s and 1s
##   Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
## 0      1      1      2      2      1      1      1      1      1      1
## 1      1      1      1      1      2      1      1      2      1      1

A:15

kNN and Cross Validation Exercises #2

For the following questions we are going to use kNN. We are going to consider a smaller set of predictors by filtering genes using t-tests. Specifically, we will perform a t-test and select the \(m\) genes with the smallest p-values.

Let \(m=8\) and \(k=5\) and train kNN by leaving out the second fold, idx[[2]].

How many mistakes do we make on the test set? Remember, it is indispensable that you perform the t-test on the training data only.

library(class)
library(genefilter)
m=8
k=5
ind = idx[[2]]
pvals = rowttests(t(X[-ind,]),factor(y[-ind]))$p.val
ind2 = order(pvals)[1:m]
predict=knn(X[-ind,ind2],X[ind,ind2],y[-ind],k=k)
sum(predict!=y[ind])
## [1] 1

A:1

kNN and Cross Validation Exercises #3

Now run the code for kNN and Cross Validation Exercises #2 for all 10 folds and keep track of the errors. What is our error rate (number of errors divided by number of predictions)?

library(class)
library(genefilter)
m=8
k=5
result = sapply(idx,function(ind){
  pvals = rowttests(t(X[-ind,]),factor(y[-ind]))$p.val
  ind2 = order(pvals)[1:m]
  predict=knn(X[-ind,ind2],X[ind,ind2],y[-ind],k=k)
  sum(predict!=y[ind])
})
sum(result)/length(y)
## [1] 0.375

A:0.375

kNN and Cross Validation Exercises #4

Now we are going to select the best values of k and m. Use the expand.grid function to try out the following values:

ms=2^c(1:11)
ks=seq(1,9,2)
params = expand.grid(k=ks,m=ms)

Now use apply or a loop to obtain error rates for each of these pairs of parameters. Which pair of parameters minimizes the error rate?

errors = apply(params,1,function(param){
  k =  param[1]
  m =  param[2]
  result = sapply(idx,function(ind){
    pvals = rowttests(t(X[-ind,]),factor(y[-ind]))$p.val
    ind2 = order(pvals)[1:m]
    predict=knn(X[-ind,ind2],X[ind,ind2],y[-ind],k=k)
    sum(predict!=y[ind])
  })
  sum(result)/length(y)
  })
params[which.min(errors),]
##    k    m
## 47 3 1024
##make a plot and confirm it's just one minimum:
errors = matrix(errors,5,11)
library(rafalib)
mypar(1,1)
matplot(ms,t(errors),type="l",log="x")
legend("topright",as.character(ks),lty=seq_along(ks),col=seq_along(ks))

A:k=3
m=1024

kNN and Cross Validation Exercises #5

Repeat kNN and Cross Validation Exercises #4, but now perform the t-test filtering before the cross-validation. Note how this biases the entire result and gives us much lower estimated error rates.

What is the minimum error rate?

pvals = rowttests(t(X),factor(y))$p.val
errors = apply(params,1,function(param){
  k =  param[1]
  m =  param[2]
  result = sapply(idx,function(ind){
    ind2 = order(pvals)[1:m]
    predict=knn(X[-ind,ind2],X[ind,ind2],y[-ind],k=k)
    sum(predict!=y[ind])
  })
  sum(result)/length(y)
  })
min(errors)
## [1] 0.08333333
##make a plot and compare to previous question
errors = matrix(errors,5,11)
library(rafalib)
mypar(1,1)
matplot(ms,t(errors),type="l",log="x")
legend("topright",as.character(ks),lty=seq_along(ks),col=seq_along(ks))

A:0.08333333

kNN and Cross Validation Exercises #6

Repeat the cross-validation we performed in kNN and Cross Validation Exercises #4, but now, instead of defining y as sampleInfo$group, use:

y = factor(as.numeric(format( sampleInfo$date, "%m")=="06"))

What is the minimum error rate now?

errors = apply(params,1,function(param){
  k =  param[1]
  m =  param[2]
  result = sapply(idx,function(ind){
    pvals = rowttests(t(X[-ind,]),factor(y[-ind]))$p.val
    ind2 = order(pvals)[1:m]
    predict=knn(X[-ind,ind2],X[ind,ind2],y[-ind],k=k)
    sum(predict!=y[ind])
  })
  sum(result)/length(y)
  })
min(errors)
## [1] 0
##make a plot and confirm it's just one minimum:
errors = matrix(errors,5,11)
library(rafalib)
mypar(1,1)
matplot(ms,t(errors),type="l",log="x")
legend("topright",as.character(ks),lty=seq_along(ks),col=seq_along(ks))

A:0

Note that we achieve a much lower error rate when predicting date than when predicting group. Because group is confounded with date, it is very possible that these predictors have no information about group, and that the better-than-0.5 error rates we obtained for group are due to the confounding with date. We will learn more about this in the batch effects section.
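
One way to see this confounding directly (a small check; output not shown) is to cross-tabulate the group labels against the month extracted from the processing date:

month <- format(sampleInfo$date, "%m")
table(group=sampleInfo$group, month=month)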

Resources

https://github.com/genomicsclass

HTML book

devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.3.2 (2016-10-31)
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_India.1252          
##  tz       Asia/Calcutta               
##  date     2017-01-08
## Packages ------------------------------------------------------------------
##  package               * version  date      
##  affy                  * 1.52.0   2016-10-18
##  affyio                  1.44.0   2016-10-18
##  annotate                1.52.1   2016-12-22
##  AnnotationDbi           1.36.0   2016-10-18
##  assertthat              0.1      2013-12-06
##  backports               1.0.4    2016-10-24
##  base64enc             * 0.1-3    2015-07-28
##  Biobase               * 2.34.0   2016-10-18
##  BiocGenerics          * 0.20.0   2016-10-18
##  BiocInstaller           1.24.0   2016-10-18
##  bitops                  1.0-6    2013-08-17
##  car                     2.1-4    2016-12-02
##  caret                 * 6.0-73   2016-11-10
##  caTools                 1.17.1   2014-09-10
##  class                 * 7.3-14   2015-08-30
##  codetools               0.2-15   2016-10-05
##  colorspace              1.3-2    2016-12-14
##  curl                  * 2.3      2016-11-24
##  DataComputing         * 0.8.3    2016-03-19
##  DBI                     0.5-1    2016-09-10
##  devtools                1.12.0   2016-06-24
##  digest                  0.6.11   2017-01-03
##  dplyr                 * 0.5.0    2016-06-24
##  evaluate                0.10     2016-10-11
##  foreach                 1.4.3    2015-10-13
##  gdata                   2.17.0   2015-07-04
##  genefilter            * 1.56.0   2016-10-18
##  GenomeInfoDb          * 1.10.2   2016-12-29
##  GenomicRanges         * 1.26.2   2017-01-02
##  ggdendro                0.1-20   2016-04-27
##  ggplot2               * 2.2.1    2016-12-30
##  gplots                * 3.0.1    2016-03-30
##  gridExtra               2.2.1    2016-02-29
##  GSE5859Subset         * 1.0      2016-05-09
##  gtable                  0.2.0    2016-02-26
##  gtools                  3.5.0    2015-05-29
##  hgu95acdf             * 2.18.0   2017-01-08
##  highr                   0.6      2016-05-09
##  htmltools               0.3.5    2016-03-21
##  IRanges               * 2.8.1    2016-11-08
##  iterators               1.0.8    2015-10-13
##  KernSmooth              2.23-15  2015-06-29
##  knitr                 * 1.15.1   2016-11-22
##  lattice               * 0.20-34  2016-09-06
##  lazyeval                0.2.0    2016-06-12
##  lme4                    1.1-12   2016-04-16
##  lubridate             * 1.6.0    2016-09-13
##  magrittr                1.5      2014-11-22
##  manipulate            * 1.0.1    2014-12-24
##  MASS                  * 7.3-45   2016-04-21
##  Matrix                * 1.2-7.1  2016-09-01
##  MatrixModels            0.4-1    2015-08-22
##  matrixStats           * 0.51.0   2016-10-09
##  memoise                 1.0.0    2016-01-29
##  mgcv                    1.8-15   2016-09-14
##  mime                    0.5      2016-07-07
##  minqa                   1.2.4    2014-10-09
##  ModelMetrics            1.1.0    2016-08-26
##  mosaic                * 0.14.4   2016-07-29
##  mosaicData            * 0.14.0   2016-06-17
##  munsell                 0.4.3    2016-02-13
##  nlme                    3.1-128  2016-05-10
##  nloptr                  1.0.4    2014-08-04
##  nnet                    7.3-12   2016-02-02
##  parathyroidSE         * 1.12.0   2017-01-08
##  pbkrtest                0.4-6    2016-01-27
##  plyr                    1.8.4    2016-06-08
##  preprocessCore          1.36.0   2016-10-18
##  quantreg                5.29     2016-09-04
##  R6                      2.2.0    2016-10-05
##  rafalib               * 1.0.0    2015-08-09
##  RColorBrewer          * 1.1-2    2014-12-07
##  Rcpp                    0.12.8   2016-11-17
##  RCurl                   1.95-4.8 2016-03-01
##  reshape2                1.4.2    2016-10-22
##  rmarkdown               1.3      2016-12-21
##  rprojroot               1.1      2016-10-29
##  RSQLite                 1.1-1    2016-12-10
##  S4Vectors             * 0.12.1   2016-12-01
##  scales                  0.4.1    2016-11-09
##  SparseM                 1.74     2016-11-10
##  SpikeIn               * 1.16.0   2017-01-08
##  stringi                 1.1.2    2016-10-01
##  stringr               * 1.1.0    2016-08-19
##  SummarizedExperiment  * 1.4.0    2016-10-18
##  survival                2.40-1   2016-10-30
##  tibble                  1.2      2016-08-26
##  tidyr                 * 0.6.0    2016-08-12
##  tissuesGeneExpression * 1.0      2016-05-13
##  withr                   1.0.2    2016-06-20
##  XML                     3.98-1.5 2016-11-10
##  xtable                  1.8-2    2016-02-05
##  XVector                 0.14.0   2016-10-18
##  yaml                    2.1.14   2016-11-12
##  zlibbioc                1.20.0   2016-10-18
##  source                                              
##  Bioconductor                                        
##  Bioconductor                                        
##  Bioconductor                                        
##  Bioconductor                                        
##  CRAN (R 3.3.0)                                      
##  CRAN (R 3.3.1)                                      
##  CRAN (R 3.3.0)                                      
##  Bioconductor                                        
##  Bioconductor                                        
##  Bioconductor                                        
##  CRAN (R 3.3.0)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.3.0)                                      
##  CRAN (R 3.2.2)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.3.2)                                      
##  Github (DataComputing/DataComputing@fb983c7)        
##  CRAN (R 3.3.1)                                      
##  CRAN (R 3.3.1)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.3.1)                                      
##  CRAN (R 3.3.1)                                      
##  CRAN (R 3.3.1)                                      
##  CRAN (R 3.2.1)                                      
##  Bioconductor                                        
##  Bioconductor                                        
##  Bioconductor                                        
##  CRAN (R 3.2.5)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.2.4)                                      
##  CRAN (R 3.2.3)                                      
##  Github (genomicsclass/GSE5859Subset@8ada5f4)        
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.2.0)                                      
##  Bioconductor                                        
##  CRAN (R 3.3.0)                                      
##  CRAN (R 3.2.4)                                      
##  Bioconductor                                        
##  CRAN (R 3.2.2)                                      
##  CRAN (R 3.2.1)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.3.0)                                      
##  CRAN (R 3.2.5)                                      
##  CRAN (R 3.3.1)                                      
##  CRAN (R 3.2.1)                                      
##  CRAN (R 3.2.0)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.2.2)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.3.1)                                      
##  CRAN (R 3.2.1)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.3.1)                                      
##  CRAN (R 3.3.0)                                      
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.3.0)                                      
##  CRAN (R 3.2.1)                                      
##  CRAN (R 3.2.3)                                      
##  Bioconductor                                        
##  CRAN (R 3.2.3)                                      
##  CRAN (R 3.3.0)                                      
##  Bioconductor                                        
##  CRAN (R 3.3.1)                                      
##  CRAN (R 3.3.1)                                      
##  CRAN (R 3.2.2)                                      
##  CRAN (R 3.2.0)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.3.1)                                      
##  CRAN (R 3.3.1)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.3.2)                                      
##  Bioconductor                                        
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.3.2)                                      
##  Bioconductor                                        
##  CRAN (R 3.3.1)                                      
##  CRAN (R 3.3.1)                                      
##  Bioconductor                                        
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.3.1)                                      
##  CRAN (R 3.3.1)                                      
##  Github (genomicsclass/tissuesGeneExpression@a43cf4b)
##  CRAN (R 3.3.0)                                      
##  CRAN (R 3.3.2)                                      
##  CRAN (R 3.2.3)                                      
##  Bioconductor                                        
##  CRAN (R 3.3.2)                                      
##  Bioconductor
#sessionInfo()$otherPkgs