Source file ⇒ Data_Analysis_for_Life_Sciences4.rmd
http://genomicsclass.github.io/book/pages/distance.html
High Dimensional Data Analysis: Distance and Dimension Reduction
The concept of distance is quite intuitive. For example, when we cluster animals into subgroups, we are implicitly defining a distance that permits us to say what animals are “close” to each other.
Clustering of animals
Many of the analyses we perform with high-dimensional data relate directly or indirectly to distance. Many clustering and machine learning techniques rely on being able to define distance, using features or predictors. For example, to create heatmaps, which are widely used in genomics and other high-throughput fields, a distance is computed explicitly.
Example of heatmap
Image Source: Heatmap, Gaeddal, 01.28.2007
In these plots the measurements, which are stored in a matrix, are represented with colors after the columns and rows have been clustered. (A side note: red and green, a common color theme for heatmaps, are two of the most difficult colors for many color-blind people to discern.) Here we will learn the necessary mathematics and computing skills to understand and create heatmaps. We start by reviewing the mathematical definition of distance.
As a review, let’s define the distance between two points, \(A\) and \(B\), on a Cartesian plane.
The Euclidean distance between \(A\) and \(B\) is simply:
\[\sqrt{ (A_x-B_x)^2 + (A_y-B_y)^2}\]
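As a quick numerical check of this formula (a sketch with made-up coordinates, not data from the text):
A <- c(0,0)
B <- c(3,4)
sqrt(sum((A-B)^2)) ## the classic 3-4-5 triangle: the distance is 5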
We introduce a dataset with gene expression measurements for 22,215 genes from 189 samples. The R objects can be downloaded like this:
#library(devtools)
#install_github("genomicsclass/tissuesGeneExpression")
The data represent RNA expression levels for seven tissues, each with several individuals.
library(tissuesGeneExpression)
data(tissuesGeneExpression)
dim(e) ##e contains the expression data
## [1] 22215   189
#write.csv(e,"tissuesGeneExpression.csv")
tissue
## [1] "kidney" "kidney" "kidney" "kidney" "kidney"
## [6] "kidney" "kidney" "kidney" "kidney" "kidney"
## [11] "kidney" "kidney" "kidney" "kidney" "kidney"
## [16] "kidney" "hippocampus" "hippocampus" "hippocampus" "hippocampus"
## [21] "hippocampus" "hippocampus" "hippocampus" "hippocampus" "hippocampus"
## [26] "hippocampus" "hippocampus" "hippocampus" "hippocampus" "hippocampus"
## [31] "hippocampus" "hippocampus" "hippocampus" "hippocampus" "hippocampus"
## [36] "hippocampus" "hippocampus" "hippocampus" "hippocampus" "hippocampus"
## [41] "hippocampus" "hippocampus" "hippocampus" "hippocampus" "hippocampus"
## [46] "hippocampus" "cerebellum" "cerebellum" "cerebellum" "cerebellum"
## [51] "cerebellum" "cerebellum" "cerebellum" "cerebellum" "cerebellum"
## [56] "cerebellum" "cerebellum" "cerebellum" "cerebellum" "cerebellum"
## [61] "cerebellum" "cerebellum" "cerebellum" "cerebellum" "cerebellum"
## [66] "cerebellum" "cerebellum" "cerebellum" "cerebellum" "cerebellum"
## [71] "cerebellum" "cerebellum" "cerebellum" "kidney" "kidney"
## [76] "kidney" "kidney" "kidney" "kidney" "kidney"
## [81] "kidney" "kidney" "kidney" "kidney" "kidney"
## [86] "kidney" "colon" "colon" "colon" "colon"
## [91] "colon" "colon" "colon" "colon" "colon"
## [96] "colon" "colon" "colon" "colon" "colon"
## [101] "colon" "colon" "colon" "colon" "colon"
## [106] "colon" "colon" "colon" "colon" "colon"
## [111] "colon" "colon" "colon" "colon" "colon"
## [116] "colon" "colon" "colon" "colon" "colon"
## [121] "kidney" "kidney" "kidney" "liver" "liver"
## [126] "liver" "kidney" "kidney" "kidney" "liver"
## [131] "liver" "liver" "kidney" "kidney" "cerebellum"
## [136] "hippocampus" "liver" "liver" "liver" "liver"
## [141] "liver" "liver" "cerebellum" "cerebellum" "liver"
## [146] "liver" "kidney" "kidney" "endometrium" "endometrium"
## [151] "endometrium" "endometrium" "endometrium" "endometrium" "endometrium"
## [156] "endometrium" "endometrium" "endometrium" "endometrium" "endometrium"
## [161] "endometrium" "endometrium" "endometrium" "liver" "liver"
## [166] "liver" "liver" "liver" "liver" "liver"
## [171] "liver" "liver" "liver" "liver" "liver"
## [176] "cerebellum" "cerebellum" "cerebellum" "cerebellum" "cerebellum"
## [181] "cerebellum" "cerebellum" "cerebellum" "placenta" "placenta"
## [186] "placenta" "placenta" "placenta" "placenta"
table(tissue) ##tissue[i] tells us what tissue is represented by e[,i]
## tissue
## cerebellum colon endometrium hippocampus kidney liver
## 38 34 15 31 39 26
## placenta
## 6
We are interested in describing distance between samples in the context of this dataset. We might also be interested in finding genes that behave similarly across samples.
To define distance, we need to know what the points are, since mathematical distance is computed between points. With high-dimensional data, points are no longer on the Cartesian plane. Instead, they are in higher dimensions. For example, sample \(i\) is defined by a point in 22,215-dimensional space: \((Y_{1,i},\dots,Y_{22215,i})^\top\). Feature \(g\) is defined by a point in 189 dimensions: \((Y_{g,1},\dots,Y_{g,189})^\top\).
Once we define points, the Euclidean distance is defined in a very similar way as it is defined for two dimensions. For instance, the distance between two samples \(i\) and \(j\) is:
\[ \mbox{dist}(i,j) = \sqrt{ \sum_{g=1}^{22215} (Y_{g,i}-Y_{g,j })^2 } \]
and the distance between two features \(h\) and \(g\) is:
\[ \mbox{dist}(h,g) = \sqrt{ \sum_{i=1}^{189} (Y_{h,i}-Y_{g,i})^2 } \]
Note: In practice, distances between features are typically computed after standardizing the data for each feature. This is equivalent to computing one minus the correlation. This is done because differences in overall levels between features are often due to technical rather than biological effects. More details on this topic can be found in this presentation.
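To make the connection with correlation concrete, here is a small sketch (not part of the original text) using two arbitrary features of e: after standardizing each feature across samples, the squared distance is just a multiple of one minus their correlation.
xs <- scale(e["210486_at",]) ## center and scale one feature across the 189 samples
ys <- scale(e["200805_at",])
n <- length(xs)
sum((xs-ys)^2) ## squared distance between the standardized features
2*(n-1)*(1-cor(xs,ys)) ## the same value: 2(n-1)(1 - correlation)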
The distance between samples \(i\) and \(j\) can be written as
\[ \mbox{dist}(i,j) = \sqrt{ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) }\]
with \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) denoting columns \(i\) and \(j\) of the data matrix. This result can be very convenient in practice as computations can be made much faster using matrix multiplication.
We can now use the formulas above to compute distance. Let’s compute the distance between samples 1 and 2, both kidneys, and then between sample 1 and sample 87, a colon sample.
x <- e[,1]
y <- e[,2]
z <- e[,87]
tissue[c(1,2,87)]
## [1] "kidney" "kidney" "colon"
sqrt(sum((x-y)^2))
## [1] 85.8546
sqrt(sum((x-z)^2))
## [1] 122.8919
As expected, the kidneys are closer to each other. A faster way to compute this is using matrix algebra:
sqrt( crossprod(x-y) )
## [,1]
## [1,] 85.8546
sqrt( crossprod(x-z) )
## [,1]
## [1,] 122.8919
Now, to compute all the distances at once, we have the function dist. Because it computes the distance between each pair of rows, and here we are interested in the distance between samples, we transpose the matrix:
#?dist
d <- dist(t(e))
class(d)
## [1] "dist"
Note that this produces an object of class dist and, to access the entries using row and column indexes, we need to coerce it into a matrix:
as.matrix(d)[1,2]
## [1] 85.8546
as.matrix(d)[1,87]
## [1] 122.8919
image(as.matrix(d))
Calling image on as.matrix(d) displays a color image of the full 189 by 189 distance matrix. The upper and lower triangles are identical because the distance between x and y is the same as the distance between y and x. Reddish squares appear because the samples are roughly ordered by tissue: samples that are close in the matrix tend to be biologically similar, and each reddish block corresponds to distances between samples of the same tissue. Red indicates smaller distances and yellow indicates larger ones.
It is important to remember that if we run dist on e, it will compute all pairwise distances between genes. This will try to create a \(22215 \times 22215\) matrix that may crash your R session.
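If gene-level distances are really needed, one workaround (a sketch, not from the original text) is to restrict the computation to a manageable subset of rows:
set.seed(1)
idx <- sample(nrow(e),100) ## a hypothetical subset of 100 genes
dgenes <- dist(e[idx,]) ## gene-to-gene distances for the subset only
dim(as.matrix(dgenes)) ## 100 x 100 instead of 22215 x 22215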
http://genomicsclass.github.io/book/pages/distance_exercises.html
Distance Exercises #1
If you have not done so already, install the data package tissuesGeneExpression.
# library(devtools)
# install_github("genomicsclass/tissuesGeneExpression")
The data represent RNA expression levels for seven tissues, each with several biological replicates. We call samples that we consider to be from the same population, such as liver tissue from different individuals, biological replicates:
library(tissuesGeneExpression)
data(tissuesGeneExpression)
head(e)
## GSM11805.CEL.gz GSM11814.CEL.gz GSM11823.CEL.gz GSM11830.CEL.gz
## 1007_s_at 10.191267 10.509167 10.272027 10.252952
## 1053_at 6.040463 6.696075 6.144663 6.575153
## 117_at 7.447409 7.775354 7.696235 8.478135
## 121_at 12.025042 12.007817 11.633279 11.075286
## 1255_g_at 5.269269 5.180389 5.301714 5.372235
## 1294_at 8.535176 8.587241 8.277414 8.603650
## GSM12067.CEL.gz GSM12075.CEL.gz GSM12079.CEL.gz GSM12098.CEL.gz
## 1007_s_at 10.157605 9.966782 9.839348 9.945652
## 1053_at 6.606701 6.060069 6.186596 5.927861
## 117_at 8.116336 7.644452 8.009581 7.847192
## 121_at 10.832528 11.705062 11.706145 11.750370
## 1255_g_at 5.334905 5.253682 5.228794 5.155278
## 1294_at 8.303227 8.341625 8.947253 8.389109
## GSM12105.CEL.gz GSM12268.CEL.gz GSM12270.CEL.gz GSM12283.CEL.gz
## 1007_s_at 9.913031 10.170344 9.457415 10.126887
## 1053_at 6.337478 6.045789 6.280457 5.882025
## 117_at 7.983850 7.544486 8.571040 7.677892
## 121_at 10.706184 11.760161 10.716936 11.664962
## 1255_g_at 5.236442 5.405336 5.343847 5.098063
## 1294_at 8.513929 8.227427 9.068964 7.987842
## GSM12298.CEL.gz GSM12300.CEL.gz GSM12399.CEL.gz GSM12444.CEL.gz
## 1007_s_at 10.466426 9.969730 10.512177 9.954500
## 1053_at 6.203864 5.987046 6.449748 6.161354
## 117_at 7.704939 7.615484 8.474977 7.491579
## 121_at 11.657138 11.436698 10.961178 11.554792
## 1255_g_at 5.258476 5.381385 5.323794 5.238425
## 1294_at 8.335472 8.119172 8.522059 8.424942
## GSM21203.cel.gz GSM21204.cel.gz GSM21205.cel.gz GSM21206.cel.gz
## 1007_s_at 11.166118 10.919300 11.067899 11.157507
## 1053_at 5.853414 5.661951 5.785463 5.831846
## 117_at 7.447074 7.386950 8.160359 11.767046
## 121_at 10.623943 10.986785 11.150964 10.598356
## 1255_g_at 5.268522 5.496033 5.334942 5.204861
## 1294_at 8.401319 7.931769 8.557533 7.605685
## GSM21207.cel.gz GSM21208.cel.gz GSM21209.cel.gz GSM21210.cel.gz
## 1007_s_at 11.390987 10.889327 11.667431 11.222451
## 1053_at 5.924852 5.652696 5.785190 5.712906
## 117_at 7.956786 7.970167 7.666343 7.195045
## 121_at 11.111799 10.926900 11.010148 10.717852
## 1255_g_at 4.956953 5.233504 5.336258 5.445535
## 1294_at 8.439095 7.914012 8.323745 8.154334
## GSM21212.cel.gz GSM21213.cel.gz GSM21214.cel.gz GSM21215.cel.gz
## 1007_s_at 11.898028 11.400538 10.955428 10.361340
## 1053_at 5.676397 5.676046 5.776781 5.663634
## 117_at 7.474182 7.927246 7.493743 7.328729
## 121_at 10.801148 11.029615 10.688563 10.498799
## 1255_g_at 5.260948 5.326644 5.243714 6.086277
## 1294_at 8.444386 8.250350 7.797140 7.648405
## GSM21216.cel.gz GSM21217.cel.gz GSM21218.cel.gz GSM21219.cel.gz
## 1007_s_at 10.771302 10.272978 10.757734 11.371014
## 1053_at 5.791955 5.713266 5.984170 5.736768
## 117_at 7.364371 7.181879 8.525524 7.441310
## 121_at 10.519630 10.375178 10.546843 10.732846
## 1255_g_at 6.872226 7.220032 5.277552 5.532549
## 1294_at 7.790623 7.808684 7.655907 8.105180
## GSM21220.cel.gz GSM21221.cel.gz GSM21222.cel.gz GSM21223.cel.gz
## 1007_s_at 10.914514 10.573316 11.226525 11.243415
## 1053_at 5.850332 5.844528 5.820929 5.768447
## 117_at 9.212210 7.257813 7.564929 7.552887
## 121_at 10.667456 10.701537 11.001622 10.855300
## 1255_g_at 5.933984 6.002629 5.455578 5.393086
## 1294_at 7.903249 7.840000 7.927451 8.206483
## GSM21224.cel.gz GSM21225.cel.gz GSM21226.cel.gz GSM21227.cel.gz
## 1007_s_at 11.575293 11.439544 11.009120 10.642950
## 1053_at 5.874935 5.985295 5.750322 5.608566
## 117_at 7.517278 7.360212 7.508019 7.633826
## 121_at 10.740967 10.677039 11.123188 10.565005
## 1255_g_at 5.325502 5.586701 5.248449 5.257288
## 1294_at 8.086701 7.970317 8.006025 7.946024
## GSM21228.cel.gz GSM21229.cel.gz GSM21230.cel.gz GSM21231.cel.gz
## 1007_s_at 11.296292 11.239662 11.496114 11.077318
## 1053_at 5.822449 6.067430 5.760156 5.957667
## 117_at 7.947366 7.521374 7.787561 7.908426
## 121_at 10.753435 10.868629 10.662749 10.783401
## 1255_g_at 5.530244 5.280227 5.230454 5.191462
## 1294_at 8.080371 7.847783 8.369287 7.818008
## GSM21232.cel.gz GSM21233.cel.gz GSM87058.cel.gz GSM87064.cel.gz
## 1007_s_at 11.032670 10.733382 9.528979 9.674865
## 1053_at 5.688423 5.822977 5.930689 5.903833
## 117_at 8.080432 7.477119 7.423056 8.999608
## 121_at 11.064665 11.096579 11.051851 11.030058
## 1255_g_at 5.177035 5.135013 5.152332 5.122464
## 1294_at 8.354879 7.919090 7.658205 7.688351
## GSM87065.cel.gz GSM87066.cel.gz GSM87071.cel.gz GSM87072.cel.gz
## 1007_s_at 9.575169 9.746007 9.804795 9.597073
## 1053_at 5.901084 5.886079 5.967831 5.760703
## 117_at 7.570569 7.459517 7.396987 7.632634
## 121_at 10.959841 10.811034 10.983345 11.160525
## 1255_g_at 5.155192 5.251132 5.114582 5.031275
## 1294_at 7.546350 7.682545 7.342646 7.583204
## GSM87073.cel.gz GSM87074.cel.gz GSM87075.cel.gz GSM87085.cel.gz
## 1007_s_at 9.611386 9.859942 10.318618 9.864295
## 1053_at 5.911010 6.025677 5.856556 5.753874
## 117_at 7.255444 7.385739 7.048070 7.712646
## 121_at 10.930774 10.828727 10.898723 10.842528
## 1255_g_at 5.135653 5.026157 5.090865 5.264874
## 1294_at 7.547011 7.494077 7.523983 7.657002
## GSM87086.cel.gz GSM87087.cel.gz GSM87088.cel.gz GSM87089.cel.gz
## 1007_s_at 9.798633 9.758049 9.590453 10.382722
## 1053_at 5.862426 5.904910 5.963214 6.084209
## 117_at 7.279199 7.382182 8.185911 7.330931
## 121_at 10.929617 10.822504 11.194947 10.630318
## 1255_g_at 5.087763 5.183853 5.334466 5.152213
## 1294_at 7.674958 7.513357 7.440385 7.547589
## GSM87090.CEL.gz GSM87091.cel.gz GSM87092.cel.gz GSM87093.cel.gz
## 1007_s_at 9.744292 9.748162 9.801949 9.540319
## 1053_at 6.076781 6.150398 6.111117 6.107313
## 117_at 7.563120 7.356163 7.409661 7.244511
## 121_at 10.585539 10.899328 10.956315 10.623056
## 1255_g_at 5.192697 5.107863 5.115326 5.285647
## 1294_at 7.581032 7.521760 7.453925 7.441884
## GSM87094.cel.gz GSM87095.cel.gz GSM87096.cel.gz GSM87097.cel.gz
## 1007_s_at 9.753415 9.616345 10.049081 9.725197
## 1053_at 6.169774 6.194195 6.003842 6.125768
## 117_at 7.338395 7.308199 7.349784 7.310697
## 121_at 10.672483 10.784685 11.023774 10.853290
## 1255_g_at 5.138855 5.150373 5.070798 5.085144
## 1294_at 7.545716 7.565719 7.821006 7.728316
## GSM87098.cel.gz GSM87099.cel.gz GSM87100.cel.gz GSM87101.cel.gz
## 1007_s_at 9.994052 9.695028 9.654518 9.709508
## 1053_at 6.134039 6.214735 6.057829 6.128327
## 117_at 7.784562 7.447524 7.712780 7.421849
## 121_at 11.117425 10.945455 10.739740 10.924573
## 1255_g_at 5.062101 5.259323 5.022970 5.174359
## 1294_at 7.657950 7.770457 7.469018 7.659710
## GSM87102.cel.gz GSM146778.CEL.gz GSM146779.CEL.gz
## 1007_s_at 9.578900 9.864295 9.949632
## 1053_at 6.181457 5.715836 6.582316
## 117_at 7.550353 7.420220 7.651495
## 121_at 10.761253 11.754427 11.518203
## 1255_g_at 5.167119 5.098613 5.091193
## 1294_at 7.558538 8.059662 8.773412
## GSM146781.CEL.gz GSM146782.CEL.gz GSM146783.CEL.gz
## 1007_s_at 10.771730 10.381994 10.008225
## 1053_at 6.381538 5.810887 5.922002
## 117_at 7.795136 7.434689 7.720692
## 121_at 11.258240 11.952897 10.847877
## 1255_g_at 5.126175 5.167193 5.080195
## 1294_at 8.154487 7.986321 8.496441
## GSM146785.CEL.gz GSM146787.CEL.gz GSM146788.CEL.gz
## 1007_s_at 10.067104 10.355680 10.318169
## 1053_at 6.186499 5.921369 6.210221
## 117_at 7.544709 7.730206 7.958453
## 121_at 10.941188 11.325783 11.502778
## 1255_g_at 5.061560 5.117684 5.103714
## 1294_at 8.034500 8.344711 8.174741
## GSM146789.CEL.gz GSM146791.CEL.gz GSM146793.CEL.gz
## 1007_s_at 10.180608 10.330264 9.984370
## 1053_at 5.816531 6.186073 6.190721
## 117_at 7.293380 7.487062 7.423247
## 121_at 11.829316 11.462597 10.831227
## 1255_g_at 5.112861 5.141480 5.127723
## 1294_at 7.896776 7.820158 8.256471
## GSM146795.CEL.gz GSM146797.CEL.gz GSM92240.CEL.gz
## 1007_s_at 9.903414 10.559245 10.658694
## 1053_at 6.229082 6.356300 6.415478
## 117_at 8.537829 8.169326 8.087378
## 121_at 11.312691 11.537642 10.605016
## 1255_g_at 5.068838 5.241162 5.082552
## 1294_at 8.000838 8.110253 8.273383
## GSM92241.CEL.gz GSM92242.CEL.gz GSM92243.CEL.gz GSM92244.CEL.gz
## 1007_s_at 10.967854 10.373213 11.395477 10.822040
## 1053_at 6.439052 6.523120 6.280099 6.377340
## 117_at 7.892145 7.625365 7.829470 7.461025
## 121_at 10.416664 10.644904 10.669002 10.332522
## 1255_g_at 5.172805 5.168378 5.207066 5.152687
## 1294_at 8.296768 7.853151 7.552782 7.998216
## GSM92245.CEL.gz GSM92247.CEL.gz GSM92248.CEL.gz GSM92249.CEL.gz
## 1007_s_at 10.077308 10.856773 11.185204 11.039855
## 1053_at 6.287809 6.209920 6.587097 6.239914
## 117_at 7.806562 7.663545 7.655907 7.939491
## 121_at 10.880915 10.643931 10.709260 10.844060
## 1255_g_at 4.929143 4.954293 5.074027 5.115084
## 1294_at 8.058602 8.162227 8.494932 7.652769
## GSM92250.CEL.gz GSM92253.CEL.gz GSM92254.CEL.gz GSM92255.CEL.gz
## 1007_s_at 11.079038 11.345528 11.682068 10.899083
## 1053_at 6.662952 6.675912 6.274149 6.622457
## 117_at 7.298781 7.560152 7.505362 7.723532
## 121_at 10.725834 10.690677 10.722859 10.595043
## 1255_g_at 5.179113 5.115800 5.000593 5.049459
## 1294_at 8.275607 7.974583 7.912150 8.109445
## GSM92256.CEL.gz GSM92257.CEL.gz GSM92258.CEL.gz GSM92259.CEL.gz
## 1007_s_at 10.979619 11.148514 11.156803 11.047648
## 1053_at 6.001221 6.147199 6.448577 6.695092
## 117_at 7.944837 8.288436 7.468884 7.595644
## 121_at 10.794864 11.005032 10.759466 10.729882
## 1255_g_at 5.065279 5.076212 4.986409 5.131755
## 1294_at 8.301471 8.424359 8.237984 8.699576
## GSM92260.CEL.gz GSM92261.CEL.gz GSM92262.CEL.gz GSM92263.CEL.gz
## 1007_s_at 11.209538 11.120316 10.599085 10.742845
## 1053_at 6.660767 6.293271 5.891772 6.519291
## 117_at 7.361785 7.715139 8.029006 7.376248
## 121_at 10.723057 10.645030 11.081891 10.545488
## 1255_g_at 5.114728 5.159226 5.036999 5.154628
## 1294_at 8.112350 8.146838 8.332777 7.811800
## GSM92264.CEL.gz GSM92265.CEL.gz GSM92266.CEL.gz GSM92267.CEL.gz
## 1007_s_at 11.266456 10.716785 10.828210 11.321320
## 1053_at 6.236967 6.233576 6.643156 5.946477
## 117_at 7.433968 7.760819 8.373691 8.265794
## 121_at 10.458710 10.785353 10.664612 11.039494
## 1255_g_at 5.150165 5.088988 5.157535 5.095064
## 1294_at 8.193679 7.780700 8.627630 8.301142
## GSM92268.CEL.gz GSM92269.CEL.gz GSM92270.CEL.gz GSM92271.CEL.gz
## 1007_s_at 11.152870 10.646973 10.586756 10.652296
## 1053_at 6.729296 6.349878 6.417514 6.318650
## 117_at 7.569348 7.584153 7.492194 7.790239
## 121_at 10.763125 10.402084 10.861418 10.677872
## 1255_g_at 5.053854 5.262421 5.022686 5.251561
## 1294_at 7.988804 8.188457 8.666162 8.467416
## GSM92272.CEL.gz GSM92273.CEL.gz GSM92274.CEL.gz GSM92275.CEL.gz
## 1007_s_at 10.725555 11.276537 10.817390 10.391968
## 1053_at 6.566363 6.651955 6.210328 5.923316
## 117_at 7.775408 8.088513 7.750246 8.042244
## 121_at 10.542663 11.072542 10.749497 10.591697
## 1255_g_at 4.978003 4.976790 5.138461 5.103688
## 1294_at 8.320906 7.986398 8.576420 8.403646
## GSM92276.CEL.gz GSM35979.cel.gz GSM35980.cel.gz GSM35981.cel.gz
## 1007_s_at 11.433151 10.693948 10.883900 10.697330
## 1053_at 6.116081 5.796978 5.944847 5.825616
## 117_at 7.783664 7.943051 7.642011 8.596204
## 121_at 10.916419 12.513036 12.428700 12.507858
## 1255_g_at 5.148009 5.080563 5.037217 5.179477
## 1294_at 8.430276 8.272678 8.282937 8.252376
## GSM35982.cel.gz GSM35983.cel.gz GSM35984.cel.gz GSM35991.cel.gz
## 1007_s_at 9.374745 9.453892 9.423175 10.921897
## 1053_at 6.110668 5.986093 5.881998 5.827053
## 117_at 8.829727 8.831872 8.911270 7.740220
## 121_at 10.963233 11.061160 10.995782 12.530718
## 1255_g_at 5.259811 5.215710 5.059702 5.248926
## 1294_at 8.218427 8.147624 8.072822 8.293092
## GSM35992.cel.gz GSM35993.cel.gz GSM35994.cel.gz GSM35995.cel.gz
## 1007_s_at 10.940071 10.925507 9.138816 8.956655
## 1053_at 5.918342 5.921025 6.227796 6.457326
## 117_at 7.959814 7.583922 8.488102 9.176629
## 121_at 12.599032 12.547042 10.583243 10.491863
## 1255_g_at 5.362281 5.459251 5.053995 5.180982
## 1294_at 8.427539 8.409805 8.162649 8.443923
## GSM35996.cel.gz GSM36003.cel.gz GSM44675.CEL.gz GSM44689.CEL.gz
## 1007_s_at 9.216172 10.599262 10.683256 10.609627
## 1053_at 6.269509 5.679601 6.225941 6.195943
## 117_at 7.763407 8.435550 7.517055 7.557968
## 121_at 10.294510 12.527552 12.198910 10.847165
## 1255_g_at 5.242656 5.050530 5.302377 5.060206
## 1294_at 8.230783 8.513598 7.920678 7.615708
## GSM44697.CEL.gz GSM44702.CEL.gz GSM181429.CEL.gz
## 1007_s_at 10.804661 9.358658 8.082303
## 1053_at 5.944205 6.348221 6.757782
## 117_at 7.504960 8.783509 8.109752
## 121_at 10.780245 11.025208 9.909142
## 1255_g_at 5.415508 5.214369 5.360358
## 1294_at 7.587373 8.050341 8.488678
## GSM181430.CEL.gz GSM181431.CEL.gz GSM181432.CEL.gz
## 1007_s_at 8.672800 8.420832 8.449043
## 1053_at 6.711571 6.529189 6.403990
## 117_at 8.454446 8.121711 7.889064
## 121_at 10.559288 10.153621 10.685172
## 1255_g_at 4.950746 5.267882 5.316058
## 1294_at 8.048616 8.693640 8.697370
## GSM181433.CEL.gz GSM18917.CEL.gz GSM18918.CEL.gz GSM18953.CEL.gz
## 1007_s_at 8.391161 10.341440 10.327703 9.884981
## 1053_at 6.418597 5.888530 5.959998 5.941559
## 117_at 8.297173 8.745092 9.106117 8.905465
## 121_at 10.355128 11.755276 11.710110 11.532460
## 1255_g_at 5.195971 5.244010 5.189766 5.517002
## 1294_at 8.456322 8.510393 8.066057 8.754572
## GSM18954.CEL.gz GSM18955.CEL.gz GSM18956.CEL.gz GSM296875.CEL.gz
## 1007_s_at 9.569800 9.753486 10.059182 10.611813
## 1053_at 5.974681 5.910128 5.884142 6.172908
## 117_at 8.834194 8.302051 8.150492 7.576738
## 121_at 11.273357 11.984180 12.119464 12.239785
## 1255_g_at 5.413542 5.218646 5.130751 5.230148
## 1294_at 8.616671 8.544840 8.585335 8.089192
## GSM296876.CEL.gz GSM296878.CEL.gz GSM296879.CEL.gz
## 1007_s_at 9.743855 10.626825 10.186927
## 1053_at 6.068650 6.113875 6.108924
## 117_at 7.652066 7.404070 8.326484
## 121_at 10.687793 12.115210 11.788156
## 1255_g_at 5.140055 5.082787 5.087534
## 1294_at 8.575553 8.217176 8.107024
## GSM296880.CEL.gz GSM296881.CEL.gz GSM296882.CEL.gz
## 1007_s_at 10.508280 10.159876 10.013988
## 1053_at 6.193704 6.303625 5.979658
## 117_at 7.643538 7.436318 7.449793
## 121_at 11.809707 11.272824 11.669962
## 1255_g_at 5.102386 5.103723 5.010524
## 1294_at 8.425690 8.054656 8.447353
## GSM296883.CEL.gz GSM296886.CEL.gz GSM296887.CEL.gz
## 1007_s_at 10.343827 9.426150 9.732753
## 1053_at 6.115159 6.221048 5.887962
## 117_at 7.618541 7.752332 7.393900
## 121_at 12.159957 10.743438 10.669130
## 1255_g_at 5.200034 5.013313 5.077699
## 1294_at 8.181504 8.458356 8.497650
## GSM296888.CEL.gz GSM296889.CEL.gz GSM296890.CEL.gz
## 1007_s_at 9.505552 9.489496 9.669966
## 1053_at 5.973776 6.045113 5.853426
## 117_at 7.664099 7.309241 7.512285
## 121_at 10.839761 10.444710 10.824855
## 1255_g_at 5.252535 5.023222 5.176021
## 1294_at 8.210043 8.481768 8.603354
## GSM296891.CEL.gz GSM296892.CEL.gz GSM298747.CEL.gz
## 1007_s_at 9.477328 9.645158 8.430692
## 1053_at 6.112785 6.158746 5.845383
## 117_at 8.707780 7.512748 8.786429
## 121_at 10.855914 10.813215 10.407874
## 1255_g_at 5.131532 5.289147 5.242395
## 1294_at 8.631548 8.511193 8.548888
## GSM298748.CEL.gz GSM298749.CEL.gz GSM298750.CEL.gz
## 1007_s_at 8.362492 8.746337 8.280933
## 1053_at 6.072837 6.004847 6.162781
## 117_at 8.886452 8.885333 7.441336
## 121_at 10.448253 10.474274 10.080939
## 1255_g_at 5.308802 5.266128 5.257030
## 1294_at 8.367323 8.530230 8.495641
## GSM299110.CEL.gz GSM299111.CEL.gz GSM299112.CEL.gz
## 1007_s_at 8.430692 8.746337 8.280933
## 1053_at 5.845383 6.004847 6.162781
## 117_at 8.786429 8.885333 7.441336
## 121_at 10.407874 10.474274 10.080939
## 1255_g_at 5.242395 5.266128 5.257030
## 1294_at 8.548888 8.530230 8.495641
## GSM299113.CEL.gz GSM299244.CEL.gz GSM299245.CEL.gz
## 1007_s_at 8.362492 8.601285 8.714787
## 1053_at 6.072837 6.049479 6.214427
## 117_at 8.886452 8.650064 8.092632
## 121_at 10.448253 10.418271 10.264134
## 1255_g_at 5.308802 5.018479 5.308032
## 1294_at 8.367323 8.201954 8.150090
## GSM299246.CEL.gz GSM299247.CEL.gz GSM322969.CEL.gz
## 1007_s_at 8.709645 8.600033 11.920317
## 1053_at 6.125196 6.098775 6.423409
## 117_at 8.292887 8.591519 7.465062
## 121_at 10.545228 10.251836 10.432677
## 1255_g_at 5.124959 5.278551 5.336957
## 1294_at 8.151717 8.652012 8.030188
## GSM323054.CEL.gz GSM323523.CEL.gz GSM323524.CEL.gz
## 1007_s_at 11.919931 11.747558 11.805023
## 1053_at 6.057275 6.192436 6.361326
## 117_at 7.409525 7.454402 7.235285
## 121_at 10.366122 10.172419 10.351052
## 1255_g_at 5.342778 5.335844 5.359751
## 1294_at 7.708620 7.994306 8.124769
## GSM323527.CEL.gz GSM323565.CEL.gz GSM323566.CEL.gz
## 1007_s_at 11.797743 10.040886 11.285002
## 1053_at 6.157979 6.224848 6.170956
## 117_at 7.727192 7.573437 7.323547
## 121_at 10.443504 10.566722 10.250737
## 1255_g_at 5.263736 5.048867 5.062005
## 1294_at 7.825729 7.537564 7.985414
## GSM323567.CEL.gz GSM246492.CEL.gz GSM246493.CEL.gz
## 1007_s_at 9.888693 9.661127 9.803686
## 1053_at 6.211522 6.270153 6.058488
## 117_at 7.228568 7.333568 7.486711
## 121_at 10.191332 9.703713 9.914632
## 1255_g_at 5.186962 5.158631 5.312712
## 1294_at 7.430644 7.821741 7.421166
## GSM246494.CEL.gz GSM307639.CEL.gz GSM307640.CEL.gz
## 1007_s_at 10.509541 9.984502 9.937738
## 1053_at 6.345526 6.715984 6.836179
## 117_at 7.468406 7.120793 7.125811
## 121_at 9.909784 9.409933 9.587782
## 1255_g_at 5.341193 4.896124 5.296695
## 1294_at 7.900080 8.119396 8.005432
## GSM307641.CEL.gz
## 1007_s_at 10.306781
## 1053_at 7.025547
## 117_at 7.407624
## 121_at 9.792904
## 1255_g_at 5.206251
## 1294_at 8.283758
head(tissue)
## [1] "kidney" "kidney" "kidney" "kidney" "kidney" "kidney"
Q: How many biological replicates are there for hippocampus?
sum(tissue=="hippocampus")
## [1] 31
## to answer this question for all tissues look at
table(tissue)
## tissue
## cerebellum colon endometrium hippocampus kidney liver
## 38 34 15 31 39 26
## placenta
## 6
A:31
Distance Exercises #2
What is the distance between samples 3 and 45?
d <- dist(t(e))
class(d)
## [1] "dist"
as.matrix(d)[3,45]
## [1] 152.5662
#or
sqrt( crossprod(e[,3]-e[,45]) )
## [,1]
## [1,] 152.5662
## or
sqrt( sum((e[,3]-e[,45])^2 ))
## [1] 152.5662
A:152.5662
Distance Exercises #3
What is the distance between genes 210486_at and 200805_at?
x<-e["210486_at",]
y<-e["200805_at",]
sqrt( crossprod(x-y) )
## [,1]
## [1,] 41.01153
#or
sqrt( crossprod(e["210486_at",]-e["200805_at",]) )
## [,1]
## [1,] 41.01153
## or
sqrt( sum((e["210486_at",]-e["200805_at",])^2 ))
## [1] 41.01153
A: 41.01153
Distance Exercises #4
If I run the command (don’t run it!) d = as.matrix(dist(e)), how many cells (rows times columns) would the resulting matrix have?
22215*22215
## [1] 493506225
## every pair of rows has an entry:
nrow(e)^2
## [1] 493506225
A:493506225
Distance Exercises #5
Compute the distance between all pairs of samples. Q: How many distances are stored in d? (Hint: what is the length of d?)
d = dist(t(e))
length(d)
## [1] 17766
A:17766
Distance Exercises #6
ncol(e)^2 # Incorrect
## [1] 35721
nrow(e)^2 # Incorrect
## [1] 493506225
ncol(e)*(ncol(e)-1)/2 # Correct: because of symmetry, only the lower triangular part is stored
## [1] 17766
Explanation: Note that the distance between samples i and j is the same as the distance between samples j and i. Also, the distance between a sample and itself is 0. The object returned by dist avoids storing all those values.
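As a quick sanity check (a sketch using the d computed above), the expanded matrix is symmetric with a zero diagonal, which is exactly why only \(n(n-1)/2\) values need to be stored:
dm <- as.matrix(d) ## expand the dist object
all(dm == t(dm)) ## symmetric
all(diag(dm) == 0) ## zero diagonal
length(d) == ncol(e)*(ncol(e)-1)/2 ## only one triangle is stored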
http://genomicsclass.github.io/book/pages/pca_motivation.html
Visualizing data is one of the most important steps, if not the most important, in the analysis of high-throughput data. The right visualization method may reveal problems with the experimental data that render the results of a standard, and typically appropriate, analysis completely useless.
We have shown methods for visualizing global properties of the columns or rows, but plots that reveal relationships between columns or between rows are more complicated due to the high dimensionality of data. For example, to compare each of the 189 samples to each other, we would have to create, for example, 17,766 MA-plots. Creating one single scatterplot of the data is impossible since points are very high dimensional.
We will describe powerful techniques for exploratory data analysis based on dimension reduction. The general idea is to reduce the dataset to have fewer dimensions, yet approximately preserve important properties, such as the distance between samples. If we are able to reduce down to, say, two dimensions, we can then easily make plots. The technique behind it all, the singular value decomposition (SVD), is also useful in other contexts. Before introducing the rather complicated mathematics behind the SVD, we will motivate the ideas behind it with a simple example.
We consider an example with twin heights. Here we simulate 100 two dimensional points that represent the number of standard deviations each individual is from the mean height. Each point is a pair of twins:
Simulated twin pair heights.
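The simulation code is not shown above; here is a sketch consistent with the SVD section later in this chapter (the seed is an assumption added for reproducibility):
library(rafalib)
library(MASS)
set.seed(1) ## assumed seed
n <- 100
y <- t(mvrnorm(n,c(0,0), matrix(c(1,0.95,0.95,1),2,2))) ## 2 x 100 matrix: each column is a twin pair
mypar()
plot(y[1,],y[2,],xlab="Twin 1 (standardized height)",ylab="Twin 2 (standardized height)")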
To help with the illustration, think of this as high-throughput gene expression data with the twin pairs representing the \(N\) samples and the two heights representing gene expression from two genes.
We are interested in the distance between any two samples. We can compute this using dist. For example, here is the distance between the two orange points in the figure above:
d=dist(t(y))
as.matrix(d)[1,2]
## [1] 1.140897
What if making two-dimensional plots were too complex and we were only able to make one-dimensional plots? Can we, for example, reduce the data to a one-dimensional matrix that preserves distances between points?
If we look back at the plot, and visualize a line between any pair of points, the length of this line is the distance between the two points. These lines tend to go along the direction of the diagonal. We have seen before that we can “rotate” the plot so that the diagonal is on the x-axis by making an MA-plot instead:
z1 = (y[1,]+y[2,])/2 #the average
z2 = (y[1,]-y[2,]) #the difference
z = rbind( z1, z2) #matrix now same dimensions as y
thelim <- c(-3,3)
mypar(1,2)
plot(y[1,],y[2,],xlab="Twin 1 (standardized height)",ylab="Twin 2 (standardized height)",xlim=thelim,ylim=thelim)
points(y[1,1:2],y[2,1:2],col=2,pch=16)
plot(z[1,],z[2,],xlim=thelim,ylim=thelim,xlab="Average height",ylab="Difference in height")
points(z[1,1:2],z[2,1:2],col=2,pch=16)
Twin height scatterplot (left) and MA-plot (right).
Later, we will start using linear algebra to represent transformations of the data such as this one. Here we can see that to get z we multiplied y by the matrix:
\[ A = \, \begin{pmatrix} 1/2&1/2\\ 1&-1\\ \end{pmatrix} \implies z = A y \]
Remember that we can transform back by simply multiplying by \(A^{-1}\) as follows:
\[ A^{-1} = \, \begin{pmatrix} 1&1/2\\ 1&-1/2\\ \end{pmatrix} \implies y = A^{-1} z \]
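A quick numerical check (a sketch using the y and z computed above) that this matrix does what the text says:
A <- matrix(c(1/2,1,1/2,-1),2,2) ## rows are (1/2, 1/2) and (1, -1)
max(abs(A%*%y - z)) ## A y reproduces z, up to numerical error
max(abs(solve(A)%*%z - y)) ## A^{-1} z recovers y
solve(A) ## matches the inverse written above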
In the plot above, the distance between the two orange points remains roughly the same, relative to the distance between other points. This is true for all pairs of points. A simple re-scaling of the transformation we performed above will actually make the distances exactly the same. What we will do is multiply by a scalar so that the standard deviations are preserved. If you think of the columns of y as independent random variables with standard deviation \(\sigma\), then note that the standard deviations of \(Z_1\) and \(Z_2\) are:
\[ \mbox{sd}[ Z_1 ] = \mbox{sd}[ (Y_1 + Y_2) / 2 ] = \frac{1}{\sqrt{2}} \sigma \mbox{ and } \mbox{sd}[ Z_2] = \mbox{sd}[ Y_1 - Y_2 ] = {\sqrt{2}} \sigma \]
This implies that if we change the transformation above to:
\[ A = \frac{1}{\sqrt{2}} \begin{pmatrix} 1&1\\ 1&-1\\ \end{pmatrix} \]
then the SDs of the columns of \(Z\) are the same as the SDs of the columns of \(Y\). Also, notice that \(A^\top A=I\). We call matrices with this property orthogonal, and it guarantees the SD-preserving property described above. The distances are now exactly preserved:
A <- 1/sqrt(2)*matrix(c(1,1,1,-1),2,2)
z <- A%*%y
d <- dist(t(y))
d2 <- dist(t(z))
mypar(1,1)
plot(as.numeric(d),as.numeric(d2)) #as.numeric turns distances into a long vector
abline(0,1,col=2)
Distance computed from original data and after rotation is the same.
We call this particular transformation a rotation of y.
mypar(1,2)
thelim <- c(-3,3)
plot(y[1,],y[2,],xlab="Twin 1 (standardized height)",ylab="Twin 2 (standardized height)",xlim=thelim,ylim=thelim)
points(y[1,1:2],y[2,1:2],col=2,pch=16)
plot(z[1,],z[2,],xlim=thelim,ylim=thelim,xlab="Average height",ylab="Difference in height")
points(z[1,1:2],z[2,1:2],col=2,pch=16)
Twin height scatterplot (left) and after rotation (right).
The reason we applied this transformation in the first place was because we noticed that to compute the distances between points, we followed a direction along the diagonal in the original plot, which after the rotation falls on the horizontal, or the first dimension of z. So this rotation actually achieves what we originally wanted: we can preserve the distances between points with just one dimension. Let’s remove the second dimension of z and recompute distances:
d3 = dist(z[1,]) ##distance computed using just first dimension
mypar(1,1)
plot(as.numeric(d),as.numeric(d3))
abline(0,1)
Distance computed with just one dimension after rotation versus actual distance.
The distance computed with just one dimension provides a very good approximation to the actual distance, and a very useful dimension reduction: from 2 dimensions to 1. This first dimension of the transformed data is actually the first principal component. This idea motivates the use of principal component analysis (PCA) and the singular value decomposition (SVD) to achieve dimension reduction more generally.
If you search the web for descriptions of PCA, you will notice a difference in notation to how we describe it here. This mainly stems from the fact that it is more common to have rows represent units. Hence, in the example shown here, \(Y\) would be transposed to be an \(N \times 2\) matrix. In statistics this is also the most common way to represent the data: individuals in the rows. However, for practical reasons, in genomics it is more common to represent units in the columns. For example, genes are rows and samples are columns. For this reason, in this book we explain PCA and all the math that goes with it in a slightly different way than it is usually done. As a result, many of the explanations you find for PCA start out with the sample covariance matrix usually denoted with \(\mathbf{X}^\top\mathbf{X}\) and having cells representing covariance between two units. Yet for this to be the case, we need the rows of \(\mathbf{X}\) to represent units. So in our notation above, you would have to compute, after scaling, \(\mathbf{Y}\mathbf{Y}^\top\) instead.
Basically, if you want our explanations to match others you have to transpose the matrices we show here.
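As a bridge between the two conventions, here is a small sketch (not from the text): running prcomp on the transposed twin matrix, with samples in the rows, gives the same first principal component, up to a possible sign flip, as the SVD-based calculation used in this book.
pc <- prcomp(t(y)) ## statistics convention: samples in rows; prcomp centers each column
sv <- svd(y - rowMeans(y)) ## book convention: samples in columns, row means removed
plot(pc$x[,1], sv$d[1]*sv$v[,1]) ## identical up to a sign flip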
http://genomicsclass.github.io/book/pages/projections.html
Now that we have described the concept of dimension reduction and some of the applications of SVD and principal component analysis, we focus on more details related to the mathematics behind these. We start with projections. A projection is a linear algebra concept that helps us understand many of the mathematical operations we perform on high-dimensional data. For more details, you can review projections in a linear algebra book. Here we provide a quick review and then give some data analysis related examples.
As a review, remember that projections minimize the distance between points and subspace.
Illustration of projection
In the figure above, the arrow on top is pointing to a point in space. In this particular cartoon, the space is two dimensional, but we should be thinking abstractly. The space is represented by the Cartesian plane and the line on which the little person stands is a subspace of points. The projection to this subspace is the place that is closest to the original point. Geometry tells us that we can find this closest point by dropping a perpendicular line (dotted line) from the point to the space. The little person is standing on the projection. The amount this person had to walk from the origin to the new projected point is referred to as the coordinate.
For the explanation of projections, we will use the standard matrix algebra notation for points: \(\vec{y} \in \mathbb{R}^N\) is a point in \(N\)-dimensional space and \(L \subset \mathbb{R}^N\) is a smaller subspace.
If we let \(Y = \begin{pmatrix} 2 \\ 3\end{pmatrix}\), we can plot it like this:
mypar (1,1)
plot(c(0,4),c(0,4),xlab="Dimension 1",ylab="Dimension 2",type="n")
arrows(0,0,2,3,lwd=3)
text(2,3," Y",pos=4,cex=3)
Geometric representation of Y.
We can immediately define a coordinate system by projecting this vector to the space defined by: \(\begin{pmatrix} 1\\ 0\end{pmatrix}\) (the x-axis) and \(\begin{pmatrix} 0\\ 1\end{pmatrix}\) (the y-axis). The projections of \(Y\) to the subspace defined by these points are 2 and 3 respectively:
\[ \begin{align*} Y &= \begin{pmatrix} 2 \\ 3\end{pmatrix} \\ &=2 \begin{pmatrix} 1\\ 0\end{pmatrix} + 3 \begin{pmatrix} 0\\ 1\end{pmatrix} \end{align*}\]
We say that \(2\) and \(3\) are the coordinates and that \(\begin{pmatrix} 1\\ 0\end{pmatrix} \mbox{and} \begin{pmatrix} 0\\1 \end{pmatrix}\) are the bases.
Now let’s define a new subspace. The red line in the plot below is subset \(L\) defined by points satisfying \(c \vec{v}\) with \(\vec{v}=\begin{pmatrix} 2& 1\end{pmatrix}^\top\). The projection of \(\vec{y}\) onto \(L\) is the closest point on \(L\) to \(\vec{y}\). So we need to find the \(c\) that minimizes the distance between \(\vec{y}\) and \(c\vec{v}=(2c,c)\). In linear algebra, we learn that the difference between these points is orthogonal to the space so:
\[ (\vec{y}-\hat{c}\vec{v}) \cdot \vec{v} = 0 \]
this implies that:
\[ \vec{y}\cdot\vec{v} - \hat{c}\vec{v}\cdot\vec{v} = 0 \]
and:
\[\hat{c} = \frac{\vec{y}\cdot\vec{v}} {\vec{v}\cdot\vec{v}} \]
Here the dot \(\cdot\) represents the dot product: \(\,\, \vec{x} \cdot \vec{y} = x_1 y_1+\dots x_n y_n\).
The following R code confirms this equation works:
mypar(1,1)
plot(c(0,4),c(0,4),xlab="Dimension 1",ylab="Dimension 2",type="n")
arrows(0,0,2,3,lwd=3)
abline(0,0.5,col="red",lwd=3) #if x=2c and y=c then slope is 0.5 (y=0.5x)
text(2,3," Y",pos=4,cex=3)
y=c(2,3)
x=c(2,1)
cc = crossprod(x,y)/crossprod(x)
segments(x[1]*cc,x[2]*cc,y[1],y[2],lty=2)
text(x[1]*cc,x[2]*cc,expression(hat(Y)),pos=4,cex=3)
Projection of Y onto new subspace.
Note that if \(\vec{v}\) was such that \(\vec{v}\cdot \vec{v}=1\), then \(\hat{c}\) is simply \(\vec{y} \cdot \vec{v}\) and the space \(L\) does not change. This simplification is one reason we like orthogonal matrices.
Let \(\vec{y} \in \mathbb{R}^N\) and let \(L \subset \mathbb{R}^N\) be the space spanned by:
\[\vec{v}=\begin{pmatrix} 1\\ \vdots \\ 1\end{pmatrix}; L = \{ c \vec{v}; c \in \mathbb{R}\} \]
In this space, all components of the vectors are the same number, so we can think of this space as representing the constants: in the projection each dimension will be the same value. So what \(c\) minimizes the distance between \(c\vec{v}\) and \(\vec{y}\) ?
When talking about problems like this, we sometimes use two-dimensional figures such as the one above. We simply abstract and think of \(\vec{y}\) as a point in \(N\) dimensions and \(L\) as a subspace defined by a smaller number of values, in this case just one: \(c\).
Getting back to our question, we know that the projection is:
\[\hat{c} = \frac{\vec{y}\cdot\vec{v}} {\vec{v}\cdot\vec{v}} \]
which in this case is the average:
\[ \hat{c} = \frac{\vec{y}\cdot\vec{v}} {\vec{v}\cdot\vec{v}} = \frac{\sum_{i=1}^N Y_i}{\sum_{i=1}^N 1} = \bar{Y} \]
Here, it also would have been just as easy to use calculus:
\[\frac{\partial}{\partial c}\sum_{i=1}^N (Y_i - c)^2 = 0 \implies - 2 \sum_{i=1}^N (Y_i - \hat{c}) = 0 \implies\]
\[ N \hat{c} = \sum_{i=1}^N Y_i \implies \hat{c}=\bar{Y}\]
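A quick numerical confirmation (a sketch with hypothetical data) that the projection coordinate onto the all-ones vector is the sample mean:
set.seed(1)
yvec <- rnorm(10) ## hypothetical data vector
v <- rep(1,10) ## the all-ones basis vector
crossprod(yvec,v)/crossprod(v) ## the projection coordinate
mean(yvec) ## the same number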
Let us give a slightly more complicated example. Simple linear regression can also be explained with projections. Our data \(\mathbf{Y}\) (we are no longer going to use the \(\vec{y}\) notation) is again an \(N\)-dimensional vector and our model predicts \(Y_i\) with a line \(\beta_0 + \beta_1 X_i\). We want to find the \(\beta_0\) and \(\beta_1\) that minimize the distance between \(Y\) and the space defined by:
\[ L = \{ \beta_0 \vec{v}_0 + \beta_1 \vec{v}_1 ; \vec{\beta}=(\beta_0,\beta_1) \in \mathbb{R}^2 \}\]
with:
\[ \vec{v}_0= \begin{pmatrix} 1\\ 1\\ \vdots \\ 1\\ \end{pmatrix} \mbox{ and } \vec{v}_1= \begin{pmatrix} X_{1}\\ X_{2}\\ \vdots \\ X_{N}\\ \end{pmatrix} \]
Our \(N\times 2\) matrix \(\mathbf{X}\) is \([ \vec{v}_0 \,\, \vec{v}_1]\) and any point in \(L\) can be written as \(X\vec{\beta}\).
The equation for the multidimensional version of orthogonal projection is:
\[X^\top (\vec{y}-X\vec{\beta}) = 0\]
which we have seen before and gives us:
\[X^\top X \hat{\beta}= X^\top \vec{y} \]
\[\hat{\beta}= (X^\top X)^{-1}X^\top \vec{y}\]
And the projection to \(L\) is therefore:
\[X (X^\top X)^{-1}X^\top \vec{y}\]
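To connect this with the regression tools used earlier in the book, here is a sketch (with simulated data, not from the text) showing that the projection formula reproduces the fitted values from lm:
set.seed(1)
N <- 25
xv <- rnorm(N) ## hypothetical covariate
yv <- 2 + 3*xv + rnorm(N) ## hypothetical response
X <- cbind(1,xv) ## the matrix [v0 v1]
proj <- X %*% solve(crossprod(X)) %*% crossprod(X,yv)
max(abs(proj - fitted(lm(yv~xv)))) ## essentially zero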
http://genomicsclass.github.io/book/pages/rotations.html
One of the most useful applications of projections relates to coordinate rotations. In data analysis, simple rotations can result in data that are easier to visualize and interpret. We will describe the mathematics behind rotations and give some data analysis examples.
In our previous section, we used the following example:
\[ Y = \begin{pmatrix} 2 \\ 3 \end{pmatrix} = 2 \begin{pmatrix} 1\\ 0 \end{pmatrix} + 3 \begin{pmatrix} 0\\ 1 \end{pmatrix} \]
and noted that \(2\) and \(3\) are the coordinates
library(rafalib)
mypar()
plot(c(-2,4),c(-2,4),xlab="Dimension 1",ylab="Dimension 2",type="n",xaxt="n",yaxt="n",bty="n")
text(rep(0,6),c(c(-2,-1),c(1:4)),as.character(c(c(-2,-1),c(1:4))),pos=2)
text(c(c(-2,-1),c(1:4)),rep(0,6),as.character(c(c(-2,-1),c(1:4))),pos=1)
abline(v=0,h=0)
arrows(0,0,2,3,lwd=3)
segments(2,0,2,3,lty=2)
segments(0,3,2,3,lty=2)
text(2,3," Y",pos=4,cex=3)
However, mathematically we can represent the point \((2,3)\) with other linear combinations:
\[ \begin{align*} Y &= \begin{pmatrix} 2 \\ 3\end{pmatrix} \\ &= 2.5 \begin{pmatrix} 1\\ 1\end{pmatrix} + -1 \begin{pmatrix} \phantom{-}0.5\\ -0.5\end{pmatrix} \end{align*}\]
The new coordinates are:
\[Z = \begin{pmatrix} 2.5 \\ -1 \end{pmatrix} \]
Graphically, we can see that the coordinates are the projections to the spaces defined by the new basis:
library(rafalib)
mypar()
plot(c(-2,4),c(-2,4),xlab="Dimension 1",ylab="Dimension 2",type="n",xaxt="n",yaxt="n",bty="n")
text(rep(0,6),c(c(-2,-1),c(1:4)),as.character(c(c(-2,-1),c(1:4))),pos=2)
text(c(c(-2,-1),c(1:4)),rep(0,6),as.character(c(c(-2,-1),c(1:4))),pos=1)
abline(v=0,h=0)
abline(0,1,col="red")
abline(0,-1,col="red")
arrows(0,0,2,3,lwd=3)
y=c(2,3)
x1=c(1,1)##new basis
x2=c(0.5,-0.5)##new basis
c1 = crossprod(x1,y)/crossprod(x1)
c2 = crossprod(x2,y)/crossprod(x2)
segments(x1[1]*c1,x1[2]*c1,y[1],y[2],lty=2)
segments(x2[1]*c2,x2[2]*c2,y[1],y[2],lty=2)
text(2,3," Y",pos=4,cex=3)
We can go back and forth between these two representations of \((2,3)\) using matrix multiplication.
\[ Y = AZ\\ \]
\[ A^{-1} Y = Z\\ \]
\[ A= \begin{pmatrix} 1& \phantom{-}0.5\\ 1 & -0.5\end{pmatrix} \implies A^{-1}= \begin{pmatrix} 0.5& 0.5 \\ 1 &-1\end{pmatrix} \]
\(Z\) and \(Y\) carry the same information, but in a different coordinate system.
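A quick check in R (a sketch) that these two coordinate systems are related exactly as claimed:
A <- matrix(c(1,1,0.5,-0.5),2,2) ## columns are the new basis vectors
Y <- c(2,3)
solve(A) %*% Y ## the new coordinates: 2.5 and -1
A %*% c(2.5,-1) ## back to the original point (2, 3)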
Here are 100 two dimensional points \(Y\)
Here are the rotations: \[Z = A^{-1} Y\]
What we have done here is rotate the data so that the first coordinate of \(Z\) is the average height, while the second is the difference between twin heights.
We have used the singular value decomposition to find principal components. It is sometimes useful to think of the SVD as a rotation, for example \(\mathbf{U}^\top \mathbf{Y}\), that gives us a new coordinate system \(\mathbf{DV}^\top\) in which the dimensions are ordered by how much variance they explain.
We will continue to use this dataset:
library(Biobase)
library(GSE5859Subset)
data(GSE5859Subset)
Projections Exercises #1
Suppose you want to make an MA plot of the first two samples y = geneExpression[,1:2]. Which of the following projections of y gives us new coordinates such that column 2 versus column 1 is an MA plot?
http://genomicsclass.github.io/book/pages/svd.html
In the previous section, we motivated dimension reduction and showed a transformation that permitted us to approximate the distance between two dimensional points with just one dimension. The singular value decomposition (SVD) is a generalization of the algorithm we used in the motivational section. As in the example, the SVD provides a transformation of the original data. This transformation has some very useful properties.
The main result SVD provides is that we can write an \(m \times n\) matrix \(\mathbf{Y}\) as
\[\mathbf{U}^\top\mathbf{Y} = \mathbf{DV}^\top\]
With:
\(\mathbf{U}\) an \(m\times p\) orthogonal matrix, \(\mathbf{V}\) an \(n\times p\) orthogonal matrix, and \(\mathbf{D}\) a \(p\times p\) diagonal matrix, with \(p=\mbox{min}(m,n)\). \(\mathbf{U}^\top\) provides a rotation of our data \(\mathbf{Y}\) that turns out to be very useful because the variability (sum of squares, to be precise) of the rows of \(\mathbf{U}^\top \mathbf{Y}=\mathbf{DV}^\top\) is decreasing. Because \(\mathbf{U}\) is orthogonal, we can write the SVD like this:
\[\mathbf{Y} = \mathbf{UDV}^\top\]
In fact, this formula is much more commonly used. We can also write the transformation like this:
\[\mathbf{YV} = \mathbf{UD}\]
This transformation of \(\mathbf{Y}\) also results in a matrix with columns of decreasing sums of squares.
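Here is a small numerical illustration (a sketch with a random matrix, not from the text) of this claim: the columns of \(\mathbf{YV} = \mathbf{UD}\) have decreasing sums of squares, equal to the squared entries of \(\mathbf{D}\).
set.seed(1)
Ytoy <- matrix(rnorm(50*10),50,10) ## hypothetical 50 x 10 matrix
sv <- svd(Ytoy)
colSums((Ytoy %*% sv$v)^2) ## decreasing sums of squares
sv$d^2 ## the same values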
Applying the SVD to the motivating example we have:
library(rafalib)
library(MASS)
n <- 100
y <- t(mvrnorm(n,c(0,0), matrix(c(1,0.95,0.95,1),2,2)))
mypar()
LIM<-c(-3.5,3.5)
plot(y[1,],y[2,],xlim = LIM,ylim = LIM)
s <- svd(y)
We can immediately see that applying the SVD results in a transformation very similar to the one we used in the motivating example:
round(sqrt(2) * s$u , 3)
## [,1] [,2]
## [1,] -0.982 -1.017
## [2,] -1.017 0.982
The plot we showed after the rotation was showing what we call the principal components: the second plotted against the first. To obtain the principal components from the SVD, we simply need the columns of \(\mathbf{VD}\), i.e., the rows of the rotation \(\mathbf{U}^\top\mathbf{Y}=\mathbf{DV}^\top\):
PC1 = s$d[1]*s$v[,1]
PC2 = s$d[2]*s$v[,2]
plot(PC1,PC2,xlim=c(-3,3),ylim=c(-3,3))
Second PC plotted against first PC for the twins height data.
It is not immediately obvious how incredibly useful the SVD can be, so let’s consider some examples. In this example, we will greatly reduce the dimension of \(V\) and still be able to reconstruct \(Y\).
Let’s compute the SVD on the gene expression table we have been working with. We will take a random subset of 500 genes so that computations are faster.
library(tissuesGeneExpression)
data(tissuesGeneExpression)
set.seed(1)
ind <- sample(nrow(e),500)
Y <- t(apply(e[ind,],1,scale)) #standardize data for illustration
The svd command returns the three matrices (only the diagonal entries are returned for \(D\)):
s <- svd(Y)
U <- s$u
V <- s$v
D <- diag(s$d) ##turn it into a matrix
First note that we can in fact reconstruct Y:
Yhat <- U %*% D %*% t(V)
resid <- Y - Yhat
max(abs(resid))
## [1] 3.552714e-14
If we look at the sum of squares of \(\mathbf{UD}\), we see that the last few are quite close to 0 (perhaps we have some replicated columns).
plot(s$d)
Entries of the diagonal of D for gene expression data.
This implies that the last columns of V have a very small effect on the reconstruction of Y. To see this, consider the extreme example in which the last entry of \(D\) is 0. In this case the last column of \(V\) is not needed at all. Because of the way the SVD is created, the columns of \(V\) have less and less influence on the reconstruction of \(Y\). You commonly see this described as “explaining less variance”. This implies that for a large matrix, by the time you get to the last columns, it is possible that there is not much left to “explain”. As an example, we will look at what happens if we remove the last four columns:
k <- ncol(U)-4
Yhat <- U[,1:k] %*% D[1:k,1:k] %*% t(V[,1:k])
resid <- Y - Yhat
max(abs(resid))
## [1] 3.552714e-14
The largest residual is practically 0, meaning that Yhat is practically the same as Y, yet we need 4 fewer dimensions to transmit the information.
By looking at \(d\), we can see that, in this particular dataset, we can obtain a good approximation keeping only 94 columns. The following plots are useful for seeing how much of the variability is explained by each column:
plot(s$d^2/sum(s$d^2)*100,ylab="Percent variability explained")
Percent variance explained by each principal component of gene expression data.
We can also make a cumulative plot:
plot(cumsum(s$d^2)/sum(s$d^2)*100,ylab="Percent variability explained",ylim=c(0,100),type="l")
Cumulative variance explained by principal components of gene expression data.
Although we start with 189 dimensions, we can approximate \(Y\) with just 95:
k <- 95 ## out of a possible 189
Yhat <- U[,1:k] %*% D[1:k,1:k] %*% t(V[,1:k])
resid <- Y - Yhat
boxplot(resid,ylim=quantile(Y,c(0.01,0.99)),range=0)
Residuals from comparing a reconstructed gene expression table using 95 PCs to the original data with 189 dimensions.
Therefore, by using only half as many dimensions, we retain most of the variability in our data:
var(as.vector(resid))/var(as.vector(Y))
## [1] 0.04076899
We say that we explain 96% of the variability.
Note that we can compute this proportion from \(D\):
1-sum(s$d[1:k]^2)/sum(s$d^2)
## [1] 0.04076899
The entries of \(D\) therefore tell us how much each PC contributes in terms of variability explained.
For the following questions use the data loaded with:
library(tissuesGeneExpression)
data(tissuesGeneExpression)
s = svd(e)
signflips = sample(c(-1,1),ncol(e),replace=TRUE)
signflips
newu = sweep(s$u,2,signflips,FUN="*")
newv = sweep(s$v,2,signflips,FUN="*")
all.equal( s$u %*% diag(s$d) %*% t(s$v), newu %*% diag(s$d) %*% t(newv))
This is important to know because different implementations of the SVD algorithm may give different signs, which can lead to the same code resulting in different answers when run in different computer systems.
SVD Exercises #1
Compute the SVD of e:
library(tissuesGeneExpression)
data(tissuesGeneExpression)
s = svd(e)
Now compute the mean of each row:
m = rowMeans(e)
Q: What is the correlation between the first column of U and m?
cor(s$u[,1],m)
## [1] -0.9999998
A:-0.9999998
SVD Exercises #2
In the above question we saw how the first column relates to the mean of the rows of e. Note that if we change these means, the distances between columns do not change. Here is some R code showing that changing the means does not change the distances:
newmeans = rnorm(nrow(e)) ##random values we will add to create new means
newe = e+newmeans ##we change the means
sqrt(crossprod(e[,3]-e[,45]))
## [,1]
## [1,] 152.5662
sqrt(crossprod(newe[,3]-newe[,45]))
## [,1]
## [1,] 152.5662
So we might as well make the mean of each row 0 since it does not help us approximate the column distances. We will define y as the detrended e and recompute the SVD:
y = e - rowMeans(e)
s = svd(y)
resid = y - s$u %*% diag(s$d) %*% t(s$v)
max(abs(resid))
## [1] 1.188383e-12
x=matrix(rep(c(1,2),each=5),5,2)
x
## [,1] [,2]
## [1,] 1 2
## [2,] 1 2
## [3,] 1 2
## [4,] 1 2
## [5,] 1 2
x*c(1:5)
## [,1] [,2]
## [1,] 1 2
## [2,] 2 4
## [3,] 3 6
## [4,] 4 8
## [5,] 5 10
Note that the above code is actually equivalent to:
sweep(x,1,1:5,"*")
## [,1] [,2]
## [1,] 1 2
## [2,] 2 4
## [3,] 3 6
## [4,] 4 8
## [5,] 5 10
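The same recycling trick applies to the SVD matrices. Here is a sketch (assuming the s = svd(y) computed above) showing that we can avoid forming diag(s$d) explicitly:
vd <- s$d * t(s$v) ## recycles s$d down the rows of t(s$v)
max(abs(vd - diag(s$d) %*% t(s$v))) ## zero: same result without forming the diagonal matrix
## sweep(t(s$v), 1, s$d, "*") gives the same result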
head(diag(s$d)%*%t(s$v))
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 33.367637 41.4831033 41.3231142 40.054572 36.146505 40.214522
## [2,] -27.601399 -36.9912840 3.8035407 -9.754504 -19.703781 5.935035
## [3,] -50.926062 -36.4568742 -36.1291140 -32.010383 -44.331295 -29.773335
## [4,] 4.524792 -1.0473508 0.5466589 4.963027 6.833245 2.071492
## [5,] -21.690513 -10.5313431 -48.1445359 -37.266098 -27.495327 -51.230683
## [6,] -5.230301 -0.7337037 0.6569955 11.035631 6.149118 1.222308
## [,7] [,8] [,9] [,10] [,11] [,12]
## [1,] 32.59423 42.583783 32.465606 42.5402783 39.373302 39.520781
## [2,] -11.28229 2.647384 -27.595578 1.5585022 -24.119388 8.198632
## [3,] -14.29959 -37.908111 -46.801081 -37.0916036 -44.650695 -26.351326
## [4,] 25.35232 2.140486 4.794053 0.6245207 -4.025119 3.115839
## [5,] -34.41662 -47.687154 -9.856736 -44.5472121 -20.512852 -51.098702
## [6,] 19.98217 -1.410014 8.796039 0.9599773 2.024163 2.140379
## [,13] [,14] [,15] [,16] [,17] [,18]
## [1,] 41.3492974 43.0444694 36.0590381 40.925094 -60.457238 -75.211616
## [2,] -19.9228038 3.8530855 -19.4372820 -5.384144 10.227393 12.114447
## [3,] -41.7314839 -37.8437578 -37.3469638 -44.321504 -5.132005 -4.193173
## [4,] 5.2151606 -0.6371225 0.8709797 -1.360416 54.828739 49.993499
## [5,] -26.8829708 -46.7057159 -29.8633453 -41.059262 6.457505 7.554507
## [6,] 0.3630771 -0.4588111 9.7093570 -5.468645 3.352128 1.254389
## [,19] [,20] [,21] [,22] [,23] [,24]
## [1,] -44.15597 -71.769026 -44.25359 -67.307760 -63.4041693 -78.677496
## [2,] 28.95854 3.312265 22.16562 15.080366 16.6166982 14.365586
## ... (output truncated: remaining columns of this 6 x 189 printout omitted) ...
#s$d %*% t(s$v) # incorrect
head(s$d * t(s$v))## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 33.367637 41.4831033 41.3231142 40.054572 36.146505 40.214522
## [2,] -27.601399 -36.9912840 3.8035407 -9.754504 -19.703781 5.935035
## [3,] -50.926062 -36.4568742 -36.1291140 -32.010383 -44.331295 -29.773335
## [4,] 4.524792 -1.0473508 0.5466589 4.963027 6.833245 2.071492
## [5,] -21.690513 -10.5313431 -48.1445359 -37.266098 -27.495327 -51.230683
## [6,] -5.230301 -0.7337037 0.6569955 11.035631 6.149118 1.222308
## ... (output truncated: columns 7 through 189 omitted; the values match the printout above) ...
#t(s$d * s$v) # incorrect
#s$v * s$d # incorrect
SVD Exercises #3
A:tcrossprod(s$u,vd)
Explanation: The order of operations in R goes left to right, and s$u %*% s$d ends up multiplying non-conformable matrices.
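As a quick sanity check, here is a minimal sketch (assuming s = svd(y) with y = e - rowMeans(e), as computed for these exercises, and with vd defined as in the answer option) verifying that the conformable expressions reconstruct the same matrices:
#sketch: check equivalent reconstructions (assumes s = svd(y) from above)
vd <- t(s$d * t(s$v)) #the helper used in the answer: V with columns scaled by d
all.equal(s$u %*% diag(s$d) %*% t(s$v), tcrossprod(s$u, vd)) #same UDV^T reconstruction
all.equal(diag(s$d) %*% t(s$v), s$d * t(s$v)) #recycling d over the rows equals D V^T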
SVD Exercises #4
z = s$d * t(s$v)
sqrt(crossprod(e[,3]-e[,45]))## [,1]
## [1,] 152.5662
sqrt(crossprod(y[,3]-y[,45]))## [,1]
## [1,] 152.5662
sqrt(crossprod(z[,3]-z[,45]))## [,1]
## [1,] 152.5662
Note that the columns of z have 189 entries, compared to 22,215 for e.
What is the difference (in absolute value) between the actual distance, sqrt(crossprod(e[,3]-e[,45])), and the approximation using only two dimensions of z?
realdistance = sqrt(crossprod(e[,3]-e[,45]))
approxdistance = sqrt(crossprod(z[1:2,3]-z[1:2,45]))
abs(realdistance - approxdistance)## [,1]
## [1,] 40.62416
A:40.62416
SVD Exercises #5
What is the minimum number of dimensions we need to use for the approximation in SVD Exercises #4 to be within 10% of the actual distance?
ks = 1:189
realdistance = sqrt(crossprod(e[,3]-e[,45]))
approxdistances = sapply(ks,function(k){
sqrt(crossprod(z[1:k,3,drop=FALSE]-z[1:k,45,drop=FALSE] ))
})
percentdiff = 100*abs(approxdistances - realdistance)/realdistance
plot(ks,percentdiff) ##take a look
min(ks[which(percentdiff < 10)])## [1] 7
A:7
SVD Exercises #6
Compute distances between sample 3 and all other samples:
distances = sqrt(apply(e[,-3]-e[,3],2,crossprod))
Recompute this distance using the 2 dimensional approximation.
What is the Spearman correlation between this approximate distance and the actual distance?
approxdistances = sqrt(apply(z[1:2,-3]-z[1:2,3],2,crossprod))
plot(distances,approxdistances) ##take a look
cor(distances,approxdistances,method="spearman")## [1] 0.8598592
#Note that this shows how just two dimensions can be useful to get an idea about the actual distances.
A:0.8598592
http://genomicsclass.github.io/book/pages/mds.html
We will motivate multi-dimensional scaling (MDS) plots with a gene expression example. To simplify the illustration we will only consider three tissues:
library(rafalib)
library(tissuesGeneExpression)
data(tissuesGeneExpression)
colind <- tissue%in%c("kidney","colon","liver")
mat <- e[,colind]
group <- factor(tissue[colind])
dim(mat)## [1] 22215 99
As an exploratory step, we wish to know whether the gene expression profiles stored in the columns of mat show more similarity within tissues than across tissues. Unfortunately, as mentioned above, we can't plot multi-dimensional points. In general, we prefer two-dimensional plots, but making plots for every pair of genes or every pair of samples is not practical. MDS plots become a powerful tool in this situation.
Now that we know about SVD and matrix algebra, understanding MDS is relatively straightforward. For illustrative purposes let’s consider the SVD decomposition:
\[\mathbf{Y} = \mathbf{UDV}^\top\]
and assume that the sum of squares of the first two rows of \(\mathbf{U}^\top \mathbf{Y}=\mathbf{D}\mathbf{V}^\top\) is much larger than the sum of squares of all the other rows. This can be written as \(d_1+ d_2 \gg d_3 + \dots + d_n\), with \(d_i\) the \(i\)-th entry of the \(\mathbf{D}\) matrix. When this happens, we then have:
\[\mathbf{Y}\approx [\mathbf{U}_1 \mathbf{U}_2] \begin{pmatrix} d_{1}&0\\ 0&d_{2}\\ \end{pmatrix} [\mathbf{V}_1 \mathbf{V}_2]^\top \]
This implies that column \(i\) is approximately:
\[ \mathbf{Y}_i \approx [\mathbf{U}_1 \mathbf{U}_2] \begin{pmatrix} d_{1}&0\\ 0&d_{2}\\ \end{pmatrix} \begin{pmatrix} v_{i,1}\\ v_{i,2}\\ \end{pmatrix} = [\mathbf{U}_1 \mathbf{U}_2] \begin{pmatrix} d_{1} v_{i,1}\\ d_{2} v_{i,2} \end{pmatrix} \]
If we define the following two dimensional vector…
\[\mathbf{Z}_i=\begin{pmatrix} d_{1} v_{i,1}\\ d_{2} v_{i,2} \end{pmatrix} \]
… then
\[ \begin{align*} (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) &\approx \left\{ [\mathbf{U}_1 \mathbf{U}_2] (\mathbf{Z}_i-\mathbf{Z}_j) \right\}^\top \left\{[\mathbf{U}_1 \mathbf{U}_2] (\mathbf{Z}_i-\mathbf{Z}_j)\right\}\\ &= (\mathbf{Z}_i-\mathbf{Z}_j)^\top [\mathbf{U}_1 \mathbf{U}_2]^\top [\mathbf{U}_1 \mathbf{U}_2] (\mathbf{Z}_i-\mathbf{Z}_j) \\ &=(\mathbf{Z}_i-\mathbf{Z}_j)^\top(\mathbf{Z}_i-\mathbf{Z}_j)\\ &=(Z_{i,1}-Z_{j,1})^2 + (Z_{i,2}-Z_{j,2})^2 \end{align*} \]
This derivation tells us that the distance between samples \(i\) and \(j\) is approximated by the distance between the corresponding two-dimensional points \(\mathbf{Z}_i\) and \(\mathbf{Z}_j\):
\[ (\mathbf{Y}_i - \mathbf{Y}_j)^\top(\mathbf{Y}_i - \mathbf{Y}_j) \approx (Z_{i,1}-Z_{j,1})^2 + (Z_{i,2}-Z_{j,2})^2 \]
Because each \(\mathbf{Z}_i\) is a two-dimensional vector, we can visualize the distances between the samples by plotting the first coordinate of each \(\mathbf{Z}_i\) against the second and visually inspecting the distances between points. Here is this plot for our example dataset:
s <- svd(mat-rowMeans(mat))
z<-diag(s$d[1:2]) %*% t(s$v[,1:2])
dim(z)## [1] 2 99
z<-t(z)
plot(z)
Multi-dimensional scaling (MDS) plot for tissue gene expression data.
library(rafalib)
mypar(1,1)
plot(z[,1],z[,2],pch=21,bg=as.numeric(group))
legend("bottomright",levels(group),col=seq(along=levels(group)),pch=15,cex=1.5)
Multi-dimensional scaling (MDS) plot for tissue gene expression data.
PC1 <- s$d[1]*s$v[,1]
PC2 <- s$d[2]*s$v[,2]
mypar(1,1)
plot(PC1,PC2,pch=21,bg=as.numeric(group))
legend("bottomright",levels(group),col=seq(along=levels(group)),pch=15,cex=1.5)
Multi-dimensional scaling (MDS) plot for tissue gene expression data.
Note that the points separate by tissue type, as expected. The accuracy of the approximation above depends on the proportion of variance explained by the first two principal components. As we showed above, we can quickly assess this by plotting the variance explained:
plot(s$d^2/sum(s$d^2))
Variance explained for each principal component.
Although the first two PCs explain over 50% of the variability, there is plenty of information that this plot does not show. However, it is an incredibly useful plot for obtaining, via visualization, a general idea of the distance between points. Also, notice that we can plot other dimensions as well to search for patterns. Here are the 3rd and 4th PCs:
PC3 <- s$d[3]*s$v[,3]
PC4 <- s$d[4]*s$v[,4]
mypar(1,1)
plot(PC3,PC4,pch=21,bg=as.numeric(group))
legend("bottomright",levels(group),col=seq(along=levels(group)),pch=15,cex=1.5)
Third and fourth principal components.
Note that the 4th PC shows a strong separation within the kidney samples. Later we will learn about batch effects, which might explain this finding.
cmdscale
Although we used the svd function above, there is a special function that is specifically made for MDS plots. It takes a distance object as an argument and then uses principal component analysis to provide the best approximation to this distance that can be obtained with \(k\) dimensions. This function is more efficient because one does not have to perform the full SVD, which can be time consuming. By default it returns two dimensions, but we can change that through the parameter k.
d <- dist(t(mat))
mds <- cmdscale(d)
mypar()
plot(mds[,1],mds[,2],bg=as.numeric(group),pch=21,
xlab="First dimension",ylab="Second dimension")
legend("bottomleft",levels(group),col=seq(along=levels(group)),pch=15)
These two approaches are equivalent up to an arbitrary sign change.
mypar(1,2)
for(i in 1:2){
plot(mds[,i],s$d[i]*s$v[,i],main=paste("PC",i))
b = ifelse( cor(mds[,i],s$v[,i]) > 0, 1, -1)
abline(0,b) ##b is 1 or -1 depending on the arbitrary sign "flip"
}
Comparison of MDS first two PCs to SVD first two PCs.
**Why the arbitrary sign?**
The SVD is not unique because we can multiply any column of \(\mathbf{V}\) by -1 as long as we multiply the same column of \(\mathbf{U}\) by -1. We can see this immediately by noting that:
\[ (-1)\mathbf{U}\mathbf{D}(-1)\mathbf{V}^\top = \mathbf{UDV}^\top \]
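Here is a minimal sketch of this fact, assuming s <- svd(mat - rowMeans(mat)) from above (u2, v2, and k are helper names introduced here): flipping the sign of the same column in both U and V leaves the reconstruction unchanged.
#sketch: the SVD is only determined up to column sign flips (assumes s from above)
k <- 1
u2 <- s$u; u2[,k] <- -u2[,k] #flip the sign of the k-th left singular vector
v2 <- s$v; v2[,k] <- -v2[,k] #flip the sign of the k-th right singular vector
all.equal(s$u %*% diag(s$d) %*% t(s$v), u2 %*% diag(s$d) %*% t(v2)) #TRUE: same matrix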
Why we subtract the mean
In all calculations above we subtract the row means before we compute the singular value decomposition. If what we are trying to do is approximate the distance between columns, the distance between \(\mathbf{Y}_i\) and \(\mathbf{Y}_j\) is the same as the distance between \(\mathbf{Y}_i- \mathbf{\mu}\) and \(\mathbf{Y}_j - \mathbf{\mu}\) since the \(\mu\) cancels out when computing said distance:
\[ \left\{ ( \mathbf{Y}_i- \mathbf{\mu} ) - ( \mathbf{Y}_j - \mathbf{\mu} ) \right\}^\top \left\{ (\mathbf{Y}_i- \mathbf{\mu}) - (\mathbf{Y}_j - \mathbf{\mu} ) \right\} = \left\{ \mathbf{Y}_i- \mathbf{Y}_j \right\}^\top \left\{ \mathbf{Y}_i - \mathbf{Y}_j \right\} \]
Because removing the row averages reduces the total variation, it can only make the SVD approximation better.
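Here is a small numeric check of this cancellation (a sketch assuming e from tissuesGeneExpression and using two arbitrary columns; ycentered is a helper name introduced here): the distance between two columns is identical before and after removing the row means.
#sketch: removing row means does not change distances between columns
ycentered <- e - rowMeans(e)
sqrt(crossprod(e[,3] - e[,45])) #distance computed on the original data
sqrt(crossprod(ycentered[,3] - ycentered[,45])) #same value after centering the rows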
http://genomicsclass.github.io/book/pages/mds_exercises.html
For the following questions use the data loaded with:
library(tissuesGeneExpression)
data(tissuesGeneExpression)
MDS Exercises #1
In these exercises we will demonstrate the relationship between the SVD and the output of cmdscale, the function in R that performs MDS.
Using the z we computed in SVD Exercises #4
y = e - rowMeans(e)
s = svd(y)
z = s$d * t(s$v)
we can make an MDS plot:
library(rafalib)
ftissue = factor(tissue)
mypar(1,1)
plot(z[1,],z[2,],col=as.numeric(ftissue))
legend("topleft",levels(ftissue),col=seq_along(ftissue),pch=1)
Now run the function cmdscale on the original data:
d = dist(t(e))
mds = cmdscale(d)
Q:What is the correlation between the first row of z and the first column of mds?
cor(z[1,],mds[,1])## [1] -1
A:-1
MDS Exercises #2
What is the correlation between the second row of z and the second column of mds?
cor(z[2,],mds[,2])## [1] -1
A:-1
MDS Exercises #3
Note that the mds plot is not the same:
library(rafalib)
ftissue = factor(tissue)
mypar(1,2)
plot(z[1,],z[2,],col=as.numeric(ftissue))
legend("topleft",levels(ftissue),col=seq_along(ftissue),pch=1)
plot(mds[,1],mds[,2],col=as.numeric(ftissue))
Given the answer to MDS Exercises #1, what do we have to do to z[1,] and z[2,] to get a practically identical plot?
A:multiply z[1,] and z[2,] by -1
Explanation:From the answer to the previous question we see that we have a flipped sign in both dimensions.
MDS Exercises #4
Load the following dataset
library(GSE5859Subset)
data(GSE5859Subset)
dim(geneExpression)## [1] 8793 24
write.csv(geneExpression,"geneExpression.csv")
Compute the svd and compute z
s = svd(geneExpression-rowMeans(geneExpression))
z = s$d * t(s$v)
Q:Which dimension of z most correlates with the outcome sampleInfo$group?
which.max(cor(sampleInfo$g,t(z)))## [1] 1
A:1
MDS Exercises #5
Load the following dataset
library(GSE5859Subset)
data(GSE5859Subset)
Compute the svd and compute z
s = svd(geneExpression-rowMeans(geneExpression))
z = s$d * t(s$v)
Q:What is this max correlation?
max(cor(sampleInfo$g,t(z)))## [1] 0.6236858
A:0.6236858
MDS Exercises #6
Load the following dataset
library(GSE5859Subset)
data(GSE5859Subset)
Compute the svd and compute z
s = svd(geneExpression-rowMeans(geneExpression))
z = s$d * t(s$v)
Q:Which dimension of z has the second highest correlation with the outcome sampleInfo$group?
which.max(cor(sampleInfo$g,t(z))[-1]) + 1 #We add 1 because we took out the first.## [1] 6
A:6
MDS Exercises #7
Note these measurements were made during two months:
sampleInfo$date## [1] "2005-06-23" "2005-06-27" "2005-06-27" "2005-10-28" "2005-10-07"
## [6] "2005-10-07" "2005-10-28" "2005-10-07" "2005-10-07" "2005-10-07"
## [11] "2005-10-07" "2005-10-07" "2005-06-27" "2005-06-23" "2005-06-10"
## [16] "2005-06-23" "2005-06-23" "2005-06-27" "2005-06-23" "2005-06-27"
## [21] "2005-06-23" "2005-10-07" "2005-10-07" "2005-10-07"
We can extract the month this way:
month = format( sampleInfo$date, "%m")
month = factor( month)
month## [1] 06 06 06 10 10 10 10 10 10 10 10 10 06 06 06 06 06 06 06 06 06 10 10
## [24] 10
## Levels: 06 10
Q:Which dimension of z has the highest correlation with the outcome month?
which.max(cor( as.numeric(month), t(z)))## [1] 1
A:1
What is this correlation?
max(cor( as.numeric(month), t(z)))## [1] 0.8297915
A: 0.8297915
Note that the same dimension is correlated with both the group and the date. Note also that these are correlated:
table(sampleInfo$g,month)## month
## 06 10
## 0 9 3
## 1 3 9
So is this first dimension related directly to group or is it related only through the month? Note that the correlation with month is higher. This is related to batch effects which we will learn about later.
MDS Exercises #8 (ADVANCED)
Note: this is an advanced question. Please feel free to discuss it on the forum.
In MDS Exercises #7 we saw that one of the dimensions was highly correlated with sampleInfo$group. Now take the 6th column of U and stratify by the gene chromosome. Remove chrUn and make a boxplot of the values of U6 stratified by chromosome.
Which chromosome looks different from the rest? Copy and paste the name as it appears in geneAnnotation
result = split(s$u[,6],geneAnnotation$CHR)
result = result[ which(names(result)!="chrUn") ]
boxplot(result,range=0)
boxplot(result,range=0,ylim=c(-0.025,0.025))
medians = sapply(result,median)
names(result)[ which.max(abs(medians)) ]## [1] "chrY"
A:chrY
Given the answer to this question, any guesses as to what sampleInfo$group represents?
http://genomicsclass.github.io/book/pages/PCA.html
We have already mentioned principal component analysis (PCA) above and noted its relation to the SVD. Here we provide further mathematical details.
We started the motivation for dimension reduction with a simulated example and showed a rotation that is very much related to PCA.
Twin heights scatter plot.
Here we explain specifically what the principal components (PCs) are.
Let \(\mathbf{Y}\) be a \(2 \times N\) matrix representing our data. The analogy is that we measure expression from 2 genes and each column is a sample. Suppose we are given the task of finding a \(2 \times 1\) vector \(\mathbf{u}_1\) such that \(\mathbf{u}_1^\top \mathbf{u}_1 = 1\) and it maximizes \[(\mathbf{u}_1^\top\mathbf{Y})^\top (\mathbf{u}_1^\top\mathbf{Y})\]. This can be viewed as a projection of each sample or column of \(\mathbf{Y}\) onto the subspace spanned by \(\mathbf{u}_1\). So we are looking for a transformation in which the coordinates show high variability.
Let’s try \(\mathbf{u}=(1,0)^\top\). This projection simply gives us the height of twin 1 shown in orange below. The sum of squares is shown in the title.
mypar(1,1)
plot(t(Y), xlim=thelim, ylim=thelim,
main=paste("Sum of squares :",round(crossprod(Y[1,]),1)))
abline(h=0)
apply(Y,2,function(y) segments(y[1],0,y[1],y[2],lty=2))## NULL
points(Y[1,],rep(0,ncol(Y)),col=2,pch=16,cex=0.75)
Can we find a direction with higher variability? How about:
\(\mathbf{u} =\begin{pmatrix}1\\-1\end{pmatrix}\) ? This does not satisfy \(\mathbf{u}^\top\mathbf{u}= 1\) so let’s instead try \(\mathbf{u} =\begin{pmatrix}1/\sqrt{2}\\-1/\sqrt{2}\end{pmatrix}\)
u <- matrix(c(1,-1)/sqrt(2),ncol=1)
w=t(u)%*%Y
mypar(1,1)
plot(t(Y),
main=paste("Sum of squares:",round(tcrossprod(w),1)),xlim=thelim,ylim=thelim)
abline(h=0,lty=2)
abline(v=0,lty=2)
abline(0,-1,col=2)
Z = u%*%w
for(i in seq(along=w))
segments(Z[1,i],Z[2,i],Y[1,i],Y[2,i],lty=2)
points(t(Z), col=2, pch=16, cex=0.5)
Data projected onto space spanned by (1 0).
This relates to the difference between twins, which we know is small. The sum of squares confirms this.
Finally, let’s try:
\(\mathbf{u} =\begin{pmatrix}1/\sqrt{2}\\1/\sqrt{2}\end{pmatrix}\)
u <- matrix(c(1,1)/sqrt(2),ncol=1)
w=t(u)%*%Y
mypar()
plot(t(Y), main=paste("Sum of squares:",round(tcrossprod(w),1)),
xlim=thelim, ylim=thelim)
abline(h=0,lty=2)
abline(v=0,lty=2)
abline(0,1,col=2)
points(u%*%w, col=2, pch=16, cex=1)
Z = u%*%w
for(i in seq(along=w))
segments(Z[1,i], Z[2,i], Y[1,i], Y[2,i], lty=2)
points(t(Z),col=2,pch=16,cex=0.5)
Data projected onto space spanned by first PC.
This is a re-scaled average height, which has a higher sum of squares. There is a mathematical procedure for determining which \(\mathbf{u}\) maximizes the sum of squares, and the SVD provides it for us.
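To check this numerically, here is a sketch that assumes Y is the 2 x N simulated twin-height matrix used in the plots above (sy and u1 are helper names introduced here): the first left singular vector achieves a sum of squares at least as large as the direction we tried by hand, and that maximum equals the first singular value squared.
#sketch: the SVD returns the unit vector with the largest sum of squares (assumes Y from above)
sy <- svd(Y)
u1 <- sy$u[,1,drop=FALSE] #first left singular vector
tcrossprod(t(u1) %*% Y) #sum of squares achieved by u1; equals sy$d[1]^2
sy$d[1]^2
tcrossprod(t(matrix(c(1,1)/sqrt(2),ncol=1)) %*% Y) #the direction tried above does no better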
The unit vector \(\mathbf{u}_1\) that maximizes the sum of squares
\[(\mathbf{u}_1^\top\mathbf{Y})^\top(\mathbf{u}_1^\top\mathbf{Y})\]
defines the first PC: \(\mathbf{u}_1^\top\mathbf{Y}\) is referred to as the first PC. The weights \(\mathbf{u}_1\) used to obtain this PC are referred to as the loadings. Using the language of rotations, \(\mathbf{u}_1\) is also referred to as the direction of the first PC, and \(\mathbf{u}_1^\top\mathbf{Y}\) gives the new coordinates.
To obtain the second PC, we repeat the exercise above, but for the residuals:
\[\mathbf{r} = \mathbf{Y} - \mathbf{u}_1 \mathbf{u}_1^\top \mathbf{Y} \]
The second PC direction is the vector \(\mathbf{u}_2\) with the following properties:
\[ \mathbf{u}_2^\top \mathbf{u}_2=1\]
\[ \mathbf{u}_2^\top \mathbf{u}_1=0\]
and it maximizes \((\mathbf{u}_2^\top\mathbf{r})^\top (\mathbf{u}_2^\top\mathbf{r})\).
When \(\mathbf{Y}\) is \(m \times N\) we repeat the procedure to find the 3rd, 4th, …, \(m\)-th PCs.
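The deflation idea above can be checked directly; this is a sketch assuming Y is the simulated twin-height matrix from above (sy, u1, and res are helper names introduced here): the leading direction of the residuals matches the second column of U up to an arbitrary sign.
#sketch: finding the second PC direction from the residuals (assumes Y from above)
sy <- svd(Y)
u1 <- sy$u[,1,drop=FALSE]
res <- Y - u1 %*% (t(u1) %*% Y) #residuals after removing the projection onto u1
svd(res)$u[,1] #matches sy$u[,2] up to sign
sy$u[,2]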
prcomp
We have shown how to obtain PCs using the SVD. However, R has a function specifically designed to find the principal components. In this case, the data is centered by default. The following function:
pc <- prcomp( t(Y) )
produces the same results as the SVD up to arbitrary sign flips:
s <- svd( Y - rowMeans(Y) )
mypar(1,2)
for(i in 1:nrow(Y) ){
plot(pc$x[,i], s$d[i]*s$v[,i])
}
Plot showing SVD and prcomp give same results.
The loadings can be found this way:
pc$rotation## PC1 PC2
## [1,] 0.7072304 0.7069831
## [2,] 0.7069831 -0.7072304
which are equivalent (up to a sign flip) to:
s$u## [,1] [,2]
## [1,] -0.7072304 -0.7069831
## [2,] -0.7069831 0.7072304
The equivalent of the variance explained is included in the sdev component:
pc$sdev## [1] 1.2542672 0.2141882
We take the transpose of Y because prcomp assumes the previously discussed ordering: units/samples in the rows and features in the columns.
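As a further check, here is a sketch (assuming pc, s, and Y from above; N is a helper name introduced here) of how pc$sdev relates to the singular values: it is d scaled by 1/sqrt(N-1), where N is the number of samples, so the proportions of variance explained agree.
#sketch: relation between prcomp's sdev and the singular values (assumes pc, s, Y from above)
N <- ncol(Y) #number of samples (columns of Y)
all.equal(pc$sdev, s$d/sqrt(N-1)) #sdev is d scaled by 1/sqrt(N-1)
all.equal(pc$sdev^2/sum(pc$sdev^2), s$d^2/sum(s$d^2)) #same proportion of variance explained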
http://genomicsclass.github.io/book/pages/clustering_and_heatmaps.html
**Basic Machine Learning**
Machine learning is a very broad topic and a highly active research area. In the life sciences, much of what is described as "precision medicine" is an application of machine learning to biomedical data. The general idea is to predict or discover outcomes from measured predictors. Can we discover new types of cancer from gene expression profiles? Can we predict drug response from a series of genotypes? Here we give a very brief introduction to two major machine learning components: clustering and class prediction. There are many good resources to learn more about machine learning, for example this book.
We will demonstrate the concepts and code needed to perform clustering analysis with the tissue gene expression data:
library(tissuesGeneExpression)
data(tissuesGeneExpression)
To illustrate the main application of clustering in the life sciences, let's pretend that we don't know these are different tissues and are interested in clustering. The first step is to compute the distance between each sample:
d <- dist( t(e) )
head(d)## [1] 85.85460 84.38089 101.09763 90.28968 83.28734 104.89770
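To connect dist back to the formula used earlier, here is a small sketch (assuming e and d from above): an entry of the distance object is exactly the Euclidean distance between the corresponding columns of e.
#sketch: dist() entries are the column-wise Euclidean distances computed earlier
as.matrix(d)[3,45] #distance between samples 3 and 45 taken from the dist object
sqrt(crossprod(e[,3] - e[,45])) #same value computed directly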
There’s two types of hierarchical clustering, top down and bottom up.
With the distance between each pair of samples computed, we need clustering algorithms to join them into groups. Hierarchical clustering is one of the many clustering algorithms available to do this. Each sample is assigned to its own group and then the algorithm continues iteratively, joining the two most similar clusters at each step, and continuing until there is just one group. While we have defined distances between samples, we have not yet defined distances between groups. There are various ways this can be done and they all rely on the individual pairwise distances. The helpfile for hclust includes detailed information.
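For example, here is a sketch (assuming d from above; hc_complete, hc_average, and hc_single are helper names introduced here) of the method argument of hclust, which selects how the distance between groups is computed; "complete" is the default used below.
#sketch: different linkage rules for the distance between clusters (assumes d from above)
hc_complete <- hclust(d, method="complete") #default: farthest pair between the two clusters
hc_average <- hclust(d, method="average") #average of all pairwise distances
hc_single <- hclust(d, method="single") #closest pair between the two clusters
table(complete=cutree(hc_complete,k=8), average=cutree(hc_average,k=8)) #compare the groupings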
We can perform hierarchical clustering based on the distances defined above using the hclust function. This function returns an hclust object that describes the groupings that were created using the algorithm described above. The plot method represents these relationships with a tree or dendrogram:
library(rafalib)
mypar()
hc <- hclust(d)
hc##
## Call:
## hclust(d = d)
##
## Cluster method : complete
## Distance : euclidean
## Number of objects: 189
plot(hc,labels=tissue,cex=0.5)
Dendrogram showing hierarchical clustering of tissue gene expression data.
Does this technique “discover” the clusters defined by the different tissues? In this plot, it is not easy to see the different tissues, so we add colors by using the myplclust function from the rafalib package.
myplclust(hc, labels=tissue, lab.col=as.fumeric(tissue), cex=0.5)
Dendrogram showing hierarchical clustering of tissue gene expression data with colors denoting tissues.
Visually, it does seem as if the clustering technique has discovered the tissues. However, hierarchical clustering does not define specific clusters, but rather defines the dendrogram above. From the dendrogram we can decipher the distance between any two groups by looking at the height at which the two groups split into two. To define clusters, we need to “cut the tree” at some distance and group all samples that are within that distance into groups below. To visualize this, we draw a horizontal line at the height we wish to cut; the groups hanging below this line form the clusters. We use 120 as an example:
myplclust(hc, labels=tissue, lab.col=as.fumeric(tissue),cex=0.5)
abline(h=120)
Dendrogram showing hierarchical clustering of tissue gene expression data with colors denoting tissues. Horizontal line defines actual clusters.
If we use the line above to cut the tree into clusters, we can examine how the clusters overlap with the actual tissues:
hclusters <- cutree(hc, h=120)
table(true=tissue, cluster=hclusters)## cluster
## true 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## cerebellum 0 0 0 0 31 0 0 0 2 0 0 5 0 0
## colon 0 0 0 0 0 0 34 0 0 0 0 0 0 0
## endometrium 0 0 0 0 0 0 0 0 0 0 15 0 0 0
## hippocampus 0 0 12 19 0 0 0 0 0 0 0 0 0 0
## kidney 9 18 0 0 0 10 0 0 2 0 0 0 0 0
## liver 0 0 0 0 0 0 0 24 0 2 0 0 0 0
## placenta 0 0 0 0 0 0 0 0 0 0 0 0 2 4
We can also ask cutree to give us back a given number of clusters. The function then automatically finds the height that results in the requested number of clusters:
hclusters <- cutree(hc, k=8)
table(true=tissue, cluster=hclusters)## cluster
## true 1 2 3 4 5 6 7 8
## cerebellum 0 0 31 0 0 2 5 0
## colon 0 0 0 34 0 0 0 0
## endometrium 15 0 0 0 0 0 0 0
## hippocampus 0 12 19 0 0 0 0 0
## kidney 37 0 0 0 0 2 0 0
## liver 0 0 0 0 24 2 0 0
## placenta 0 0 0 0 0 0 0 6
In both cases we do see that, with some exceptions, each tissue is uniquely represented by one of the clusters. In some instances, a single tissue is spread across two clusters, which is due to selecting too many clusters. Selecting the number of clusters is generally a challenging step in practice and an active area of research.
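One simple, informal way to explore this choice, sketched here with the objects defined above (hc and tissue), is to tabulate the agreement with the known tissue labels for several values of k:
#sketch: informal exploration of the number of clusters (assumes hc and tissue from above)
for(k in c(4,8,12)){
  cat("k =",k,"\n")
  print(table(true=tissue, cluster=cutree(hc,k=k)))
}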
Hierarchical Clustering Exercises #1
Create a random matrix with no correlation in the following way:
set.seed(1)
m = 10000
n = 24
x = matrix(rnorm(m*n),m,n)
colnames(x)=1:n
head(x)## 1 2 3 4 5 6
## [1,] -0.6264538 -0.8043316 0.2353485 0.6179223 -0.2212571 0.5258908
## [2,] 0.1836433 -1.0565257 0.2448250 0.8935057 0.3517935 -0.4875444
## [3,] -0.8356286 -1.0353958 -0.6421869 -0.4277562 0.1606019 1.1382508
## [4,] 1.5952808 -1.1855604 -1.9348085 -0.2999012 -0.1240523 1.2151344
## [5,] 0.3295078 -0.5004395 1.0386957 -0.5319833 0.6598739 -0.4248307
## [6,] -0.8204684 -0.5249887 -0.2835501 1.7059816 -0.5038493 -1.4508403
## 7 8 9 10 11
## [1,] 0.3413341 -1.00203611 -1.55915937 -0.09504307 0.7914415
## [2,] 0.4136665 0.02590761 0.20166217 -0.38805939 0.3921679
## [3,] 0.1220357 -0.44814178 1.04017610 2.13657003 -0.4726670
## [4,] -1.5893806 0.84323332 0.07195772 0.55661945 -0.4579517
## [5,] -0.7874385 -0.21846310 -0.01526544 -0.59094164 -0.1681319
## [6,] -1.5920640 0.47678629 0.33938598 1.52014345 0.5856737
## 12 13 14 15 16
## [1,] 0.0145724495 0.8663644 -0.1604426 0.249371112 -1.2520152
## [2,] 1.7854043337 -0.9476952 -0.9241849 -0.335346796 -0.3351313
## [3,] 0.0002997544 0.4522428 -1.5561751 0.004405287 0.1080678
## [4,] 0.4356948690 0.2782408 0.8812202 0.986768348 -0.4717051
## [5,] -1.4076452475 1.4175945 -0.5263595 0.543575705 2.5070607
## [6,] -0.6929698698 0.6329981 -0.4627372 -0.626142823 -1.2451344
## 17 18 19 20 21 22
## [1,] 2.3150930 0.4414410 0.82485558 1.37086468 -0.2240021 0.07267180
## [2,] 1.0603800 0.4130862 0.74402087 0.54569610 -0.8039004 -0.75752112
## [3,] -0.3970672 0.8660777 -0.69009734 -1.62446330 1.1393495 0.76525702
## [4,] 0.4840034 -2.2615708 1.76900681 -0.06247283 0.9129255 -0.04403916
## [5,] -1.3584146 -0.1787018 -0.55215640 0.57021255 0.3448395 0.76030091
## [6,] 0.6370574 -0.4713582 -0.03257056 -0.31350574 0.7719104 0.40134573
## 23 24
## [1,] 1.3641735 0.4431154
## [2,] 0.2159377 0.9536824
## [3,] 1.0729733 -0.8278467
## [4,] 1.2573367 0.2431909
## [5,] -0.7143611 0.6492437
## [6,] -0.1038661 -1.4455379
Run hierarchical clustering on this data with the hclust function with default parameters to cluster the columns. Create a dendrogram.
From the dendrogram which pairs of samples are the furthest away from each other?
hc = hclust( dist( t(x)))
plot(hc)
A:19 & 14
Hierarchical Clustering Exercises #2
Set the seed at 1 with set.seed(1) and replicate the creation of this matrix 100 times. Then perform hierarchical clustering as in the solution to the question above, and find the number of clusters if you use cutree at height 143. Note that this number is a random variable.
Based on the Monte Carlo simulation, what is the standard error of this random variable?
set.seed(1)
m = 10000
n = 24
nc = replicate(100,{
x = matrix(rnorm(m*n),m,n)
hc = hclust( dist( t(x)))
length(unique(cutree(hc,h=143)))
})
plot(table(nc)) ## look at the distribution
popsd(nc)## [1] 0.8986657
A:0.8986657
We can also use the kmeans function to perform k-means clustering. As an example, let's run k-means on the samples in the space of the first two genes:
set.seed(1)
km <- kmeans(t(e[1:2,]), centers=7)
names(km)## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
table(tissue,clusters=km$cluster)## clusters
## tissue 1 2 3 4 5 6 7
## cerebellum 0 1 8 0 6 0 23
## colon 2 11 2 15 4 0 0
## endometrium 0 3 4 0 0 0 8
## hippocampus 19 0 2 0 10 0 0
## kidney 7 8 20 0 0 0 4
## liver 0 0 0 0 0 18 8
## placenta 0 4 0 0 0 0 2
mypar(1,2)
plot(e[1,], e[2,], col=as.fumeric(tissue), pch=16)
plot(e[1,], e[2,], col=km$cluster, pch=16)
Plot of gene expression for first two genes (order of appearance in data) with color representing tissue (left) and clusters found with kmeans (right).
In the first plot, color represents the actual tissues, while in the second, color represents the clusters that were defined by kmeans. We can see from tabulating the results that this particular clustering exercise did not perform well:
table(true=tissue,cluster=km$cluster)## cluster
## true 1 2 3 4 5 6 7
## cerebellum 0 1 8 0 6 0 23
## colon 2 11 2 15 4 0 0
## endometrium 0 3 4 0 0 0 8
## hippocampus 19 0 2 0 10 0 0
## kidney 7 8 20 0 0 0 4
## liver 0 0 0 0 0 18 8
## placenta 0 4 0 0 0 0 2
This is very likely due to the fact that the first two genes are not informative regarding tissue type. We can see this in the first plot above. If we instead perform k-means clustering using all of the genes, we obtain a much improved result. To visualize this, we can use an MDS plot:
km <- kmeans(t(e), centers=7)
d<-dist(t(e))
mds <- cmdscale(d)
mypar(1,2)
plot(mds[,1], mds[,2])
plot(mds[,1], mds[,2], col=km$cluster, pch=16)
Plot of gene expression for first two PCs with color representing tissues (left) and clusters found using all genes (right).
By tabulating the results, we see that we obtain a similar answer to that obtained with hierarchical clustering.
table(true=tissue,cluster=km$cluster)## cluster
## true 1 2 3 4 5 6 7
## cerebellum 0 0 5 0 31 2 0
## colon 0 34 0 0 0 0 0
## endometrium 0 15 0 0 0 0 0
## hippocampus 0 0 31 0 0 0 0
## kidney 0 37 0 0 0 2 0
## liver 2 0 0 0 0 0 24
## placenta 0 0 0 6 0 0 0
K-means Exercises #1
Run kmeans with 5 centers for the blood RNA data:
library(GSE5859Subset)
data(GSE5859Subset)
Set the seed to 10, set.seed(10), right before running kmeans with 5 centers.
Explore the relationship of clusters and information in sampleInfo. Which of the following best describes what you find:
mds=cmdscale(dist(t(geneExpression)))
set.seed(10)
result=kmeans(t(geneExpression),5)
mypar(1,1)
plot(mds,bg=result$cl,pch=21)
table(sampleInfo$group,result$cluster)##
## 1 2 3 4 5
## 0 4 1 2 1 4
## 1 1 6 3 0 2
table(sampleInfo$date,result$cluster)##
## 1 2 3 4 5
## 2005-06-10 0 0 0 1 0
## 2005-06-23 5 0 0 0 1
## 2005-06-27 0 0 0 0 5
## 2005-10-07 0 5 5 0 0
## 2005-10-28 0 2 0 0 0
##looks better if we re-order:
table(sampleInfo$date,result$cluster)[,c(4,1,5,3,2)]##
## 4 1 5 3 2
## 2005-06-10 1 0 0 0 0
## 2005-06-23 0 5 1 0 0
## 2005-06-27 0 0 5 0 0
## 2005-10-07 0 0 0 5 5
## 2005-10-28 0 0 0 0 2
A:Date is driving the clusters
Heatmaps are ubiquitous in the genomics literature. They are very useful plots for visualizing the measurements for a subset of rows over all the samples. Dendrograms, created with hierarchical clustering, are added on top and on the side. We will demonstrate how to create heatmaps from within R.
library(tissuesGeneExpression)
data(tissuesGeneExpression)
image(e[1:100,])
Let's begin by defining a color palette:
library(RColorBrewer)
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
Now, pick the genes with the top variance over all samples:
library(Biobase)
library(genefilter)
library(parathyroidSE)
rv <- rowVars(e)
idx <- order(-rv)[1:40]
While a heatmap function is included in R, we recommend the heatmap.2 function from the gplots package on CRAN because it is a bit more customized. For example, it stretches to fill the window. Here we add colors to indicate the tissue on the top:
library(gplots) ##Available from CRAN
library(rafalib)
cols <- palette(brewer.pal(8, "Dark2"))[as.fumeric(tissue)]
head(cbind(colnames(e),cols))## cols
## [1,] "GSM11805.CEL.gz" "#1B9E77"
## [2,] "GSM11814.CEL.gz" "#1B9E77"
## [3,] "GSM11823.CEL.gz" "#1B9E77"
## [4,] "GSM11830.CEL.gz" "#1B9E77"
## [5,] "GSM12067.CEL.gz" "#1B9E77"
## [6,] "GSM12075.CEL.gz" "#1B9E77"
heatmap.2(e[idx,], labCol=tissue,
trace="none",
ColSideColors=cols,
col=hmcol)
Heatmap created using the 40 most variable genes and the function heatmap.2.
We did not use tissue information to create this heatmap, and we can quickly see, with just 40 genes, good separation across tissues.
Heat Maps Exercises #1
Load the data:
library(GSE5859Subset)
data(GSE5859Subset)
Pick the 25 genes with the highest across sample variance. This function might help:
#install.packages("matrixStats")
library(matrixStats)## matrixStats v0.51.0 (2016-10-08) successfully loaded. See ?matrixStats for help.
##
## Attaching package: 'matrixStats'
## The following object is masked from 'package:SummarizedExperiment':
##
## rowRanges
## The following objects are masked from 'package:genefilter':
##
## rowSds, rowVars
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
## The following objects are masked from 'package:mosaic':
##
## count, iqr
## The following object is masked from 'package:dplyr':
##
## count
#?rowMads ##we use mads due to an outlier sample
While a heatmap function is included in R, we recommend the heatmap.2 function from the gplots package on CRAN because it is a bit more customized. For example, it stretches to fill the window.
library(gplots)
Use heatmap.2 to make a heatmap showing the sampleInfo$group with color, the date as labels, the rows labelled with chromosome, and scaling the rows.
What do we learn from this heatmap?
##load libraries
library(rafalib)
library(gplots)
library(matrixStats)
library(RColorBrewer)
##make colors
cols = colorRampPalette(rev(brewer.pal(11,"RdBu")))(25)
gcol=brewer.pal(3,"Dark2")
gcol=gcol[sampleInfo$g+1]
##make labels: remove 2005 since it's common to all
labcol= gsub("2005-","",sampleInfo$date)
##pick highly variable genes:
sds =rowMads(geneExpression)
ind = order(sds,decreasing=TRUE)[1:25]
## make heatmap
heatmap.2(geneExpression[ind,],
col=cols,
trace="none",
scale="row",
labRow=geneAnnotation$CHR[ind],
labCol=labcol,
ColSideColors=gcol,
key=FALSE)
A:A group of chrY genes are higher in group 0 and appear to drive the clustering. Within those clusters there appears to be clustering by month.
Heat Maps Exercises #2
Create a large data set of random data that is completely independent of sampleInfo$group like this:
set.seed(17)
m = nrow(geneExpression)
n = ncol(geneExpression)
x = matrix(rnorm(m*n),m,n)
g = factor(sampleInfo$g )
Create two heatmaps with these data. Show the group g either with labels or colors.
Which of the following statements is true:
library(gplots)
library(matrixStats)
library(genefilter)
library(RColorBrewer)
cols = colorRampPalette(rev(brewer.pal(11,"RdBu")))(25)
ttest = rowttests(x,g)
sds = rowSds(x)
Indexes = list(t=order(ttest$p.value)[1:50], s=order(-sds)[1:50])
for(ind in Indexes){
heatmap.2(x[ind,],
col=cols,
trace="none",
scale="row",
labCol=g,
key=FALSE)
}
A:There is no relationship between g and x but with 8,793 tests some will appear significant by chance. Selecting genes with the t-test gives us a deceiving result.
Throughout this assessment it will be useful to remember that when our data are 0s and 1s, probabilities and expectations are the same thing. We can do the math, but here is some R code:
n = 1000
y = rbinom(n,1,0.25)
##proportion of ones Pr(Y)
sum(y==1)/length(y)## [1] 0.26
##expectation of Y
mean(y)## [1] 0.26
Prediction problems can be divided into categorical and continuous outcomes. However, many of the algorithms can be applied to both due to the connection between conditional probabilities and conditional expectations.
For categorical data, for example binary outcomes, if we know the probability of \(Y\) being any of the possible outcomes \(k\) given a set of predictors \(X=(X_1,\dots,X_p)^\top\),
\[ f_k(x) = \mbox{Pr}(Y=k \mid X=x) \]
we can optimize our predictions. Specifically, for any \(x\) we predict the \(k\) that has the largest probability \(f_k(x)\).
To simplify the exposition below, we will consider the case of binary data. You can think of the probability \(\mbox{Pr}(Y=1 \mid X=x)\) as the proportion of 1s in the stratum of the population for which \(X=x\). Given that the expectation is the average of all \(Y\) values, in this case the expectation is equivalent to the probability: \(f(x) \equiv \mbox{E}(Y \mid X=x)=\mbox{Pr}(Y=1 \mid X=x)\). We therefore use only the expectation in the descriptions below as it is more general.
In general, the expected value has an attractive mathematical property: it minimizes the expected distance between the predictor \(\hat{Y}\) and \(Y\):
\[ \mbox{E}\{ (\hat{Y} - Y)^2 \mid X=x \} \]
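To see this property concretely, here is a minimal simulation (a sketch with made-up numbers, not part of the analyses in this book) showing that, among constant predictions, the value closest to the sample mean minimizes the average squared distance to \(Y\):
set.seed(1)
Y <- rnorm(100, mean=5, sd=2) ##simulated outcomes
candidates <- seq(3, 7, len=101) ##candidate constant predictions
mse <- sapply(candidates, function(a) mean((a - Y)^2))
candidates[which.min(mse)] ##essentially equal to mean(Y)
mean(Y)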
We use the son and father height example to illustrate how regression can be interpreted as a machine learning technique. In our example, we are trying to predict the son’s height \(Y\) based on the father’s \(X\). Here we have only one predictor. Now if we were asked to predict the height of a randomly selected son, we would go with the average height:
library(rafalib)
mypar(1,1)
data(father.son,package="UsingR")
x=round(father.son$fheight) ##round to nearest inch
y=round(father.son$sheight)
hist(y,breaks=seq(min(y),max(y)))
abline(v=mean(y),col="red",lwd=2)Histogram of son heights.
In this case, we can also approximate the distribution of \(Y\) as normal, which implies the mean maximizes the probability density.
Let’s imagine that we are given more information. We are told that the father of this randomly selected son has a height of 71 inches (1.25 SDs taller than the average). What is our prediction now?
mypar(1,2)
plot(x,y,xlab="Father's height in inches",ylab="Son's height in inches",main=paste("correlation =",signif(cor(x,y),2)))
abline(v=c(-0.35,0.35)+71,col="red")
hist(y[x==71],xlab="Heights",nc=8,main="",xlim=range(y))Son versus father height (left) with the red lines denoting the stratum defined by conditioning on fathers being 71 inches tall. Conditional distribution: son height distribution of stratum defined by 71 inch fathers.
The best guess is still the expectation, but our stratum has changed from all the data to only the \(Y\) with \(X=71\). So we can stratify and take the average, which is the conditional expectation. Our prediction for any \(x\) is therefore:
\[ f(x) = E(Y \mid X=x) \]
It turns out that because these data are well approximated by a bivariate normal distribution, we can use calculus to show that:
\[ f(x) = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X} (X-\mu_X) \]
and if we estimate these five parameters from the sample, we get the regression line:
mypar(1,2)
plot(x,y,xlab="Father's height in inches",ylab="Son's height in inches",main=paste("correlation =",signif(cor(x,y),2)))
abline(v=c(-0.35,0.35)+71,col="red")
fit <- lm(y~x)
abline(fit,col=1)
hist(y[x==71],xlab="Heights",nc=8,main="",xlim=range(y))
abline(v = fit$coef[1] + fit$coef[2]*71, col=1)Son versus father height showing predicted heights based on regression line (left). Conditional distribution with vertical line representing regression prediction.
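As a quick check (a sketch, not part of the original code), we can plug the five sample estimates into the formula above and compare with the prediction from the fitted regression line for a 71-inch father; the two agree (up to rounding) because the least squares slope equals \(\rho \, \sigma_Y/\sigma_X\):
##plug-in version of f(x) using the five sample estimates
mu_x <- mean(x); mu_y <- mean(y)
s_x <- sd(x); s_y <- sd(y)
rho <- cor(x,y)
mu_y + rho*(s_y/s_x)*(71 - mu_x)
##same prediction obtained from the regression fit above
fit$coef[1] + fit$coef[2]*71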
In this particular case, the regression line provides an optimal prediction function for \(Y\). But this is not generally true because, in typical machine learning problems, the optimal \(f(x)\) is rarely a simple line.
CONDITIONAL EXPECTATIONS
Throughout this assessment it will be useful to remember that when our data are 0s and 1s, probabilities and expectations are the same thing. We can do the math, but here is an example in the form of R code:
n = 1000
y = rbinom(n,1,0.25)
##proportion of ones Pr(Y)
sum(y==1)/length(y)## [1] 0.247
##expectation of Y
mean(y)## [1] 0.247
Conditional Expectation Exercises #1
Generate some random data to imitate heights for men (0) and women (1):
n = 10000
set.seed(1)
men = rnorm(n,176,7) #height in centimeters
women = rnorm(n,162,7) #height in centimeters
y = c(rep(0,n),rep(1,n))
x = round(c(men,women))
##mix it up
ind = sample(seq(along=y))
y = y[ind]
x = x[ind]
mean(y[x==176])## [1] 0.1049475
A:0.1049475
Conditional Expectation Exercises #2
xs = seq(160,178)
pr =sapply(xs,function(x0) mean(y[x==x0]))
plot(xs,pr)
abline(h=0.5)
abline(v=168) A:168
Here we give a brief introduction to the main task of machine learning: class prediction. In fact, many refer to class prediction as machine learning and we sometimes use the two terms interchangeably. We give a very brief introduction to this vast topic, focusing on some specific examples.
Some of the examples we give here are motivated by those in the excellent textbook The Elements of Statistical Learning: Data Mining, Inference, and Prediction, by Trevor Hastie, Robert Tibshirani and Jerome Friedman. A free PDF of this book can be found at the following URL:
http://statweb.stanford.edu/~tibs/ElemStatLearn/
Similar to inference in the context of regression, Machine Learning (ML) studies the relationships between outcomes \(Y\) and covariates \(X\). In ML, we call \(X\) the predictors or features. The main difference between ML and inference is that, in ML, we are interested mainly in predicting \(Y\) using \(X\). Statistical models are used, but while in inference we estimate and interpret model parameters, in ML they are mainly a means to an end: predicting \(Y\).
Here we introduce the main concepts needed to understand ML, along with two specific algorithms: regression and k nearest neighbors (kNN). Keep in mind that there are dozens of popular algorithms that we do not cover here.
In a previous section, we covered the very simple one-predictor case. However, most of ML is concerned with cases with more than one predictor. For illustration purposes, we move to a case in which \(X\) is two dimensional and \(Y\) is binary. We simulate a situation with a non-linear relationship using an example from the Hastie, Tibshirani and Friedman book. In the plot below, we show the actual values of \(f(x_1,x_2)=E(Y \mid X_1=x_1,X_2=x_2)\) using colors. The following code is used to create a relatively complex conditional probability function. We create the test and train data we use later (code not shown). Here is the plot of \(f(x_1,x_2)\) with red representing values close to 1, blue representing values close to 0, and yellow values in between.
library(rafalib)
library(RColorBrewer)
hmcol <- colorRampPalette(rev(brewer.pal(11, "Spectral")))(100)
mycols=c(hmcol[1],hmcol[100])
set.seed(1)
##create covariates and outcomes
##outcomes are always an equal number of 0s and 1s
s2=0.15
##pick means to create a non linear conditional expectation
library(MASS)
M0 <- mvrnorm(10,c(1,0),s2*diag(2)) ##generate 10 means
M1 <- rbind(mvrnorm(3,c(1,1),s2*diag(2)),
mvrnorm(3,c(0,1),s2*diag(2)),
mvrnorm(4,c(0,0),s2*diag(2)))
###function to generate random pairs
s<- sqrt(1/5)
N=200
makeX <- function(M,n=N,sigma=s*diag(2)){
z <- sample(1:10,n,replace=TRUE) ##pick n at random from above 10
m <- M[z,] ##these are the n vectors (2 components)
return(t(apply(m,1,function(mu) mvrnorm(1,mu,sigma)))) ##the final values
}
###create the training set and the test set
x0 <- makeX(M0)##the final values for y=0 (green)
testx0 <- makeX(M0)
x1 <- makeX(M1)
testx1 <-makeX(M1)
x <- rbind(x0,x1) ##one matrix with everything
test <- rbind(testx0,testx1)
y <- c(rep(0,N),rep(1,N)) #the outcomes
ytest <- c(rep(0,N),rep(1,N))
cols <- mycols[c(rep(1,N),rep(2,N))]
colstest <- cols
##Create a grid so we can predict all of X,Y
GS <- 150 ##grid size is GS x GS
XLIM <- c(min(c(x[,1],test[,1])),max(c(x[,1],test[,1])))
tmpx <- seq(XLIM[1],XLIM[2],len=GS)
YLIM <- c(min(c(x[,2],test[,2])),max(c(x[,2],test[,2])))
tmpy <- seq(YLIM[1],YLIM[2],len=GS)
newx <- expand.grid(tmpx,tmpy) #grid used to show color contour of predictions
###Bayes rule: best possible answer
p <- function(x){ ##probability of Y given X
p0 <- mean(dnorm(x[1],M0[,1],s)*dnorm(x[2],M0[,2],s))
p1 <- mean(dnorm(x[1],M1[,1],s)*dnorm(x[2],M1[,2],s))
p1/(p0+p1)
}
###Create the bayesrule prediction
bayesrule <- apply(newx,1,p)
colshat <- bayesrule
colshat <- hmcol[floor(bayesrule*100)+1]
mypar()
plot(x,type="n",xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)
points(newx,col=colshat,pch=16,cex=0.35)Probability of Y=1 as a function of X1 and X2. Red is close to 1, yellow close to 0.5, and blue close to 0.
If we show points for which \(E(Y \mid X=x)>0.5\) in red and the rest in blue, we see the boundary at which we switch from predicting 0 to predicting 1.
mypar()
colshat[bayesrule>=0.5] <- mycols[2]
colshat[bayesrule<0.5] <- mycols[1]
plot(x,type="n",xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)
points(newx,col=colshat,pch=16,cex=0.35)
contour(tmpx,tmpy,matrix(round(bayesrule),GS,GS),levels=c(1,2),
add=TRUE,drawlabels=FALSE)Bayes rule. The line divides the part of the space for which the probability is larger than 0.5 (red) from the part where it is lower than 0.5 (blue).
The above plots relate to the “truth” that we do not get to see. Most ML methodology is concerned with estimating \(f(x)\). A typical first step is usually to consider a sample, referred to as the training set, to estimate \(f(x)\). We will review two specific ML techniques. First, we need to review the main concept we use to evaluate the performance of these methods.
In the code (not shown) for the first plot in this chapter, we created a test and a training set. We plot them here:
#x, test, cols, and coltest were created in code that was not shown
#x is training x1 and x2, test is test x1 and x2
#cols (0=blue, 1=red) are training observations
#coltests are test observations
mypar(1,2)
plot(x,pch=21,bg=cols,xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)
plot(test,pch=21,bg=colstest,xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)Training data (left) and test data (right)
You will notice that the test and train sets have similar global properties, since they were generated by the same random variables (more blue towards the bottom right), but are, by construction, different. The reason we create test and training sets is to detect over-training by testing on different data than those used to fit models or train algorithms. We will see how important this is below.
A first naive approach to this ML problem is to fit a two variable linear regression model:
##x and y were created in the code (not shown) for the first plot
#y is outcome for the training set
X1 <- x[,1] ##these are the covariates
X2 <- x[,2]
fit1 <- lm(y~X1+X2)Once we have the fitted values, we can estimate \(f(x_1,x_2)\) with \(\hat{f}(x_1,x_2)=\hat{\beta}_0 + \hat{\beta}_1x_1 +\hat{\beta}_2 x_2\). To provide an actual prediction, we simply predict 1 when \(\hat{f}(x_1,x_2)>0.5\). We now examine the error rates in the test and training sets and also plot the boundary region:
##prediction on train
yhat <- predict(fit1)
yhat <- as.numeric(yhat>0.5)
cat("Linear regression prediction error in train:",1-mean(yhat==y),"\n")## Linear regression prediction error in train: 0.295
We can quickly obtain predicted values for any set of values using the predict function:
yhat <- predict(fit1,newdata=data.frame(X1=newx[,1],X2=newx[,2]))Now we can create a plot showing where we predict 1s and where we predict 0s, as well as the boundary. We can also use the predict function to obtain predicted values for our test set. Note that nowhere do we fit the model on the test set:
colshat <- yhat
colshat[yhat>=0.5] <- mycols[2]
colshat[yhat<0.5] <- mycols[1]
m <- -fit1$coef[2]/fit1$coef[3] #boundary slope
b <- (0.5 - fit1$coef[1])/fit1$coef[3] #boundary intercept
##prediction on test
yhat <- predict(fit1,newdata=data.frame(X1=test[,1],X2=test[,2]))
yhat <- as.numeric(yhat>0.5)
cat("Linear regression prediction error in test:",1-mean(yhat==ytest),"\n")## Linear regression prediction error in test: 0.3075
plot(test,type="n",xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)
abline(b,m)
points(newx,col=colshat,pch=16,cex=0.35)
##test was created in the code (not shown) for the first plot
points(test,bg=cols,pch=21)We estimate the probability of 1 with a linear regression model with X1 and X2 as predictors. The resulting prediction map is divided into parts that are larger than 0.5 (red) and lower than 0.5 (blue).
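The slope m and intercept b used for the boundary line above come from solving the fitted plane for the 0.5 cutoff (a one-line derivation, not in the original text):
\[ \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = 0.5 \iff x_2 = \frac{0.5 - \hat{\beta}_0}{\hat{\beta}_2} - \frac{\hat{\beta}_1}{\hat{\beta}_2} x_1 \]
so the intercept passed to abline is \((0.5-\hat{\beta}_0)/\hat{\beta}_2\) and the slope is \(-\hat{\beta}_1/\hat{\beta}_2\), matching the code above.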
The error rates in the test and train sets are quite similar. Thus, we do not seem to be over-training. This is not surprising as we are fitting a model with just two predictors to 400 data points. However, note that the boundary is a line. Because we are fitting a plane to the data, there is no other option here. The linear regression method is too rigid. The rigidity makes it stable and avoids over-training, but it also keeps the model from adapting to the non-linear relationship between \(Y\) and \(X\). We saw this before in the smoothing section. The next ML technique we consider is similar to the smoothing techniques described before.
K-nearest neighbors (kNN) is similar to bin smoothing, but it is easier to adapt to multiple dimensions. Basically, for any point \(x\) for which we want an estimate, we look for the k nearest points and then take an average of these points. This gives us an estimate of \(f(x_1,x_2)\), just like the bin smoother gave us an estimate of a curve. We can now control flexibility through \(k\). Here we compare \(k=1\) and \(k=100\).
library(class)
mypar(2,2)
for(k in c(1,100)){
##predict on train
yhat <- knn(x,x,y,k=k)
cat("KNN prediction error in train:",1-mean((as.numeric(yhat)-1)==y),"\n")
##make plot
yhat <- knn(x,test,y,k=k)
cat("KNN prediction error in test:",1-mean((as.numeric(yhat)-1)==ytest),"\n")
}## KNN prediction error in train: 0
## KNN prediction error in test: 0.375
## KNN prediction error in train: 0.2425
## KNN prediction error in test: 0.2825
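To make the averaging idea concrete, here is a small sketch (not part of the original code) of what kNN computes for a single query point: find the \(k\) training points closest in Euclidean distance and average their outcomes. With \(k=100\), classifying by majority vote is essentially equivalent to predicting 1 when this average exceeds 0.5.
##manual kNN estimate of f(x1,x2) at one query point
query <- c(0.5, 0.5) ##an arbitrary point in the predictor space
d <- sqrt(colSums((t(x) - query)^2)) ##Euclidean distance to each training point
k <- 100
mean(y[order(d)[1:k]]) ##proportion of 1s among the k nearest neighbors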
To visualize why we make no errors in the train set and many errors in the test set when \(k=1\) and obtain more stable results from \(k=100\), we show the prediction regions (code not shown):
library(class)
mypar(2,2)
for(k in c(1,100)){
##predict on train
yhat <- knn(x,x,y,k=k)
##make plot
yhat <- knn(x,newx,y,k=k)
colshat <- mycols[as.numeric(yhat)]
plot(x,type="n",xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)
points(newx,col=colshat,cex=0.35,pch=16)
contour(tmpx,tmpy,matrix(as.numeric(yhat),GS,GS),levels=c(1,2),
add=TRUE,drawlabels=FALSE)
points(x,bg=cols,pch=21)
title(paste("Train: KNN (",k,")",sep=""))
plot(test,type="n",xlab="X1",ylab="X2",xlim=XLIM,ylim=YLIM)
points(newx,col=colshat,cex=0.35,pch=16)
contour(tmpx,tmpy,matrix(as.numeric(yhat),GS,GS),levels=c(1,2),
add=TRUE,drawlabels=FALSE)
points(test,bg=cols,pch=21)
title(paste("Test: KNN (",k,")",sep=""))
}Prediction regions obtained with kNN for k=1 (top) and k=100 (bottom). We show both train (left) and test data (right).
When \(k=1\), we make no mistakes in the training set since every point is its own closest neighbor and is equal to itself. However, we see some islands of blue in the red area that, once we move to the test set, are more error prone. In the case \(k=100\), we do not have this problem and we also see that we improve the error rate over linear regression. We can also see that our estimate of \(f(x_1,x_2)\) is closer to the truth.
Here we include a comparison of the test and train set errors for various values of \(k\). We also include the error rate that we would make if we actually knew \(\mbox{E}(Y \mid X_1=x_1,X_2=x_2)\), referred to as Bayes Rule.
We start by computing the error rates…
###Bayes Rule
yhat <- apply(test,1,p)
cat("Bayes rule prediction error in train",1-mean(round(yhat)==y),"\n")## Bayes rule prediction error in train 0.2775
bayes.error=1-mean(round(yhat)==y)
train.error <- rep(0,16)
test.error <- rep(0,16)
for(k in seq(along=train.error)){
##predict on train
yhat <- knn(x,x,y,k=2^(k/2))
train.error[k] <- 1-mean((as.numeric(yhat)-1)==y)
##prediction on test
yhat <- knn(x,test,y,k=2^(k/2))
test.error[k] <- 1-mean((as.numeric(yhat)-1)==y)
}… and then plot the error rates against values of \(k\). We also show the Bayes rule error rate as a horizontal line.
ks <- 2^(seq(along=train.error)/2)
mypar()
plot(ks,train.error,type="n",xlab="K",ylab="Prediction Error",log="x",ylim=range(c(test.error,train.error)))
lines(ks,train.error,type="b",col=4,lty=2,lwd=2)
lines(ks,test.error,type="b",col=5,lty=3,lwd=2)
abline(h=bayes.error,col=6)
legend("bottomright",c("Train","Test","Bayes"),col=c(4,5,6),lty=c(2,3,1),box.lwd=0)Prediction error in train (pink) and test (green) versus number of neighbors. The yellow line represents what one obtains with Bayes Rule.
Note that these error rates are random variables and have standard errors. In the next section we describe cross-validation which helps reduce some of this variability. However, even with this variability, the plot clearly shows the problem of over-fitting when using values lower than 20 and under-fitting with values above 100.
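As a rough illustration of that variability (a sketch using the simulation functions defined earlier; the seed and number of replicates are arbitrary choices), we can regenerate the test set several times and recompute the \(k=100\) test error:
##rough Monte Carlo check of the variability of the k=100 test error
set.seed(2)
errs <- replicate(10, {
newtest <- rbind(makeX(M0), makeX(M1)) ##a fresh test set from the same mechanism
yhat <- knn(x, newtest, y, k=100)
1 - mean((as.numeric(yhat)-1) == ytest)
})
mean(errs)
sd(errs)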
Smoothing is a very powerful technique used all across data analysis. It is designed to estimate \(f(x)\) when the shape is unknown, but assumed to be smooth. The general idea is to group data points that are expected to have similar expectations and compute the average, or fit a simple parametric model. We illustrate two smoothing techniques using a gene expression example.
The following data are gene expression measurements from replicated RNA samples.
##Following three packages are available from Bioconductor
library(Biobase)
library(SpikeIn)
library(hgu95acdf)## Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail'
## when loading 'hgu95acdf'
## Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head'
## when loading 'hgu95acdf'
data(SpikeIn95)We consider the data used in an MA-plot comparing two replicated samples (\(Y\) = log ratios and \(X\) = averages) and down-sample in a way that balances the number of points for different strata of \(X\) (code not shown):
library(rafalib)
mypar()
plot(X,Y)MAplot comparing gene expression from two arrays.
In the MA plot we see that \(Y\) depends on \(X\). This dependence must be a bias because these are based on replicates, which means \(Y\) should be 0 on average regardless of \(X\). We want to predict \(f(x)=\mbox{E}(Y \mid X=x)\) so that we can remove this bias. Linear regression does not capture the apparent curvature in \(f(x)\):
mypar()
plot(X,Y)
fit <- lm(Y~X)
points(X,Y,pch=21,bg=ifelse(Y>fit$fitted,1,3))
abline(fit,col=2,lwd=4,lty=2)MAplot comparing gene expression from two arrays with fitted regression line. The two colors represent positive and negative residuals.
The points above the fitted line and those below it are not evenly distributed. We therefore need an alternative, more flexible approach.
Instead of fitting a line, let’s go back to the idea of stratifying and computing the mean. This is referred to as bin smoothing. The general idea is that the underlying curve is “smooth” enough so that, in small bins, the curve is approximately constant. If we assume the curve is constant, then all the \(Y\) in that bin have the same expected value. For example, in the plot below, we highlight points in a bin centered at 8.6, as well as the points of a bin centered at 12.1, if we use bins of size 1. We also show the fitted mean values for the \(Y\) in those bins with dashed lines (code not shown):
MAplot comparing gene expression from two arrays with bin smoother fit shown for two points.
By computing this mean for bins around every point, we form an estimate of the underlying curve \(f(x)\). Below we show the procedure happening as we move from the smallest value of \(x\) to the largest. We show 10 intermediate cases as well (code not shown):
Illustration of how bin smoothing estimates a curve, showing 12 steps of the process.
The final result looks like this (code not shown):
MA-plot with curve obtained with bin smooth shown.
There are several functions in R that implement bin smoothers. One example is ksmooth. However, in practice, we typically prefer methods that use slightly more complicated models than fitting a constant. The final result above, for example, is somewhat wiggly. Methods such as loess, which we explain next, improve on this.
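For example, a minimal ksmooth call on these data might look like this (the "box" kernel averages \(Y\) in a window around each \(x\); the bandwidth of 1 is an arbitrary illustrative choice):
mypar()
plot(X,Y,col="darkgrey",pch=16)
binfit <- ksmooth(X, Y, kernel="box", bandwidth=1)
lines(binfit$x, binfit$y, col="black", lwd=3)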
Local weighted regression (loess) is similar to bin smoothing in principle. The main difference is that we approximate the local behavior with a line or a parabola. This permits us to expand the bin sizes, which stabilizes the estimates. Below we see lines fitted to two bins that are slightly larger than those we used for the bin smoother (code not shown). We can use larger bins because fitting lines provide slightly more flexibility.
MAplot comparing gene expression from two arrays with local regression fit shown for two points.
As we did for the bin smoother, we show 12 steps of the process that leads to a loess fit (code not shown):
Illustration of how loess estimates a curve, showing 12 steps of the process.
The final result is a smoother fit than the bin smoother since we use larger sample sizes to estimate our local parameters (code not shown):
MA-plot with curve obtained by loess.
The function loess performs this analysis for us:
fit <- loess(Y~X, degree=1, span=1/3)
newx <- seq(min(X),max(X),len=100)
smooth <- predict(fit,newdata=data.frame(X=newx))
mypar()
plot(X,Y,col="darkgrey",pch=16)
lines(newx,smooth,col="black",lwd=3)Loess fitted with the loess function.
There are three other important differences between loess and the typical bin smoother. The first is that rather than keeping the bin size the same, loess keeps the number of points used in the local fit the same. This number is controlled via the span argument which expects a proportion. For example, if N is the number of data points and span=0.5, then for a given \(x\) , loess will use the 0.5*N closest points to \(x\) for the fit. The second difference is that, when fitting the parametric model to obtain \(f(x)\), loess uses weighted least squares, with higher weights for points that are closer to \(x\). The third difference is that loess has the option of fitting the local model robustly. An iterative algorithm is implemented in which, after fitting a model in one iteration, outliers are detected and downweighted for the next iteration. To use this option, we use the argument family="symmetric".
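To illustrate the robust option (a sketch; the span and colors here are arbitrary choices, not from the original analysis), we can compare the default least squares loess fit with one fitted using family="symmetric":
fit.ls <- loess(Y~X, degree=1, span=1/3)
fit.rob <- loess(Y~X, degree=1, span=1/3, family="symmetric")
newx <- seq(min(X),max(X),len=100)
mypar()
plot(X,Y,col="darkgrey",pch=16)
lines(newx, predict(fit.ls, newdata=data.frame(X=newx)), col="black", lwd=3)
lines(newx, predict(fit.rob, newdata=data.frame(X=newx)), col="red", lwd=3, lty=2)
legend("topright", c("least squares","family=symmetric"), col=c("black","red"), lty=1:2, lwd=3)
With these data the two fits are typically very similar; the robust option matters more when there are extreme outliers.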
Smoothing Exercises #1
n = 10000
set.seed(1)
men = rnorm(n,176,7) #height in centimeters
women = rnorm(n,162,7) #height in centimeters
y = c(rep(0,n),rep(1,n))
x = round(c(men,women))
##mix it up
ind = sample(seq(along=y))
y = y[ind]
x = x[ind]Set the seed at 5, set.seed(5) and take a random sample of 250 individuals from the population like this:
set.seed(5)
N = 250
ind = sample(length(y),N)
Y = y[ind]
X = x[ind]
fit=loess(Y~X)
predict(fit,newdata=data.frame(X=168))## 1
## 0.5480233
##Here is a plot
xs = seq(160,178)
Pr =sapply(xs,function(x0) mean(Y[X==x0]))
plot(xs,Pr)
fitted=predict(fit,newdata=data.frame(X=xs))
lines(xs,fitted) A:0.5480233
Smoothing Exercises #2
The loess estimate above is a random variable, so we should compute its standard error. Use Monte Carlo simulation to compute the standard error of your estimate of f(168).
Set the seed to 5, set.seed(5), and perform 1000 simulations of the computations performed in Smoothing Exercises #1. Report the SE of the loess-based estimate.
##plot plots are optional
set.seed(5)
B = 1000
N = 250
xs = seq(160,178)
plot(xs,xs,ylim=c(0,1),type="l")
res = replicate(B,{
ind = sample(length(y),N)
Y = y[ind]
X = x[ind]
fit=loess(Y~X)
##optional plots
fitted=predict(fit,newdata=data.frame(X=xs))
lines(xs,fitted)
estimate = predict(fit,newdata=data.frame(X=168))
return(estimate)
})library(rafalib)
popsd(res)## [1] 0.05755689
A:0.05755689
Here we describe cross-validation: one of the fundamental methods in machine learning for method assessment and picking parameters in a prediction or machine learning task. Suppose we have a set of observations with many features and each observation is associated with a label. We will call this set our training data. Our task is to predict the label of any new samples by learning patterns from the training data. For a concrete example, let’s consider gene expression values, where each gene acts as a feature. We will be given a new set of unlabeled data (the test data) with the task of predicting the tissue type of the new samples.
If we choose a machine learning algorithm with a tunable parameter, we have to come up with a strategy for picking an optimal value for this parameter. We could try some values, and then just choose the one which performs the best on our training data, in terms of the number of errors the algorithm would make if we apply it to the samples we have been given for training. However, we have seen how this leads to over-fitting.
Let’s start by loading the tissue gene expression dataset:
library(tissuesGeneExpression)
data(tissuesGeneExpression)For illustration purposes, we will drop one of the tissues which doesn’t have many samples:
table(tissue)## tissue
## cerebellum colon endometrium hippocampus kidney liver
## 38 34 15 31 39 26
## placenta
## 6
ind <- which(tissue != "placenta")
y <- tissue[ind]
X <- t( e[,ind] )
dim(X)## [1] 183 22215
length(y)## [1] 183
This tissue will not form part of our example.
Now let’s try out k-nearest neighbors for classification, using \(k=5\). What is our average error in predicting the tissue in the training set, when we’ve used the same data for training and for testing?
library(class)
pred <- knn(train = X, test = X, cl=y, k=5)
mean(y != pred)## [1] 0
We have no errors in prediction in the training set with \(k=5\). What if we use \(k=1\)?
pred <- knn(train=X, test=X, cl=y, k=1)
mean(y != pred)## [1] 0
Trying to classify the same observations as we use to train the model can be very misleading. In fact, for k-nearest neighbors, using k=1 will always give 0 classification error in the training set, because we use the single observation to classify itself. The reliable way to get a sense of the performance of an algorithm is to make it give a prediction for a sample it has never seen. Similarly, if we want to know what the best value for a tunable parameter is, we need to see how different values of the parameter perform on samples that are not in the training data.
Cross-validation is a widely-used method in machine learning, which solves this training and test data problem, while still using all the data for testing the predictive accuracy. It accomplishes this by splitting the data into a number of folds. If we have \(N\) folds, then the first step of the algorithm is to train the algorithm using \((N-1)\) of the folds, and test the algorithm’s accuracy on the single left-out fold. This is then repeated \(N\) times until each fold has been used as the test set. If we have \(M\) parameter settings to try out, then this is accomplished in an outer loop, so we have to fit the algorithm a total of \(N \times M\) times. For example, with the \(N=10\) folds and \(M=12\) candidate values of k used below, the algorithm is fit \(10 \times 12 = 120\) times.
We will use the createFolds function from the caret package to make folds of our gene expression data, which are balanced over the tissues. Don’t be confused by the fact that the createFolds function uses the same letter ‘k’ as the ‘k’ in k-nearest neighbors. These ‘k’ are totally unrelated. The caret function createFolds is asking for how many folds to create, the \(N\) from above. The ‘k’ in the knn function is for how many closest observations to use in classifying a new sample. Here we will create 10 folds:
library(caret)
set.seed(1)
idx <- createFolds(y, k=10)
sapply(idx, length)## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
## 18 19 17 17 18 20 19 19 20 16
The folds are returned as a list of numeric indices. The first fold of data is therefore:
y[idx[[1]]] ##the labels## [1] "kidney" "kidney" "hippocampus" "hippocampus" "hippocampus"
## [6] "cerebellum" "cerebellum" "cerebellum" "colon" "colon"
## [11] "colon" "colon" "kidney" "kidney" "endometrium"
## [16] "endometrium" "liver" "liver"
head( X[idx[[1]], 1:3] ) ##the genes (only showing the first 3 genes...)## 1007_s_at 1053_at 117_at
## GSM12075.CEL.gz 9.966782 6.060069 7.644452
## GSM12098.CEL.gz 9.945652 5.927861 7.847192
## GSM21214.cel.gz 10.955428 5.776781 7.493743
## GSM21218.cel.gz 10.757734 5.984170 8.525524
## GSM21230.cel.gz 11.496114 5.760156 7.787561
## GSM87086.cel.gz 9.798633 5.862426 7.279199
We can see that, in fact, the tissues are fairly equally represented across the 10 folds:
sapply(idx, function(i) table(y[i]))## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09
## cerebellum 3 4 4 4 4 4 4 4 4
## colon 4 3 3 3 4 4 3 3 4
## endometrium 2 2 1 1 1 2 1 2 2
## hippocampus 3 3 3 3 3 3 4 3 3
## kidney 4 4 3 4 4 4 4 4 4
## liver 2 3 3 2 2 3 3 3 3
## Fold10
## cerebellum 3
## colon 3
## endometrium 1
## hippocampus 3
## kidney 4
## liver 2
Because tissues have very different gene expression profiles, predicting tissue with all genes will be very easy. For illustration purposes we will try to predict tissue type with just two dimensional data. We will reduce the dimension of our data using cmdscale:
library(rafalib)
mypar()
Xsmall <- cmdscale(dist(X))
plot(Xsmall,col=as.fumeric(y))
legend("topleft",levels(factor(y)),fill=seq_along(levels(factor(y))))First two PCs of the tissue gene expression data with color representing tissue. We use these two PCs as our two predictors throughout.
Now we can try out the k-nearest neighbors method on a single fold. We provide the knn function with all the samples in Xsmall except those which are in the first fold. We remove these samples using the code -idx[[1]] inside the square brackets. We then use those samples in the test set. The cl argument is for the true classifications or labels (here, tissue) of the training data. We use 5 observations to classify in our k-nearest neighbor algorithm:
pred <- knn(train=Xsmall[ -idx[[1]] , ], test=Xsmall[ idx[[1]], ], cl=y[ -idx[[1]] ], k=5)
table(true=y[ idx[[1]] ], pred)## pred
## true cerebellum colon endometrium hippocampus kidney liver
## cerebellum 2 0 0 1 0 0
## colon 0 4 0 0 0 0
## endometrium 0 0 1 0 1 0
## hippocampus 1 0 0 2 0 0
## kidney 0 0 0 0 4 0
## liver 0 0 0 0 0 2
mean(y[ idx[[1]] ] != pred)## [1] 0.1666667
Now we have some misclassifications. How well do we do for the rest of the folds?
for (i in 1:10) {
pred <- knn(train=Xsmall[ -idx[[i]] , ], test=Xsmall[ idx[[i]], ], cl=y[ -idx[[i]] ], k=5)
print(paste0(i,") error rate: ", round(mean(y[ idx[[i]] ] != pred),3)))
}## [1] "1) error rate: 0.167"
## [1] "2) error rate: 0.105"
## [1] "3) error rate: 0.118"
## [1] "4) error rate: 0.118"
## [1] "5) error rate: 0.278"
## [1] "6) error rate: 0.05"
## [1] "7) error rate: 0.105"
## [1] "8) error rate: 0.211"
## [1] "9) error rate: 0.15"
## [1] "10) error rate: 0.312"
So we can see there is some variation for each fold, with error rates hovering around 0.1-0.3. But is k=5 the best setting for the k parameter? In order to explore the best setting for k, we need to create an outer loop, where we try different values for k, and then calculate the average test set error across all the folds.
We will try out each value of k from 1 to 12. Instead of using two for loops, we will use sapply:
set.seed(1)
ks <- 1:12
res <- sapply(ks, function(k) {
##try out each version of k from 1 to 12
res.k <- sapply(seq_along(idx), function(i) {
##loop over each of the 10 cross-validation folds
##predict the held-out samples using k nearest neighbors
pred <- knn(train=Xsmall[ -idx[[i]], ],
test=Xsmall[ idx[[i]], ],
cl=y[ -idx[[i]] ], k = k)
##the ratio of misclassified samples
mean(y[ idx[[i]] ] != pred)
})
##average over the 10 folds
mean(res.k)
})Now for each value of k, we have an associated test set error rate from the cross-validation procedure.
res## [1] 0.1978212 0.1703423 0.1882933 0.1750989 0.1613291 0.1500791 0.1552670
## [8] 0.1884813 0.1822020 0.1763197 0.1761318 0.1813197
We can then plot the error rate for each value of k, which helps us to see in what region there might be a minimal error rate:
plot(ks, res, type="o",ylab="misclassification error")Misclassification error versus number of neighbors.
Remember, because the training set is a random sample and because our fold-generation procedure involves random number generation, the “best” value of k we pick through this procedure is also a random variable. If we had new training data and if we recreated our folds, we might get a different value for the optimal k.
Finally, to show that gene expression can perfectly predict tissue, we use 5 dimensions instead of 2, which results in perfect prediction:
Xsmall <- cmdscale(dist(X),k=5)
set.seed(1)
ks <- 1:12
res <- sapply(ks, function(k) {
res.k <- sapply(seq_along(idx), function(i) {
pred <- knn(train=Xsmall[ -idx[[i]], ],
test=Xsmall[ idx[[i]], ],
cl=y[ -idx[[i]] ], k = k)
mean(y[ idx[[i]] ] != pred)
})
mean(res.k)
})
plot(ks, res, type="o",ylim=c(0,0.20),ylab="misclassification error")Misclassification error versus number of neighbors when we use five dimensions instead of 2.
Important note: we applied cmdscale to the entire dataset to create a smaller one for illustration purposes. However, in a real machine learning application, this may result in an underestimation of test set error for small sample sizes, where dimension reduction using the unlabeled full dataset gives a boost in performance. A safer choice would have been to transform the data separately for each fold, by calculating a rotation and dimension reduction using the training set only and applying this to the test set.
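Here is a sketch of that safer approach for a single fold, using principal components as the rotation (an illustrative choice; the analysis above used cmdscale on the full dataset): the centering and rotation are computed from the training samples only and then applied to the held-out samples.
i <- 1 ##illustrate with the first fold
trainX <- X[-idx[[i]],]
testX <- X[idx[[i]],]
pc <- prcomp(trainX) ##centering and rotation learned from the training samples only
trainSmall <- pc$x[,1:5] ##training samples in 5 dimensions
testSmall <- sweep(testX, 2, pc$center) %*% pc$rotation[,1:5] ##same transformation applied to test samples
pred <- knn(train=trainSmall, test=testSmall, cl=y[-idx[[i]]], k=5)
mean(y[idx[[i]]] != pred)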
Load the following dataset
library(GSE5859Subset)
data(GSE5859Subset)
#And define the outcome and predictors. To make the problem more difficult we will only consider autosomal genes:
y = factor(sampleInfo$group)
X = t(geneExpression)
out = which(geneAnnotation$CHR%in%c("chrX","chrY"))
X = X[,-out]
#Note, you will also need to load the following:
library(caret)kNN and Cross Validation Exercises #1
Set the seed to 1 set.seed(1) then use the createFolds function in the caret package to create 10 folds of y.
What is the 2nd entry in fold 3?
library(caret)
set.seed(1)
idx = createFolds(y, k=10)
idx[[3]][2]## [1] 15
sapply(idx,function(ind) table(y[ind])) ##make sure every fold has 0s and 1s## Fold01 Fold02 Fold03 Fold04 Fold05 Fold06 Fold07 Fold08 Fold09 Fold10
## 0 1 1 2 2 1 1 1 1 1 1
## 1 1 1 1 1 2 1 1 2 1 1
A:15
kNN and Cross Validation Exercises #2
For the following questions we are going to use kNN. We are going to consider a smaller set of predictors by filtering genes using t-tests. Specifically, we will perform a t-test and select the \(m\) genes with the smallest p-values.
Let \(m=8\) and \(k=5\) and train kNN by leaving out the second fold idx[[2]].
How many mistakes do we make on the test set? Remember it is indispensable that you perform the t-test on the training data.
library(class)
library(genefilter)
m=8
k=5
ind = idx[[2]]
pvals = rowttests(t(X[-ind,]),factor(y[-ind]))$p.val
ind2 = order(pvals)[1:m]
predict=knn(X[-ind,ind2],X[ind,ind2],y[-ind],k=k)
sum(predict!=y[ind])## [1] 1
A:1
kNN and Cross Validation Exercises #3
Now run the code for kNN and Cross Validation Exercises #2 for all 10 folds and keep track of the errors. What is our error rate (number of errors divided by number of predictions)?
library(class)
library(genefilter)
m=8
k=5
result = sapply(idx,function(ind){
pvals = rowttests(t(X[-ind,]),factor(y[-ind]))$p.val
ind2 = order(pvals)[1:m]
predict=knn(X[-ind,ind2],X[ind,ind2],y[-ind],k=k)
sum(predict!=y[ind])
})
sum(result)/length(y)## [1] 0.375
A:0.375
kNN and Cross Validation Exercises #4
Now we are going to select the best values of k and m. Use the expand.grid function to try out the following values:
ms=2^c(1:11)
ks=seq(1,9,2)
params = expand.grid(k=ks,m=ms)Now use apply or a loop to obtain error rates for each of these pairs of parameters. Which pair of parameters minimizes the error rate?
errors = apply(params,1,function(param){
k = param[1]
m = param[2]
result = sapply(idx,function(ind){
pvals = rowttests(t(X[-ind,]),factor(y[-ind]))$p.val
ind2 = order(pvals)[1:m]
predict=knn(X[-ind,ind2],X[ind,ind2],y[-ind],k=k)
sum(predict!=y[ind])
})
sum(result)/length(y)
})
params[which.min(errors),]## k m
## 47 3 1024
##make a plot and confirm it's just one minimum:
errors = matrix(errors,5,11)
library(rafalib)
mypar(1,1)
matplot(ms,t(errors),type="l",log="x")
legend("topright",as.character(ks),lty=seq_along(ks),col=seq_along(ks)) A:k=3
m=1024
kNN and Cross Validation Exercises #5
Repeat kNN and Cross Validation Exercises #4, but now perform the t-test filtering before the cross-validation. Note how this biases the entire result and gives us much lower estimated error rates.
What is the minimum error rate?
pvals = rowttests(t(X),factor(y))$p.val
errors = apply(params,1,function(param){
k = param[1]
m = param[2]
result = sapply(idx,function(ind){
ind2 = order(pvals)[1:m]
predict=knn(X[-ind,ind2],X[ind,ind2],y[-ind],k=k)
sum(predict!=y[ind])
})
sum(result)/length(y)
})
min(errors)## [1] 0.08333333
##make a plot and compare to previous question
errors = matrix(errors,5,11)
library(rafalib)
mypar(1,1)
matplot(ms,t(errors),type="l",log="x")
legend("topright",as.character(ks),lty=seq_along(ks),col=seq_along(ks)) A:0.08333333
kNN and Cross Validation Exercises #6
Repeat the cross-validation we performed in kNN and Cross Validation Exercises #4, but now instead of defining y as sampleInfo$group use:
y = factor(as.numeric(format( sampleInfo$date, "%m")=="06"))What is the minimum error rate now?
errors = apply(params,1,function(param){
k = param[1]
m = param[2]
result = sapply(idx,function(ind){
pvals = rowttests(t(X[-ind,]),factor(y[-ind]))$p.val
ind2 = order(pvals)[1:m]
predict=knn(X[-ind,ind2],X[ind,ind2],y[-ind],k=k)
sum(predict!=y[ind])
})
sum(result)/length(y)
})
min(errors)## [1] 0
##make a plot and confirm it's just one minimum:
errors = matrix(errors,5,11)
library(rafalib)
mypar(1,1)
matplot(ms,t(errors),type="l",log="x")
legend("topright",as.character(ks),lty=seq_along(ks),col=seq_along(ks)) A:0
Note that we achieve a much lower error rate when predicting date than when predicting group. Because group is confounded with date, it is very possible that these predictors have no information about group and that our lower-than-0.5 error rates are due to the confounding with date. We will learn more about this in the batch effects section.
https://github.com/genomicsclass
devtools::session_info()## Session info --------------------------------------------------------------
## setting value
## version R version 3.3.2 (2016-10-31)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_India.1252
## tz Asia/Calcutta
## date 2017-01-08
## Packages ------------------------------------------------------------------
## package * version date
## affy * 1.52.0 2016-10-18
## affyio 1.44.0 2016-10-18
## annotate 1.52.1 2016-12-22
## AnnotationDbi 1.36.0 2016-10-18
## assertthat 0.1 2013-12-06
## backports 1.0.4 2016-10-24
## base64enc * 0.1-3 2015-07-28
## Biobase * 2.34.0 2016-10-18
## BiocGenerics * 0.20.0 2016-10-18
## BiocInstaller 1.24.0 2016-10-18
## bitops 1.0-6 2013-08-17
## car 2.1-4 2016-12-02
## caret * 6.0-73 2016-11-10
## caTools 1.17.1 2014-09-10
## class * 7.3-14 2015-08-30
## codetools 0.2-15 2016-10-05
## colorspace 1.3-2 2016-12-14
## curl * 2.3 2016-11-24
## DataComputing * 0.8.3 2016-03-19
## DBI 0.5-1 2016-09-10
## devtools 1.12.0 2016-06-24
## digest 0.6.11 2017-01-03
## dplyr * 0.5.0 2016-06-24
## evaluate 0.10 2016-10-11
## foreach 1.4.3 2015-10-13
## gdata 2.17.0 2015-07-04
## genefilter * 1.56.0 2016-10-18
## GenomeInfoDb * 1.10.2 2016-12-29
## GenomicRanges * 1.26.2 2017-01-02
## ggdendro 0.1-20 2016-04-27
## ggplot2 * 2.2.1 2016-12-30
## gplots * 3.0.1 2016-03-30
## gridExtra 2.2.1 2016-02-29
## GSE5859Subset * 1.0 2016-05-09
## gtable 0.2.0 2016-02-26
## gtools 3.5.0 2015-05-29
## hgu95acdf * 2.18.0 2017-01-08
## highr 0.6 2016-05-09
## htmltools 0.3.5 2016-03-21
## IRanges * 2.8.1 2016-11-08
## iterators 1.0.8 2015-10-13
## KernSmooth 2.23-15 2015-06-29
## knitr * 1.15.1 2016-11-22
## lattice * 0.20-34 2016-09-06
## lazyeval 0.2.0 2016-06-12
## lme4 1.1-12 2016-04-16
## lubridate * 1.6.0 2016-09-13
## magrittr 1.5 2014-11-22
## manipulate * 1.0.1 2014-12-24
## MASS * 7.3-45 2016-04-21
## Matrix * 1.2-7.1 2016-09-01
## MatrixModels 0.4-1 2015-08-22
## matrixStats * 0.51.0 2016-10-09
## memoise 1.0.0 2016-01-29
## mgcv 1.8-15 2016-09-14
## mime 0.5 2016-07-07
## minqa 1.2.4 2014-10-09
## ModelMetrics 1.1.0 2016-08-26
## mosaic * 0.14.4 2016-07-29
## mosaicData * 0.14.0 2016-06-17
## munsell 0.4.3 2016-02-13
## nlme 3.1-128 2016-05-10
## nloptr 1.0.4 2014-08-04
## nnet 7.3-12 2016-02-02
## parathyroidSE * 1.12.0 2017-01-08
## pbkrtest 0.4-6 2016-01-27
## plyr 1.8.4 2016-06-08
## preprocessCore 1.36.0 2016-10-18
## quantreg 5.29 2016-09-04
## R6 2.2.0 2016-10-05
## rafalib * 1.0.0 2015-08-09
## RColorBrewer * 1.1-2 2014-12-07
## Rcpp 0.12.8 2016-11-17
## RCurl 1.95-4.8 2016-03-01
## reshape2 1.4.2 2016-10-22
## rmarkdown 1.3 2016-12-21
## rprojroot 1.1 2016-10-29
## RSQLite 1.1-1 2016-12-10
## S4Vectors * 0.12.1 2016-12-01
## scales 0.4.1 2016-11-09
## SparseM 1.74 2016-11-10
## SpikeIn * 1.16.0 2017-01-08
## stringi 1.1.2 2016-10-01
## stringr * 1.1.0 2016-08-19
## SummarizedExperiment * 1.4.0 2016-10-18
## survival 2.40-1 2016-10-30
## tibble 1.2 2016-08-26
## tidyr * 0.6.0 2016-08-12
## tissuesGeneExpression * 1.0 2016-05-13
## withr 1.0.2 2016-06-20
## XML 3.98-1.5 2016-11-10
## xtable 1.8-2 2016-02-05
## XVector 0.14.0 2016-10-18
## yaml 2.1.14 2016-11-12
## zlibbioc 1.20.0 2016-10-18
## source
## Bioconductor
## Bioconductor
## Bioconductor
## Bioconductor
## CRAN (R 3.3.0)
## CRAN (R 3.3.1)
## CRAN (R 3.3.0)
## Bioconductor
## Bioconductor
## Bioconductor
## CRAN (R 3.3.0)
## CRAN (R 3.3.2)
## CRAN (R 3.3.2)
## CRAN (R 3.3.0)
## CRAN (R 3.2.2)
## CRAN (R 3.3.2)
## CRAN (R 3.3.2)
## CRAN (R 3.3.2)
## Github (DataComputing/DataComputing@fb983c7)
## CRAN (R 3.3.1)
## CRAN (R 3.3.1)
## CRAN (R 3.3.2)
## CRAN (R 3.3.1)
## CRAN (R 3.3.1)
## CRAN (R 3.3.1)
## CRAN (R 3.2.1)
## Bioconductor
## Bioconductor
## Bioconductor
## CRAN (R 3.2.5)
## CRAN (R 3.3.2)
## CRAN (R 3.2.4)
## CRAN (R 3.2.3)
## Github (genomicsclass/GSE5859Subset@8ada5f4)
## CRAN (R 3.2.3)
## CRAN (R 3.2.0)
## Bioconductor
## CRAN (R 3.3.0)
## CRAN (R 3.2.4)
## Bioconductor
## CRAN (R 3.2.2)
## CRAN (R 3.2.1)
## CRAN (R 3.3.2)
## CRAN (R 3.3.2)
## CRAN (R 3.3.0)
## CRAN (R 3.2.5)
## CRAN (R 3.3.1)
## CRAN (R 3.2.1)
## CRAN (R 3.2.0)
## CRAN (R 3.3.2)
## CRAN (R 3.3.2)
## CRAN (R 3.2.2)
## CRAN (R 3.3.2)
## CRAN (R 3.2.3)
## CRAN (R 3.3.2)
## CRAN (R 3.3.1)
## CRAN (R 3.2.1)
## CRAN (R 3.3.2)
## CRAN (R 3.3.1)
## CRAN (R 3.3.0)
## CRAN (R 3.2.3)
## CRAN (R 3.3.0)
## CRAN (R 3.2.1)
## CRAN (R 3.2.3)
## Bioconductor
## CRAN (R 3.2.3)
## CRAN (R 3.3.0)
## Bioconductor
## CRAN (R 3.3.1)
## CRAN (R 3.3.1)
## CRAN (R 3.2.2)
## CRAN (R 3.2.0)
## CRAN (R 3.3.2)
## CRAN (R 3.3.1)
## CRAN (R 3.3.1)
## CRAN (R 3.3.2)
## CRAN (R 3.3.2)
## CRAN (R 3.3.2)
## Bioconductor
## CRAN (R 3.3.2)
## CRAN (R 3.3.2)
## Bioconductor
## CRAN (R 3.3.1)
## CRAN (R 3.3.1)
## Bioconductor
## CRAN (R 3.3.2)
## CRAN (R 3.3.1)
## CRAN (R 3.3.1)
## Github (genomicsclass/tissuesGeneExpression@a43cf4b)
## CRAN (R 3.3.0)
## CRAN (R 3.3.2)
## CRAN (R 3.2.3)
## Bioconductor
## CRAN (R 3.3.2)
## Bioconductor
#sessionInfo()$otherPkgs