In this article I examine how dimension reduction influences data classification by comparing prediction accuracy and model computing time before and after applying Principal Component Analysis (PCA). The analysis is carried out on colorectal cancer data from the Curated Microarray Database (CuMiDa). The article is divided into two parts. The first part is an exploratory data analysis in which I mostly use PCA to examine the data. The second part fits and evaluates five types of models: CLARA, k-means, hierarchical clustering, SVM and random forest. The first three are unsupervised clustering algorithms, while SVM and random forest are supervised methods that are widely used in many fields and often produce very good results.
Loading libraries
library(data.table)
library(randomForest)
library(factoextra)
library(e1071)
library(flexclust)
library(tidyverse)
library(class)
library(clusterSim)
library(gridExtra)
library(ggplot2)
library(knitr)
Let's read the data and check its dimensionality.
df <- fread('Colorectal_GSE44076.csv')
dim(df)
## [1] 194 49388
As we can see, there are 194 rows and 49388 columns. Having more columns than rows is usually not desirable, but since our goal is to show how PCA affects classification, the more columns there are to reduce, the better. Let's summarise the first six columns and look at a small part of the data.
summary(df[,1:6]) %>% kable()
| samples | type | 11715100_at | 11715101_s_at | 11715102_x_at | 11715103_x_at |
|---|---|---|---|---|---|
| Min. :648.0 | Length:194 | Min. :2.313 | Min. :3.035 | Min. :2.399 | Min. :2.464 |
| 1st Qu.:696.2 | Class :character | 1st Qu.:3.019 | 1st Qu.:3.844 | 1st Qu.:3.096 | 1st Qu.:3.395 |
| Median :745.5 | Mode :character | Median :3.275 | Median :4.094 | Median :3.399 | Median :3.637 |
| Mean :745.3 | NA | Mean :3.402 | Mean :4.177 | Mean :3.479 | Mean :3.673 |
| 3rd Qu.:794.8 | NA | 3rd Qu.:3.601 | 3rd Qu.:4.419 | 3rd Qu.:3.682 | 3rd Qu.:3.846 |
| Max. :843.0 | NA | Max. :6.209 | Max. :6.033 | Max. :5.927 | Max. :5.434 |
As we can see, the first column 'samples' contains a unique identifier for each observation, the 'type' column is the class label, and all other columns are numeric measurements describing each sample. Let's check the class distribution.
table(df$type)
##
## adenocarcinoma normal
## 97 97
Perfect, the data are balanced. There are two classes: adenocarcinoma and normal. Let's check if there are any missing values.
sum(is.na(df))
## [1] 0
There are no missing values. Let's explore the data using PCA. I will use only PCA, because the MDS implementation I tried is not optimized for data this wide; it took too long to finish and my patience ran out.
PCA_1 <- prcomp(df[,3:ncol(df)],center=T,scale.=T)
summary(PCA_1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 91.7704 63.77123 46.81270 38.00408 34.51319 33.08278
## Proportion of Variance 0.1705 0.08235 0.04437 0.02925 0.02412 0.02216
## Cumulative Proportion 0.1705 0.25288 0.29725 0.32650 0.35062 0.37278
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 28.59703 26.66069 24.98316 23.57632 23.07820 22.36457
## Proportion of Variance 0.01656 0.01439 0.01264 0.01126 0.01078 0.01013
## Cumulative Proportion 0.38934 0.40373 0.41637 0.42762 0.43841 0.44853
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 21.48093 21.34849 21.26635 20.27821 20.13703 19.80963
## Proportion of Variance 0.00934 0.00923 0.00916 0.00833 0.00821 0.00795
## Cumulative Proportion 0.45788 0.46711 0.47626 0.48459 0.49280 0.50075
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 19.54459 19.07579 18.82829 18.43063 18.36135 17.87679
## Proportion of Variance 0.00773 0.00737 0.00718 0.00688 0.00683 0.00647
## Cumulative Proportion 0.50848 0.51585 0.52303 0.52991 0.53673 0.54320
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 17.59914 17.39874 17.15669 16.99220 16.83658 16.68297
## Proportion of Variance 0.00627 0.00613 0.00596 0.00585 0.00574 0.00564
## Cumulative Proportion 0.54948 0.55561 0.56157 0.56741 0.57315 0.57879
## PC31 PC32 PC33 PC34 PC35 PC36
## Standard deviation 16.36424 16.30358 16.12459 15.72515 15.54645 15.34963
## Proportion of Variance 0.00542 0.00538 0.00526 0.00501 0.00489 0.00477
## Cumulative Proportion 0.58421 0.58959 0.59486 0.59986 0.60476 0.60953
## PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 15.30302 15.25968 15.10636 14.88174 14.79385 14.65756
## Proportion of Variance 0.00474 0.00472 0.00462 0.00448 0.00443 0.00435
## Cumulative Proportion 0.61427 0.61899 0.62361 0.62809 0.63252 0.63687
## PC43 PC44 PC45 PC46 PC47 PC48
## Standard deviation 14.49396 14.43754 14.31665 14.15628 14.12032 13.90618
## Proportion of Variance 0.00425 0.00422 0.00415 0.00406 0.00404 0.00392
## Cumulative Proportion 0.64113 0.64535 0.64950 0.65356 0.65759 0.66151
## PC49 PC50 PC51 PC52 PC53 PC54
## Standard deviation 13.8780 13.72412 13.66710 13.62382 13.54581 13.5249
## Proportion of Variance 0.0039 0.00381 0.00378 0.00376 0.00372 0.0037
## Cumulative Proportion 0.6654 0.66922 0.67300 0.67676 0.68048 0.6842
## PC55 PC56 PC57 PC58 PC59 PC60
## Standard deviation 13.36582 13.29106 13.26145 13.16918 13.10496 13.03122
## Proportion of Variance 0.00362 0.00358 0.00356 0.00351 0.00348 0.00344
## Cumulative Proportion 0.68780 0.69138 0.69494 0.69845 0.70193 0.70536
## PC61 PC62 PC63 PC64 PC65 PC66
## Standard deviation 13.01426 12.93583 12.85729 12.82107 12.73200 12.61280
## Proportion of Variance 0.00343 0.00339 0.00335 0.00333 0.00328 0.00322
## Cumulative Proportion 0.70879 0.71218 0.71553 0.71886 0.72214 0.72536
## PC67 PC68 PC69 PC70 PC71 PC72
## Standard deviation 12.5781 12.53599 12.47910 12.39299 12.31477 12.27991
## Proportion of Variance 0.0032 0.00318 0.00315 0.00311 0.00307 0.00305
## Cumulative Proportion 0.7286 0.73175 0.73490 0.73801 0.74108 0.74413
## PC73 PC74 PC75 PC76 PC77 PC78
## Standard deviation 12.22577 12.20672 12.18413 12.13424 12.10711 11.99237
## Proportion of Variance 0.00303 0.00302 0.00301 0.00298 0.00297 0.00291
## Cumulative Proportion 0.74716 0.75018 0.75318 0.75617 0.75913 0.76205
## PC79 PC80 PC81 PC82 PC83 PC84
## Standard deviation 11.94448 11.90811 11.86858 11.82807 11.7608 11.72290
## Proportion of Variance 0.00289 0.00287 0.00285 0.00283 0.0028 0.00278
## Cumulative Proportion 0.76494 0.76781 0.77066 0.77349 0.7763 0.77907
## PC85 PC86 PC87 PC88 PC89 PC90
## Standard deviation 11.68228 11.61838 11.56636 11.5534 11.51436 11.48484
## Proportion of Variance 0.00276 0.00273 0.00271 0.0027 0.00268 0.00267
## Cumulative Proportion 0.78184 0.78457 0.78728 0.7900 0.79267 0.79534
## PC91 PC92 PC93 PC94 PC95 PC96
## Standard deviation 11.42446 11.40372 11.38856 11.31748 11.30114 11.27613
## Proportion of Variance 0.00264 0.00263 0.00263 0.00259 0.00259 0.00257
## Cumulative Proportion 0.79798 0.80061 0.80324 0.80583 0.80842 0.81100
## PC97 PC98 PC99 PC100 PC101 PC102
## Standard deviation 11.23166 11.17780 11.15586 11.13081 11.1163 11.08521
## Proportion of Variance 0.00255 0.00253 0.00252 0.00251 0.0025 0.00249
## Cumulative Proportion 0.81355 0.81608 0.81860 0.82111 0.8236 0.82610
## PC103 PC104 PC105 PC106 PC107 PC108
## Standard deviation 11.04388 10.98223 10.94770 10.93449 10.8920 10.79373
## Proportion of Variance 0.00247 0.00244 0.00243 0.00242 0.0024 0.00236
## Cumulative Proportion 0.82857 0.83101 0.83344 0.83586 0.8383 0.84062
## PC109 PC110 PC111 PC112 PC113 PC114
## Standard deviation 10.78307 10.75704 10.74308 10.71794 10.69231 10.6532
## Proportion of Variance 0.00235 0.00234 0.00234 0.00233 0.00231 0.0023
## Cumulative Proportion 0.84297 0.84532 0.84765 0.84998 0.85230 0.8546
## PC115 PC116 PC117 PC118 PC119 PC120
## Standard deviation 10.64455 10.61329 10.57802 10.53892 10.52904 10.47428
## Proportion of Variance 0.00229 0.00228 0.00227 0.00225 0.00224 0.00222
## Cumulative Proportion 0.85689 0.85917 0.86143 0.86368 0.86593 0.86815
## PC121 PC122 PC123 PC124 PC125 PC126
## Standard deviation 10.45034 10.4310 10.4142 10.36304 10.33496 10.31549
## Proportion of Variance 0.00221 0.0022 0.0022 0.00217 0.00216 0.00215
## Cumulative Proportion 0.87036 0.8726 0.8748 0.87693 0.87910 0.88125
## PC127 PC128 PC129 PC130 PC131 PC132
## Standard deviation 10.24543 10.23448 10.22264 10.1954 10.1831 10.13543
## Proportion of Variance 0.00213 0.00212 0.00212 0.0021 0.0021 0.00208
## Cumulative Proportion 0.88338 0.88550 0.88761 0.8897 0.8918 0.89390
## PC133 PC134 PC135 PC136 PC137 PC138
## Standard deviation 10.11073 10.08060 10.04825 10.02415 9.99209 9.95934
## Proportion of Variance 0.00207 0.00206 0.00204 0.00203 0.00202 0.00201
## Cumulative Proportion 0.89597 0.89803 0.90007 0.90211 0.90413 0.90614
## PC139 PC140 PC141 PC142 PC143 PC144 PC145
## Standard deviation 9.9331 9.90771 9.89264 9.88573 9.85026 9.84463 9.80067
## Proportion of Variance 0.0020 0.00199 0.00198 0.00198 0.00196 0.00196 0.00194
## Cumulative Proportion 0.9081 0.91012 0.91210 0.91408 0.91605 0.91801 0.91995
## PC146 PC147 PC148 PC149 PC150 PC151 PC152
## Standard deviation 9.76835 9.75409 9.72391 9.6912 9.65726 9.59472 9.57887
## Proportion of Variance 0.00193 0.00193 0.00191 0.0019 0.00189 0.00186 0.00186
## Cumulative Proportion 0.92189 0.92381 0.92573 0.9276 0.92952 0.93138 0.93324
## PC153 PC154 PC155 PC156 PC157 PC158 PC159
## Standard deviation 9.57617 9.55252 9.51417 9.45605 9.4154 9.40551 9.39375
## Proportion of Variance 0.00186 0.00185 0.00183 0.00181 0.0018 0.00179 0.00179
## Cumulative Proportion 0.93510 0.93694 0.93878 0.94059 0.9424 0.94417 0.94596
## PC160 PC161 PC162 PC163 PC164 PC165 PC166
## Standard deviation 9.37338 9.35435 9.30045 9.30020 9.28623 9.23942 9.20900
## Proportion of Variance 0.00178 0.00177 0.00175 0.00175 0.00175 0.00173 0.00172
## Cumulative Proportion 0.94774 0.94951 0.95126 0.95301 0.95476 0.95649 0.95821
## PC167 PC168 PC169 PC170 PC171 PC172 PC173
## Standard deviation 9.19762 9.18027 9.14574 9.12728 9.10385 9.08359 9.05714
## Proportion of Variance 0.00171 0.00171 0.00169 0.00169 0.00168 0.00167 0.00166
## Cumulative Proportion 0.95992 0.96163 0.96332 0.96501 0.96668 0.96835 0.97002
## PC174 PC175 PC176 PC177 PC178 PC179 PC180
## Standard deviation 9.01972 9.00658 8.95894 8.93468 8.86708 8.81541 8.78986
## Proportion of Variance 0.00165 0.00164 0.00163 0.00162 0.00159 0.00157 0.00156
## Cumulative Proportion 0.97166 0.97331 0.97493 0.97655 0.97814 0.97971 0.98128
## PC181 PC182 PC183 PC184 PC185 PC186 PC187
## Standard deviation 8.78160 8.72457 8.69236 8.67873 8.6209 8.54715 8.50667
## Proportion of Variance 0.00156 0.00154 0.00153 0.00153 0.0015 0.00148 0.00147
## Cumulative Proportion 0.98284 0.98438 0.98591 0.98744 0.9889 0.99042 0.99188
## PC188 PC189 PC190 PC191 PC192 PC193 PC194
## Standard deviation 8.43597 8.3068 8.28179 8.19358 7.96104 7.84308 1.753e-13
## Proportion of Variance 0.00144 0.0014 0.00139 0.00136 0.00128 0.00125 0.000e+00
## Cumulative Proportion 0.99333 0.9947 0.99611 0.99747 0.99875 1.00000 1.000e+00
When PCA is performed on data with more columns than rows, the number of components produced is limited by the number of rows. As we can see, 194 PCs were produced, and the first two explain only about 25% of the variance. Normally that would not be good, but considering that the initial data had 49388 columns, it is not that bad. Let's visualise the PCA by variables and by observations.
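Before the plots, a quick sanity check; this is a minimal sketch using the PCA_1 object above, confirming that the number of components is bounded by the number of samples and that the last one is numerically zero because the data were centred.
ncol(PCA_1$x)                             # 194 components, one per sample
tail(summary(PCA_1)$importance[1,], 1)    # sd of PC194 ~ 1.75e-13, effectively zero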
fviz_pca_var(PCA_1,col.var = 'steelblue' )
fviz_pca_ind(PCA_1,col.ind = df$type)
We cannot say much about the first plot: there are too many variables scattered all over it. Looking at the second plot, we can clearly see the two groups of samples, and we can distinguish them easily using just two dimensions that explain about 25% of the variance. There are roughly five borderline points that the algorithms may struggle with. Even so, I expect more than 95% classification accuracy just from looking at this plot. Let's draw some scree plots now.
p_1 <- fviz_eig(PCA_1, choice='eigenvalue')
p_2 <- fviz_eig(PCA_1)
grid.arrange(p_1,p_2,ncol=2)
The first plot shows that the leading PCs have very high eigenvalues that drop off sharply, and the second shows that after about 6 dimensions the variance explained per component is very small. Let's now plot the cumulative explained variance.
ggplot() + geom_line(aes(x=1:194,y=summary(PCA_1)$importance[3,])) +
labs(y="Cumulative Variance", x="index",title="Variance explained") + theme_bw()
The plot is quite unusual: most of the dimensions explain a very small percentage of the variance. Only the first few PCs, accounting for roughly 45% of the cumulative variance, explain a substantial part of the data.
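To put numbers on this, here is a short sketch using the cumulative proportions stored in summary(PCA_1); the counts in the comments are read off the summary printed above.
cum_var <- summary(PCA_1)$importance[3,]   # cumulative proportion of variance
min(which(cum_var >= 0.50))                # 18 PCs are needed for 50% of the variance
min(which(cum_var >= 0.90))                # 135 PCs are needed for 90% of the variance
The full eigenvalue table below lists the exact contribution of every component.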
eig.val<-get_eigenvalue(PCA_1)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 8.421804e+03 1.705302e+01 17.05302
## Dim.2 4.066770e+03 8.234661e+00 25.28768
## Dim.3 2.191429e+03 4.437348e+00 29.72503
## Dim.4 1.444310e+03 2.924534e+00 32.64956
## Dim.5 1.191161e+03 2.411940e+00 35.06150
## Dim.6 1.094471e+03 2.216156e+00 37.27766
## Dim.7 8.177902e+02 1.655915e+00 38.93357
## Dim.8 7.107925e+02 1.439259e+00 40.37283
## Dim.9 6.241581e+02 1.263836e+00 41.63667
## Dim.10 5.558430e+02 1.125507e+00 42.76217
## Dim.11 5.326034e+02 1.078450e+00 43.84062
## Dim.12 5.001742e+02 1.012785e+00 44.85341
## Dim.13 4.614303e+02 9.343343e-01 45.78774
## Dim.14 4.557580e+02 9.228486e-01 46.71059
## Dim.15 4.522577e+02 9.157610e-01 47.62635
## Dim.16 4.112057e+02 8.326361e-01 48.45899
## Dim.17 4.055001e+02 8.210832e-01 49.28007
## Dim.18 3.924214e+02 7.946004e-01 50.07467
## Dim.19 3.819912e+02 7.734807e-01 50.84815
## Dim.20 3.638859e+02 7.368200e-01 51.58497
## Dim.21 3.545045e+02 7.178240e-01 52.30280
## Dim.22 3.396883e+02 6.878230e-01 52.99062
## Dim.23 3.371393e+02 6.826616e-01 53.67328
## Dim.24 3.195798e+02 6.471060e-01 54.32039
## Dim.25 3.097296e+02 6.271607e-01 54.94755
## Dim.26 3.027162e+02 6.129595e-01 55.56051
## Dim.27 2.943521e+02 5.960233e-01 56.15653
## Dim.28 2.887350e+02 5.846495e-01 56.74118
## Dim.29 2.834705e+02 5.739896e-01 57.31517
## Dim.30 2.783215e+02 5.635636e-01 57.87873
## Dim.31 2.677882e+02 5.422351e-01 58.42097
## Dim.32 2.658066e+02 5.382226e-01 58.95919
## Dim.33 2.600025e+02 5.264701e-01 59.48566
## Dim.34 2.472805e+02 5.007097e-01 59.98637
## Dim.35 2.416921e+02 4.893940e-01 60.47577
## Dim.36 2.356111e+02 4.770807e-01 60.95285
## Dim.37 2.341826e+02 4.741882e-01 61.42704
## Dim.38 2.328577e+02 4.715055e-01 61.89854
## Dim.39 2.282022e+02 4.620787e-01 62.36062
## Dim.40 2.214661e+02 4.484391e-01 62.80906
## Dim.41 2.188581e+02 4.431581e-01 63.25222
## Dim.42 2.148440e+02 4.350301e-01 63.68725
## Dim.43 2.100750e+02 4.253736e-01 64.11262
## Dim.44 2.084427e+02 4.220684e-01 64.53469
## Dim.45 2.049664e+02 4.150294e-01 64.94972
## Dim.46 2.004003e+02 4.057837e-01 65.35550
## Dim.47 1.993835e+02 4.037248e-01 65.75923
## Dim.48 1.933819e+02 3.915723e-01 66.15080
## Dim.49 1.925993e+02 3.899877e-01 66.54079
## Dim.50 1.883514e+02 3.813863e-01 66.92217
## Dim.51 1.867895e+02 3.782237e-01 67.30040
## Dim.52 1.856085e+02 3.758322e-01 67.67623
## Dim.53 1.834891e+02 3.715406e-01 68.04777
## Dim.54 1.829237e+02 3.703958e-01 68.41817
## Dim.55 1.786453e+02 3.617326e-01 68.77990
## Dim.56 1.766523e+02 3.576972e-01 69.13759
## Dim.57 1.758661e+02 3.561051e-01 69.49370
## Dim.58 1.734272e+02 3.511668e-01 69.84487
## Dim.59 1.717399e+02 3.477502e-01 70.19262
## Dim.60 1.698128e+02 3.438480e-01 70.53646
## Dim.61 1.693709e+02 3.429533e-01 70.87942
## Dim.62 1.673357e+02 3.388323e-01 71.21825
## Dim.63 1.653100e+02 3.347305e-01 71.55298
## Dim.64 1.643799e+02 3.328471e-01 71.88583
## Dim.65 1.621039e+02 3.282386e-01 72.21407
## Dim.66 1.590827e+02 3.221210e-01 72.53619
## Dim.67 1.582083e+02 3.203504e-01 72.85654
## Dim.68 1.571512e+02 3.182100e-01 73.17475
## Dim.69 1.557279e+02 3.153280e-01 73.49008
## Dim.70 1.535861e+02 3.109912e-01 73.80107
## Dim.71 1.516535e+02 3.070780e-01 74.10815
## Dim.72 1.507962e+02 3.053420e-01 74.41349
## Dim.73 1.494694e+02 3.026554e-01 74.71614
## Dim.74 1.490040e+02 3.017131e-01 75.01786
## Dim.75 1.484530e+02 3.005974e-01 75.31845
## Dim.76 1.472397e+02 2.981405e-01 75.61659
## Dim.77 1.465821e+02 2.968091e-01 75.91340
## Dim.78 1.438168e+02 2.912097e-01 76.20461
## Dim.79 1.426705e+02 2.888885e-01 76.49350
## Dim.80 1.418032e+02 2.871324e-01 76.78063
## Dim.81 1.408631e+02 2.852288e-01 77.06586
## Dim.82 1.399033e+02 2.832853e-01 77.34915
## Dim.83 1.383167e+02 2.800726e-01 77.62922
## Dim.84 1.374264e+02 2.782700e-01 77.90749
## Dim.85 1.364756e+02 2.763447e-01 78.18383
## Dim.86 1.349867e+02 2.733298e-01 78.45716
## Dim.87 1.337808e+02 2.708881e-01 78.72805
## Dim.88 1.334810e+02 2.702810e-01 78.99833
## Dim.89 1.325805e+02 2.684576e-01 79.26679
## Dim.90 1.319016e+02 2.670830e-01 79.53387
## Dim.91 1.305184e+02 2.642821e-01 79.79816
## Dim.92 1.300448e+02 2.633232e-01 80.06148
## Dim.93 1.296992e+02 2.626235e-01 80.32410
## Dim.94 1.280853e+02 2.593554e-01 80.58346
## Dim.95 1.277157e+02 2.586070e-01 80.84207
## Dim.96 1.271510e+02 2.574637e-01 81.09953
## Dim.97 1.261501e+02 2.554370e-01 81.35497
## Dim.98 1.249431e+02 2.529930e-01 81.60796
## Dim.99 1.244533e+02 2.520012e-01 81.85996
## Dim.100 1.238950e+02 2.508707e-01 82.11083
## Dim.101 1.235715e+02 2.502157e-01 82.36105
## Dim.102 1.228819e+02 2.488192e-01 82.60987
## Dim.103 1.219674e+02 2.469675e-01 82.85683
## Dim.104 1.206095e+02 2.442179e-01 83.10105
## Dim.105 1.198522e+02 2.426846e-01 83.34374
## Dim.106 1.195632e+02 2.420993e-01 83.58584
## Dim.107 1.186360e+02 2.402220e-01 83.82606
## Dim.108 1.165047e+02 2.359063e-01 84.06196
## Dim.109 1.162746e+02 2.354404e-01 84.29740
## Dim.110 1.157138e+02 2.343049e-01 84.53171
## Dim.111 1.154137e+02 2.336972e-01 84.76541
## Dim.112 1.148742e+02 2.326048e-01 84.99801
## Dim.113 1.143256e+02 2.314939e-01 85.22950
## Dim.114 1.134911e+02 2.298042e-01 85.45931
## Dim.115 1.133064e+02 2.294302e-01 85.68874
## Dim.116 1.126420e+02 2.280848e-01 85.91682
## Dim.117 1.118945e+02 2.265712e-01 86.14340
## Dim.118 1.110687e+02 2.248992e-01 86.36829
## Dim.119 1.108607e+02 2.244781e-01 86.59277
## Dim.120 1.097105e+02 2.221489e-01 86.81492
## Dim.121 1.092095e+02 2.211346e-01 87.03606
## Dim.122 1.088052e+02 2.203160e-01 87.25637
## Dim.123 1.084565e+02 2.196097e-01 87.47598
## Dim.124 1.073927e+02 2.174557e-01 87.69344
## Dim.125 1.068114e+02 2.162788e-01 87.90972
## Dim.126 1.064094e+02 2.154646e-01 88.12518
## Dim.127 1.049689e+02 2.125479e-01 88.33773
## Dim.128 1.047446e+02 2.120938e-01 88.54982
## Dim.129 1.045025e+02 2.116034e-01 88.76143
## Dim.130 1.039470e+02 2.104786e-01 88.97190
## Dim.131 1.036958e+02 2.099700e-01 89.18187
## Dim.132 1.027269e+02 2.080082e-01 89.38988
## Dim.133 1.022269e+02 2.069958e-01 89.59688
## Dim.134 1.016185e+02 2.057639e-01 89.80264
## Dim.135 1.009673e+02 2.044452e-01 90.00709
## Dim.136 1.004836e+02 2.034658e-01 90.21055
## Dim.137 9.984193e+01 2.021665e-01 90.41272
## Dim.138 9.918854e+01 2.008434e-01 90.61356
## Dim.139 9.866646e+01 1.997863e-01 90.81335
## Dim.140 9.816277e+01 1.987664e-01 91.01212
## Dim.141 9.786431e+01 1.981621e-01 91.21028
## Dim.142 9.772763e+01 1.978853e-01 91.40816
## Dim.143 9.702757e+01 1.964678e-01 91.60463
## Dim.144 9.691674e+01 1.962433e-01 91.80087
## Dim.145 9.605310e+01 1.944946e-01 91.99537
## Dim.146 9.542057e+01 1.932138e-01 92.18858
## Dim.147 9.514227e+01 1.926503e-01 92.38123
## Dim.148 9.455442e+01 1.914600e-01 92.57269
## Dim.149 9.391986e+01 1.901751e-01 92.76287
## Dim.150 9.326271e+01 1.888444e-01 92.95171
## Dim.151 9.205872e+01 1.864065e-01 93.13812
## Dim.152 9.175480e+01 1.857911e-01 93.32391
## Dim.153 9.170303e+01 1.856863e-01 93.50960
## Dim.154 9.125065e+01 1.847703e-01 93.69437
## Dim.155 9.051934e+01 1.832895e-01 93.87766
## Dim.156 8.941693e+01 1.810572e-01 94.05871
## Dim.157 8.865029e+01 1.795049e-01 94.23822
## Dim.158 8.846367e+01 1.791270e-01 94.41735
## Dim.159 8.824251e+01 1.786792e-01 94.59602
## Dim.160 8.786016e+01 1.779050e-01 94.77393
## Dim.161 8.750393e+01 1.771837e-01 94.95111
## Dim.162 8.649843e+01 1.751477e-01 95.12626
## Dim.163 8.649364e+01 1.751380e-01 95.30140
## Dim.164 8.623412e+01 1.746125e-01 95.47601
## Dim.165 8.536692e+01 1.728565e-01 95.64887
## Dim.166 8.480570e+01 1.717201e-01 95.82059
## Dim.167 8.459615e+01 1.712958e-01 95.99188
## Dim.168 8.427735e+01 1.706503e-01 96.16253
## Dim.169 8.364465e+01 1.693691e-01 96.33190
## Dim.170 8.330717e+01 1.686858e-01 96.50059
## Dim.171 8.288001e+01 1.678209e-01 96.66841
## Dim.172 8.251160e+01 1.670749e-01 96.83549
## Dim.173 8.203175e+01 1.661032e-01 97.00159
## Dim.174 8.135533e+01 1.647336e-01 97.16632
## Dim.175 8.111856e+01 1.642542e-01 97.33058
## Dim.176 8.026267e+01 1.625211e-01 97.49310
## Dim.177 7.982859e+01 1.616422e-01 97.65474
## Dim.178 7.862510e+01 1.592052e-01 97.81394
## Dim.179 7.771145e+01 1.573552e-01 97.97130
## Dim.180 7.726159e+01 1.564443e-01 98.12774
## Dim.181 7.711649e+01 1.561505e-01 98.28389
## Dim.182 7.611813e+01 1.541290e-01 98.43802
## Dim.183 7.555704e+01 1.529928e-01 98.59102
## Dim.184 7.532035e+01 1.525136e-01 98.74353
## Dim.185 7.432012e+01 1.504882e-01 98.89402
## Dim.186 7.305381e+01 1.479241e-01 99.04194
## Dim.187 7.236339e+01 1.465261e-01 99.18847
## Dim.188 7.116562e+01 1.441008e-01 99.33257
## Dim.189 6.900235e+01 1.397205e-01 99.47229
## Dim.190 6.858803e+01 1.388815e-01 99.61117
## Dim.191 6.713481e+01 1.359390e-01 99.74711
## Dim.192 6.337820e+01 1.283323e-01 99.87544
## Dim.193 6.151393e+01 1.245574e-01 100.00000
## Dim.194 3.072873e-26 6.222155e-29 100.00000
I decided to use only those PCs that individually explain more than 1% of the variance.
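The cut-off can be applied programmatically instead of being read off the table; a small sketch based on the eig.val object above:
n_pc <- sum(eig.val$variance.percent > 1)
n_pc   # 12 components individually explain more than 1% of the variance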
Our data are balanced, so we will take 15 random samples from each class as the test data. The best approach would be to perform k-fold cross-validation, but with five types of models it could take too much time (a sketch of what a stratified split could look like is shown after the code below).
set.seed(123)
df %>% dplyr::select(samples,type) %>% filter(type=='adenocarcinoma') %>% pull(samples) %>%
sample(15) ->t_1
df %>% dplyr::select(samples,type) %>% filter(type=='normal') %>% pull(samples) %>%
sample(15) ->t_2
train <- df %>% filter(!samples %in% c(t_1,t_2))
test <- df %>% filter(samples %in% c(t_1,t_2))
train_X <- train %>% dplyr::select(-samples,-type)
train_Y <- train %>% dplyr::select(type)
test_X <- test %>% dplyr::select(-samples,-type)
test_Y <- test %>% dplyr::select(type)
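As noted above, k-fold cross-validation would be the more rigorous choice. A minimal sketch of how a stratified split could be built with dplyr is given here for reference only; n_folds and fold_id are names introduced for illustration and are not used elsewhere in the analysis.
# Stratified k-fold assignment: each class is cut into n_folds groups of
# (nearly) equal size, so every fold keeps the adenocarcinoma/normal balance.
n_folds <- 5
fold_id <- df %>%
  group_by(type) %>%
  mutate(fold = sample(rep(1:n_folds, length.out = n()))) %>%
  ungroup() %>%
  pull(fold)
# Example usage: fold i serves as the test set, the remaining folds as training data
# for (i in 1:n_folds) { train_i <- df[fold_id != i, ]; test_i <- df[fold_id == i, ] }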
Let's also perform PCA on the training data and visualise which samples were taken as test data. As stated before, I keep only the PCs with explained variance greater than 1%, which turned out to be the first 12 PCs.
X_PCA<-prcomp(train_X,center=T,scale.=T)
test_X_N <- predict(X_PCA,test_X)
train_X_N <- X_PCA$x[,1:12]
test_X_N <- test_X_N[,1:12]
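For reference, predict() on a prcomp object simply centres and scales the new samples with the training statistics and multiplies by the rotation (loadings) matrix; a minimal sketch of the manual equivalent, introduced here only for illustration:
test_manual <- scale(as.matrix(test_X), center = X_PCA$center, scale = X_PCA$scale) %*% X_PCA$rotation
all.equal(unname(test_manual[,1:12]), unname(test_X_N))   # expected TRUE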
plot(train_X_N[,1:2],col='black',cex=1.5)
points(test_X_N[,1:2],pch=21,bg="purple",cex=1.5)
legend('bottomright',legend=c('train','test'),fill=c('black','purple'))
As we can see on the plot, almost all test points lie well inside their groups, so most of the algorithms should have no problem classifying them. Now I will create data frames to store the computing times and accuracy values.
df_acc <- data.frame( train_test = character(), CLARA=numeric(),CLARA_PCA=numeric(),
KMEANS=numeric(),KMEANS_PCA=numeric(),
HCLUST=numeric(),HCLUST_PCA=numeric(),SVM=numeric(),SVM_PCA=numeric(),
RF=numeric(),RF_PCA=numeric())
df_acc[1:2,2:11] <- 0
df_acc$train_test <- c('train','test')
# creating a data frame to store training times
df_time <- data.frame( CLARA=numeric(),CLARA_PCA=numeric(),
KMEANS=numeric(),KMEANS_PCA=numeric(),
HCLUST=numeric(),HCLUST_PCA=numeric(),SVM=numeric(),SVM_PCA=numeric(),
RF=numeric(),RF_PCA=numeric())
df_time[1,1:10] <- 0
rownames(df_time) <- 'train_time'
The first three algorithms are clustering methods. They group objects so that observations within the same group (cluster) are more similar to each other than to those in other groups. The first method, CLARA (Clustering Large Applications), is an extension of the k-medoids (PAM) approach designed to handle data with a large number of objects while keeping computing time low, which it achieves through sampling. The next method is k-means, the most popular clustering method, which partitions observations into k clusters so that each observation belongs to the cluster with the nearest mean (centroid); a minimal illustrative sketch of this idea is given below. The last clustering method is hclust, hierarchical clustering, which builds a hierarchy of clusters. The other two methods are supervised algorithms. A Support Vector Machine (SVM) uses a linear or non-linear decision boundary to separate the feature space into two or more regions, called classes. Random forest is an ensemble method that trains many decision trees in parallel on bootstrap samples and aggregates their predictions. Next, I fit each model on the raw data and on the dimension-reduced data.
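To make the k-means idea concrete, here is a minimal, self-contained sketch of the Lloyd-style iteration (illustrative only; the analysis below uses eclust(), and simple_kmeans is a name introduced here). Empty clusters are not handled.
# Minimal k-means sketch: assign each point to the nearest centroid,
# recompute the centroids, repeat until the assignments stop changing.
simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]   # random initial centroids
  assign_old <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # squared Euclidean distance from every point to every centroid
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centers[j, ])^2))
    assign_new <- max.col(-d2)                        # index of the nearest centroid
    if (all(assign_new == assign_old)) break
    assign_old <- assign_new
    for (j in seq_len(k)) centers[j, ] <- colMeans(X[assign_new == j, , drop = FALSE])
  }
  list(cluster = assign_new, centers = centers)
}
# Example usage on the 12 retained PCs computed above:
# km <- simple_kmeans(train_X_N, k = 2); table(km$cluster, train_Y$type)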
Training: CLARA (raw data)
ptm_1 <- proc.time()
clara_train <- eclust(train_X,"clara", hc_metric="euclidean",k=2,graph=F)
t_1 <- proc.time() - ptm_1
table_1 <- table(clara_train$cluster,train_Y$type)
table_1
##
## adenocarcinoma normal
## 1 6 82
## 2 76 0
score_clara_train <- sum(apply(table_1, 2, max))/164
df_acc[1,"CLARA"] <- round(score_clara_train*100,2)
df_time[1] <- t_1[3]
paste('Accuracy on training data:', score_clara_train)
## [1] "Accuracy on training data: 0.963414634146341"
paste('Model trained in:' , t_1[3] ,' sec.')
## [1] "Model trained in: 29.97 sec."
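Since cluster labels are arbitrary, accuracy is computed throughout by matching each true class to the cluster that captures most of its members, i.e. the sum of the column maxima of the confusion table divided by the number of samples. A small helper could wrap this calculation; cluster_accuracy is a name introduced here for illustration, while the article keeps the explicit formula.
# Accuracy of a cluster-vs-class table: for each true class take the count of
# its best-matching cluster, sum these and divide by the total number of samples.
cluster_accuracy <- function(tab) sum(apply(tab, 2, max)) / sum(tab)
# cluster_accuracy(table_1) reproduces score_clara_train (~0.9634)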
Test: CLARA (raw data)
clara_test_kcca<-as.kcca(clara_train, train_X)
clara_pred<-predict(clara_test_kcca, test_X)
table_11 <- table(clara_pred,test_Y$type)
table_11
##
## clara_pred adenocarcinoma normal
## 1 0 15
## 2 15 0
score_clara_test <- sum(apply(table_11, 2, max))/30
df_acc[2,"CLARA"] <- round(score_clara_test*100,2)
paste('Accuracy on test data:', score_clara_test)
## [1] "Accuracy on test data: 1"
Training: k-means (raw data)
ptm_2 <- proc.time()
kmeans_train <- eclust(train_X,"kmeans", hc_metric="euclidean",k=2,graph=F)
t_2 <- proc.time() - ptm_2
table_2 <- table(kmeans_train$cluster,train_Y$type)
table_2
##
## adenocarcinoma normal
## 1 80 0
## 2 2 82
score_kmeans_train <- sum(apply(table_2, 2, max))/164
df_acc[1,"KMEANS"] <- round(score_kmeans_train*100,2)
df_time[3] <- t_2[3]
paste('Accuracy on training data:', score_kmeans_train)
## [1] "Accuracy on training data: 0.98780487804878"
paste('Model trained in:' , t_2[3] ,' sec.')
## [1] "Model trained in: 24.27 sec."
Test: k-means (raw data)
kmeans_test_kcca<-as.kcca(kmeans_train, train_X)
kmeans_pred<-predict(kmeans_test_kcca, test_X)
table_21 <- table(kmeans_pred,test_Y$type)
table_21
##
## kmeans_pred adenocarcinoma normal
## 1 15 0
## 2 0 15
kmeans_clara_test <- sum(apply(table_21, 2, max))/30
df_acc[2,"KMEANS"] <- round(kmeans_clara_test*100,2)
paste('Accuracy on test data:', kmeans_clara_test)
## [1] "Accuracy on test data: 1"
Training: hierarchical clustering (raw data)
ptm_3 <- proc.time()
hclust_train <- eclust(train_X,"hclust", hc_metric="euclidean",k=2,graph=F)
t_3 <- proc.time() - ptm_3
table_3 <- table(hclust_train$cluster,train_Y$type)
table_3
##
## adenocarcinoma normal
## 1 1 80
## 2 81 2
score_hclust_train <- sum(apply(table_3, 2, max))/164
df_acc[1,"HCLUST"] <- round(score_hclust_train*100,2)
df_time[5] <- t_3[3]
paste('Accuracy on training data:', score_hclust_train)
## [1] "Accuracy on training data: 0.981707317073171"
paste('Model trained in:' , t_3[3] ,' sec.')
## [1] "Model trained in: 29.19 sec."
Test: hierarchical clustering (raw data)
Since hclust has no predict method, test samples are assigned to the training clusters with a 1-nearest-neighbour classifier.
groups<-cutree(hclust_train,k=2)
hclust_pred<-knn(train_X, test_X,k=1,cl=groups)
table_31 <- table(hclust_pred,test_Y$type)
table_31
##
## hclust_pred adenocarcinoma normal
## 1 0 15
## 2 15 0
kmeans_hclust_test <- sum(apply(table_31, 2, max))/30
df_acc[2,"HCLUST"] <- round(kmeans_hclust_test*100,2)
paste('Accuracy on test data:', kmeans_hclust_test)
## [1] "Accuracy on test data: 1"
Training: SVM (raw data)
ptm_4 <- proc.time()
model4 <- svm(x=train_X,y=factor(train_Y$type), kernel = "linear", scale = FALSE)
t_4 <- proc.time() - ptm_4
table_4 <- table(model4$fitted,train_Y$type)
table_4
##
## adenocarcinoma normal
## adenocarcinoma 82 0
## normal 0 82
svm_train <- sum(apply(table_4, 2, max))/164
df_acc[1,"SVM"] <- round(svm_train*100,2)
df_time[7] <- t_4[3]
paste('Accuracy on training data:', svm_train)
## [1] "Accuracy on training data: 1"
paste('Model trained in:' , t_4[3] ,' sec.')
## [1] "Model trained in: 4.69 sec."
Test: SVM (raw data)
model4_predict <- predict(model4,test_X)
table_41 <- table(model4_predict,test_Y$type)
table_41
##
## model4_predict adenocarcinoma normal
## adenocarcinoma 15 0
## normal 0 15
svm_test <- sum(apply(table_41, 2, max))/30
df_acc[2,"SVM"] <- round(svm_test*100,2)
paste('Accuracy on test data:', svm_test)
## [1] "Accuracy on test data: 1"
Training: random forest (raw data)
ptm_5 <- proc.time()
model5 <- randomForest(x=train_X,y=factor(train_Y$type))
t_5 <- proc.time() - ptm_5
table_5 <- table(model5$predicted,train_Y$type)
table_5
##
## adenocarcinoma normal
## adenocarcinoma 79 1
## normal 3 81
rf_train <- sum(apply(table_5, 2, max))/164
df_acc[1,"RF"] <- round(rf_train*100,2)
df_time[9] <- t_5[3]
paste('Accuracy on training data:', rf_train)
## [1] "Accuracy on training data: 0.975609756097561"
paste('Model trained in:' , t_5[3] ,' sec.')
## [1] "Model trained in: 77.95 sec."
Test: random forest (raw data)
model5_predict <- predict(model5,test_X)
table_51 <- table(model5_predict,test_Y$type)
table_51
##
## model5_predict adenocarcinoma normal
## adenocarcinoma 15 0
## normal 0 15
rf_test <- sum(apply(table_51, 2, max))/30
df_acc[2,"RF"] <- round(rf_test*100,2)
paste('Accuracy on test data:', rf_test)
## [1] "Accuracy on test data: 1"
Training: CLARA (after PCA)
ptm_6 <- proc.time()
clara_train_pca <- eclust(train_X_N,"clara", hc_metric="euclidean",k=2,graph = F)
t_6 <- proc.time() - ptm_6
table_1_pca <- table(clara_train_pca$cluster,train_Y$type)
table_1_pca
##
## adenocarcinoma normal
## 1 3 81
## 2 79 1
score_clara_train_pca <- sum(apply(table_1_pca, 2, max))/164
df_acc[1,"CLARA_PCA"] <- round(score_clara_train_pca*100,2)
df_time[2] <- t_6[3]
paste('Accuracy on training data:', score_clara_train_pca)
## [1] "Accuracy on training data: 0.975609756097561"
paste('Model trained in:' , t_6[3] ,' sec.')
## [1] "Model trained in: 0.00999999999999091 sec."
Test: CLARA (after PCA)
clara_test_kcca_pca<-as.kcca(clara_train_pca, train_X_N)
clara_pred_pca<-predict(clara_test_kcca_pca, test_X_N)
table_11_pca <- table(clara_pred_pca,test_Y$type)
table_11_pca
##
## clara_pred_pca adenocarcinoma normal
## 1 0 15
## 2 15 0
score_clara_test_pca <- sum(apply(table_11_pca, 2, max))/30
df_acc[2,"CLARA_PCA"] <- round(score_clara_test_pca*100,2)
paste('Accuracy on test data:', score_clara_test_pca)
## [1] "Accuracy on test data: 1"
Training: k-means (after PCA)
ptm_7 <- proc.time()
kmeans_train_pca <- eclust(train_X_N,"kmeans", hc_metric="euclidean",k=2,graph=F)
t_7 <- proc.time() - ptm_7
table_2_pca <- table(kmeans_train_pca$cluster,train_Y$type)
table_2_pca
##
## adenocarcinoma normal
## 1 78 0
## 2 4 82
score_kmeans_train_pca <- sum(apply(table_2_pca, 2, max))/164
df_acc[1,"KMEANS_PCA"] <- round(score_kmeans_train_pca*100,2)
df_time[4] <- t_7[3]
paste('Accuracy on training data:', score_kmeans_train_pca)
## [1] "Accuracy on training data: 0.975609756097561"
paste('Model trained in:' , t_7[3] ,' sec.')
## [1] "Model trained in: 0 sec."
Test: k-means (after PCA)
kmeans_test_kcca_pca<-as.kcca(kmeans_train_pca, train_X_N)
kmeans_pred_pca<-predict(kmeans_test_kcca_pca, test_X_N)
table_21_pca <- table(kmeans_pred_pca,test_Y$type)
table_21_pca
##
## kmeans_pred_pca adenocarcinoma normal
## 1 15 0
## 2 0 15
score_kmeans_test_pca <- sum(apply(table_21_pca, 2, max))/30
df_acc[2,"KMEANS_PCA"] <- round(score_kmeans_test_pca*100,2)
paste('Accuracy on test data:', score_kmeans_test_pca)
## [1] "Accuracy on test data: 1"
Training: hierarchical clustering (after PCA)
ptm_8 <- proc.time()
hclust_train_pca <- eclust(train_X_N,"hclust", hc_metric="euclidean",k=2,graph=F)
t_8 <- proc.time() - ptm_8
table_3_pca <- table(hclust_train_pca$cluster,train_Y$type)
table_3_pca
##
## adenocarcinoma normal
## 1 2 82
## 2 80 0
score_hclust_train_pca <- sum(apply(table_3_pca, 2, max))/164
df_acc[1,"HCLUST_PCA"] <- round(score_hclust_train_pca*100,2)
df_time[6] <- t_8[3]
paste('Accuracy on training data:', score_hclust_train_pca)
## [1] "Accuracy on training data: 0.98780487804878"
paste('Model trained in:' , t_8[3] ,' sec.')
## [1] "Model trained in: 0.0100000000000193 sec."
Test: hierarchical clustering (after PCA)
As before, test samples are assigned to the training clusters with a 1-nearest-neighbour classifier.
groups_N<-cutree(hclust_train_pca,k=2)
hclust_pred_N<-knn(train_X_N, test_X_N,k=1,cl=groups_N)
table_31_pca <- table(hclust_pred_N,test_Y$type)
table_31_pca
##
## hclust_pred_N adenocarcinoma normal
## 1 0 15
## 2 15 0
score_hclust_test_PCA <- sum(apply(table_31_pca, 2, max))/30
df_acc[2,"HCLUST_PCA"] <- round(score_hclust_test_PCA*100,2)
paste('Accuracy on test data:', score_hclust_test_PCA)
## [1] "Accuracy on test data: 1"
Training: SVM (after PCA)
ptm_9 <- proc.time()
model4_PCA <- svm(x=train_X_N,y=factor(train_Y$type), kernel = "linear", scale = FALSE)
t_9 <- proc.time() - ptm_9
table_4_PCA <- table(model4_PCA$fitted,train_Y$type)
table_4_PCA
##
## adenocarcinoma normal
## adenocarcinoma 82 0
## normal 0 82
svm_train_PCA <- sum(apply(table_4_PCA, 2, max))/164
df_acc[1,"SVM_PCA"] <- round(svm_train_PCA*100,2)
df_time[8] <- t_9[3]
paste('Accuracy on training data:', svm_train_PCA)
## [1] "Accuracy on training data: 1"
paste('Model trained in:' , t_9[3] ,' sec.')
## [1] "Model trained in: 0 sec."
Test: SVM (after PCA)
model4_predict_PCA <- predict(model4_PCA,test_X_N)
table_41_PCA <- table(model4_predict_PCA,test_Y$type)
table_41_PCA
##
## model4_predict_PCA adenocarcinoma normal
## adenocarcinoma 15 0
## normal 0 15
svm_test_PCA <- sum(apply(table_41_PCA, 2, max))/30
df_acc[2,"SVM_PCA"] <- round(svm_test_PCA*100,2)
paste('Accuracy on test data:', svm_test_PCA)
## [1] "Accuracy on test data: 1"
Training: random forest (after PCA)
ptm_10 <- proc.time()
model5_PCA <- randomForest(x=train_X_N,y=factor(train_Y$type))
t_10 <- proc.time() - ptm_10
table_5_PCA <- table(model5_PCA$predicted,train_Y$type)
table_5_PCA
##
## adenocarcinoma normal
## adenocarcinoma 81 1
## normal 1 81
rf_train_PCA <- sum(apply(table_5_PCA, 2, max))/164
df_acc[1,"RF_PCA"] <- round(rf_train_PCA*100,2)
df_time[10] <- t_10[3]
paste('Accuracy on training data:', rf_train_PCA)
## [1] "Accuracy on training data: 0.98780487804878"
paste('Model trained in:' , t_10[3] ,' sec.')
## [1] "Model trained in: 0.0499999999999829 sec."
Test: random forest (after PCA)
model5_predict_PCA <- predict(model5_PCA,test_X_N)
table_51_PCA <- table(model5_predict_PCA,test_Y$type)
table_51_PCA
##
## model5_predict_PCA adenocarcinoma normal
## adenocarcinoma 15 0
## normal 0 15
rf_test_PCA <- sum(apply(table_51_PCA, 2, max))/30
df_acc[2,"RF_PCA"] <- round(rf_test_PCA*100,2)
paste('Accuracy on test data:', rf_test_PCA)
## [1] "Accuracy on test data: 1"
At first glance we can see that the computing time dropped drastically: only random forest takes a noticeable fraction of a second, and all the other algorithms finish almost instantly. When it comes to accuracy, once again every algorithm classifies the test data perfectly. Classification of the training data is also good: CLARA and k-means each misclassified 4 observations, hclust and random forest 2, and SVM classified everything correctly.
First, let's print the table of computation times.
df_time %>% kable()
|  | CLARA | CLARA_PCA | KMEANS | KMEANS_PCA | HCLUST | HCLUST_PCA | SVM | SVM_PCA | RF | RF_PCA |
|---|---|---|---|---|---|---|---|---|---|---|
| train_time | 29.97 | 0.01 | 24.27 | 0 | 29.19 | 0.01 | 4.69 | 0 | 77.95 | 0.05 |
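The gap is easier to appreciate on a chart; here is a short sketch that reshapes the timing table with pivot_longer (from tidyr, loaded with the tidyverse) and plots the training times on a log scale. The zero entries are clamped to 0.01 s so that they survive the log transform.
df_time %>%
  pivot_longer(everything(), names_to = 'model', values_to = 'seconds') %>%
  mutate(seconds = pmax(seconds, 0.01)) %>%
  ggplot(aes(x = reorder(model, seconds), y = seconds)) +
  geom_col(fill = 'steelblue') +
  scale_y_log10() +
  coord_flip() +
  labs(x = NULL, y = 'Training time [s, log scale]') +
  theme_bw()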
As we can see, there is an enormous difference between the running times of the algorithms on the raw data and on the data after PCA. Let's look at the accuracy table now.
df_acc %>% kable()
| train_test | CLARA | CLARA_PCA | KMEANS | KMEANS_PCA | HCLUST | HCLUST_PCA | SVM | SVM_PCA | RF | RF_PCA |
|---|---|---|---|---|---|---|---|---|---|---|
| train | 96.34 | 97.56 | 98.78 | 97.56 | 98.17 | 98.78 | 100 | 100 | 97.56 | 98.78 |
| test | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100 | 100 | 100.00 | 100.00 |
We can see that training accuracy even improved after PCA; only k-means performs slightly worse. This is great news: there was no trade-off between computing time and accuracy.
Summary
The goal of this article was to check how dimension reduction with PCA influences the classification procedure. The running time of the algorithms clearly drops drastically. Making statements about accuracy is trickier: PCA does not seem to have any negative influence on the predictions, but the methodology was not perfect here and k-fold cross-validation should be performed. The data themselves are unusual, with far more columns than rows. On this data set, SVM appears to be the best classification method compared with the others.
Bibliography
https://stanford.edu/~cpiech/cs221/handouts/kmeans.html
https://www.datanovia.com/en/lessons/clara-in-r-clustering-large-applications/
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/hclust
https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
https://stackoverflow.com/questions/42325276/how-to-use-pca-on-test-set-code
Comments about classification on raw data
As we can see, none of the algorithms had any problem predicting the test data. On the training data the best algorithm is SVM, which classified the whole training set correctly. This is not a surprise, as SVM tends to overfit when the data have this many columns. CLARA fitted the data worst, with 6 observations misclassified; random forest misclassified 4, hclust 3 and k-means 2. When it comes to computing time, SVM is again the winner, taking only a few seconds to run. The clustering methods needed 20 to 30 seconds, and, surprisingly to me, random forest was the slowest, taking more than a minute.