Introduction

In this article I examine how dimension reduction influences data classification by comparing prediction accuracy and model computation time before and after applying Principal Component Analysis (PCA). The analysis is conducted on colorectal cancer data from the Curated Microarray Database (CuMiDa). The article is divided into 2 parts. The first part is an exploratory data analysis, in which I mainly use PCA to examine the data. The second part covers fitting and evaluating 5 types of models: CLARA, k-means, hierarchical clustering, SVM and Random Forest. The first 3 are unsupervised clustering algorithms, while SVM and Random Forest are supervised methods that are widely used in many fields and very often produce good results.

Exploratory Data Analysis

Loading libraries

library(data.table)
library(randomForest)
library(factoextra)
library(e1071)
library(flexclust)
library(tidyverse)
library(class)
library(clusterSim)
library(gridExtra)
library(ggplot2)
library(knitr)

Let's read the data and check its dimensions.

df <- fread('Colorectal_GSE44076.csv')
dim(df)
## [1]   194 49388

As we can see, there are 194 rows and 49388 columns. Having far more columns than rows is usually problematic, but since our goal is to show how PCA affects classification, the more columns there are to reduce, the better. Let's summarise the first 6 columns to see what part of the data looks like.

summary(df[,1:6]) %>% kable()
samples          type               11715100_at      11715101_s_at    11715102_x_at    11715103_x_at
Min.   :648.0    Length:194         Min.   :2.313    Min.   :3.035    Min.   :2.399    Min.   :2.464
1st Qu.:696.2    Class :character   1st Qu.:3.019    1st Qu.:3.844    1st Qu.:3.096    1st Qu.:3.395
Median :745.5    Mode  :character   Median :3.275    Median :4.094    Median :3.399    Median :3.637
Mean   :745.3    NA                 Mean   :3.402    Mean   :4.177    Mean   :3.479    Mean   :3.673
3rd Qu.:794.8    NA                 3rd Qu.:3.601    3rd Qu.:4.419    3rd Qu.:3.682    3rd Qu.:3.846
Max.   :843.0    NA                 Max.   :6.209    Max.   :6.033    Max.   :5.927    Max.   :5.434

As we can see, the first column, 'samples', contains a unique identifier for each observation, the 'type' column holds the tissue type, and all remaining columns are numeric expression measurements for each probe. Let's check the class distribution.

table(df$type)
## 
## adenocarcinoma         normal 
##             97             97

Perfect, the data is balanced. There are 2 classes: adenocarcinoma and normal tissue. Let's check whether there are any missing values.

sum(is.na(df))
## [1] 0

There are no missing values. Let's explore the data using PCA. I use only PCA here because the MDS algorithm I tried was not optimised for data this wide; it took too long to finish and my patience ran out.

PCA_1 <- prcomp(df[,3:ncol(df)],center=T,scale.=T)
summary(PCA_1)
## Importance of components:
##                            PC1      PC2      PC3      PC4      PC5      PC6
## Standard deviation     91.7704 63.77123 46.81270 38.00408 34.51319 33.08278
## Proportion of Variance  0.1705  0.08235  0.04437  0.02925  0.02412  0.02216
## Cumulative Proportion   0.1705  0.25288  0.29725  0.32650  0.35062  0.37278
##                             PC7      PC8      PC9     PC10     PC11     PC12
## Standard deviation     28.59703 26.66069 24.98316 23.57632 23.07820 22.36457
## Proportion of Variance  0.01656  0.01439  0.01264  0.01126  0.01078  0.01013
## Cumulative Proportion   0.38934  0.40373  0.41637  0.42762  0.43841  0.44853
##                            PC13     PC14     PC15     PC16     PC17     PC18
## Standard deviation     21.48093 21.34849 21.26635 20.27821 20.13703 19.80963
## Proportion of Variance  0.00934  0.00923  0.00916  0.00833  0.00821  0.00795
## Cumulative Proportion   0.45788  0.46711  0.47626  0.48459  0.49280  0.50075
##                            PC19     PC20     PC21     PC22     PC23     PC24
## Standard deviation     19.54459 19.07579 18.82829 18.43063 18.36135 17.87679
## Proportion of Variance  0.00773  0.00737  0.00718  0.00688  0.00683  0.00647
## Cumulative Proportion   0.50848  0.51585  0.52303  0.52991  0.53673  0.54320
##                            PC25     PC26     PC27     PC28     PC29     PC30
## Standard deviation     17.59914 17.39874 17.15669 16.99220 16.83658 16.68297
## Proportion of Variance  0.00627  0.00613  0.00596  0.00585  0.00574  0.00564
## Cumulative Proportion   0.54948  0.55561  0.56157  0.56741  0.57315  0.57879
##                            PC31     PC32     PC33     PC34     PC35     PC36
## Standard deviation     16.36424 16.30358 16.12459 15.72515 15.54645 15.34963
## Proportion of Variance  0.00542  0.00538  0.00526  0.00501  0.00489  0.00477
## Cumulative Proportion   0.58421  0.58959  0.59486  0.59986  0.60476  0.60953
##                            PC37     PC38     PC39     PC40     PC41     PC42
## Standard deviation     15.30302 15.25968 15.10636 14.88174 14.79385 14.65756
## Proportion of Variance  0.00474  0.00472  0.00462  0.00448  0.00443  0.00435
## Cumulative Proportion   0.61427  0.61899  0.62361  0.62809  0.63252  0.63687
##                            PC43     PC44     PC45     PC46     PC47     PC48
## Standard deviation     14.49396 14.43754 14.31665 14.15628 14.12032 13.90618
## Proportion of Variance  0.00425  0.00422  0.00415  0.00406  0.00404  0.00392
## Cumulative Proportion   0.64113  0.64535  0.64950  0.65356  0.65759  0.66151
##                           PC49     PC50     PC51     PC52     PC53    PC54
## Standard deviation     13.8780 13.72412 13.66710 13.62382 13.54581 13.5249
## Proportion of Variance  0.0039  0.00381  0.00378  0.00376  0.00372  0.0037
## Cumulative Proportion   0.6654  0.66922  0.67300  0.67676  0.68048  0.6842
##                            PC55     PC56     PC57     PC58     PC59     PC60
## Standard deviation     13.36582 13.29106 13.26145 13.16918 13.10496 13.03122
## Proportion of Variance  0.00362  0.00358  0.00356  0.00351  0.00348  0.00344
## Cumulative Proportion   0.68780  0.69138  0.69494  0.69845  0.70193  0.70536
##                            PC61     PC62     PC63     PC64     PC65     PC66
## Standard deviation     13.01426 12.93583 12.85729 12.82107 12.73200 12.61280
## Proportion of Variance  0.00343  0.00339  0.00335  0.00333  0.00328  0.00322
## Cumulative Proportion   0.70879  0.71218  0.71553  0.71886  0.72214  0.72536
##                           PC67     PC68     PC69     PC70     PC71     PC72
## Standard deviation     12.5781 12.53599 12.47910 12.39299 12.31477 12.27991
## Proportion of Variance  0.0032  0.00318  0.00315  0.00311  0.00307  0.00305
## Cumulative Proportion   0.7286  0.73175  0.73490  0.73801  0.74108  0.74413
##                            PC73     PC74     PC75     PC76     PC77     PC78
## Standard deviation     12.22577 12.20672 12.18413 12.13424 12.10711 11.99237
## Proportion of Variance  0.00303  0.00302  0.00301  0.00298  0.00297  0.00291
## Cumulative Proportion   0.74716  0.75018  0.75318  0.75617  0.75913  0.76205
##                            PC79     PC80     PC81     PC82    PC83     PC84
## Standard deviation     11.94448 11.90811 11.86858 11.82807 11.7608 11.72290
## Proportion of Variance  0.00289  0.00287  0.00285  0.00283  0.0028  0.00278
## Cumulative Proportion   0.76494  0.76781  0.77066  0.77349  0.7763  0.77907
##                            PC85     PC86     PC87    PC88     PC89     PC90
## Standard deviation     11.68228 11.61838 11.56636 11.5534 11.51436 11.48484
## Proportion of Variance  0.00276  0.00273  0.00271  0.0027  0.00268  0.00267
## Cumulative Proportion   0.78184  0.78457  0.78728  0.7900  0.79267  0.79534
##                            PC91     PC92     PC93     PC94     PC95     PC96
## Standard deviation     11.42446 11.40372 11.38856 11.31748 11.30114 11.27613
## Proportion of Variance  0.00264  0.00263  0.00263  0.00259  0.00259  0.00257
## Cumulative Proportion   0.79798  0.80061  0.80324  0.80583  0.80842  0.81100
##                            PC97     PC98     PC99    PC100   PC101    PC102
## Standard deviation     11.23166 11.17780 11.15586 11.13081 11.1163 11.08521
## Proportion of Variance  0.00255  0.00253  0.00252  0.00251  0.0025  0.00249
## Cumulative Proportion   0.81355  0.81608  0.81860  0.82111  0.8236  0.82610
##                           PC103    PC104    PC105    PC106   PC107    PC108
## Standard deviation     11.04388 10.98223 10.94770 10.93449 10.8920 10.79373
## Proportion of Variance  0.00247  0.00244  0.00243  0.00242  0.0024  0.00236
## Cumulative Proportion   0.82857  0.83101  0.83344  0.83586  0.8383  0.84062
##                           PC109    PC110    PC111    PC112    PC113   PC114
## Standard deviation     10.78307 10.75704 10.74308 10.71794 10.69231 10.6532
## Proportion of Variance  0.00235  0.00234  0.00234  0.00233  0.00231  0.0023
## Cumulative Proportion   0.84297  0.84532  0.84765  0.84998  0.85230  0.8546
##                           PC115    PC116    PC117    PC118    PC119    PC120
## Standard deviation     10.64455 10.61329 10.57802 10.53892 10.52904 10.47428
## Proportion of Variance  0.00229  0.00228  0.00227  0.00225  0.00224  0.00222
## Cumulative Proportion   0.85689  0.85917  0.86143  0.86368  0.86593  0.86815
##                           PC121   PC122   PC123    PC124    PC125    PC126
## Standard deviation     10.45034 10.4310 10.4142 10.36304 10.33496 10.31549
## Proportion of Variance  0.00221  0.0022  0.0022  0.00217  0.00216  0.00215
## Cumulative Proportion   0.87036  0.8726  0.8748  0.87693  0.87910  0.88125
##                           PC127    PC128    PC129   PC130   PC131    PC132
## Standard deviation     10.24543 10.23448 10.22264 10.1954 10.1831 10.13543
## Proportion of Variance  0.00213  0.00212  0.00212  0.0021  0.0021  0.00208
## Cumulative Proportion   0.88338  0.88550  0.88761  0.8897  0.8918  0.89390
##                           PC133    PC134    PC135    PC136   PC137   PC138
## Standard deviation     10.11073 10.08060 10.04825 10.02415 9.99209 9.95934
## Proportion of Variance  0.00207  0.00206  0.00204  0.00203 0.00202 0.00201
## Cumulative Proportion   0.89597  0.89803  0.90007  0.90211 0.90413 0.90614
##                         PC139   PC140   PC141   PC142   PC143   PC144   PC145
## Standard deviation     9.9331 9.90771 9.89264 9.88573 9.85026 9.84463 9.80067
## Proportion of Variance 0.0020 0.00199 0.00198 0.00198 0.00196 0.00196 0.00194
## Cumulative Proportion  0.9081 0.91012 0.91210 0.91408 0.91605 0.91801 0.91995
##                          PC146   PC147   PC148  PC149   PC150   PC151   PC152
## Standard deviation     9.76835 9.75409 9.72391 9.6912 9.65726 9.59472 9.57887
## Proportion of Variance 0.00193 0.00193 0.00191 0.0019 0.00189 0.00186 0.00186
## Cumulative Proportion  0.92189 0.92381 0.92573 0.9276 0.92952 0.93138 0.93324
##                          PC153   PC154   PC155   PC156  PC157   PC158   PC159
## Standard deviation     9.57617 9.55252 9.51417 9.45605 9.4154 9.40551 9.39375
## Proportion of Variance 0.00186 0.00185 0.00183 0.00181 0.0018 0.00179 0.00179
## Cumulative Proportion  0.93510 0.93694 0.93878 0.94059 0.9424 0.94417 0.94596
##                          PC160   PC161   PC162   PC163   PC164   PC165   PC166
## Standard deviation     9.37338 9.35435 9.30045 9.30020 9.28623 9.23942 9.20900
## Proportion of Variance 0.00178 0.00177 0.00175 0.00175 0.00175 0.00173 0.00172
## Cumulative Proportion  0.94774 0.94951 0.95126 0.95301 0.95476 0.95649 0.95821
##                          PC167   PC168   PC169   PC170   PC171   PC172   PC173
## Standard deviation     9.19762 9.18027 9.14574 9.12728 9.10385 9.08359 9.05714
## Proportion of Variance 0.00171 0.00171 0.00169 0.00169 0.00168 0.00167 0.00166
## Cumulative Proportion  0.95992 0.96163 0.96332 0.96501 0.96668 0.96835 0.97002
##                          PC174   PC175   PC176   PC177   PC178   PC179   PC180
## Standard deviation     9.01972 9.00658 8.95894 8.93468 8.86708 8.81541 8.78986
## Proportion of Variance 0.00165 0.00164 0.00163 0.00162 0.00159 0.00157 0.00156
## Cumulative Proportion  0.97166 0.97331 0.97493 0.97655 0.97814 0.97971 0.98128
##                          PC181   PC182   PC183   PC184  PC185   PC186   PC187
## Standard deviation     8.78160 8.72457 8.69236 8.67873 8.6209 8.54715 8.50667
## Proportion of Variance 0.00156 0.00154 0.00153 0.00153 0.0015 0.00148 0.00147
## Cumulative Proportion  0.98284 0.98438 0.98591 0.98744 0.9889 0.99042 0.99188
##                          PC188  PC189   PC190   PC191   PC192   PC193     PC194
## Standard deviation     8.43597 8.3068 8.28179 8.19358 7.96104 7.84308 1.753e-13
## Proportion of Variance 0.00144 0.0014 0.00139 0.00136 0.00128 0.00125 0.000e+00
## Cumulative Proportion  0.99333 0.9947 0.99611 0.99747 0.99875 1.00000 1.000e+00

When PCA is performed on data with more columns than rows, the maximum number of PCs produced equals the number of rows. As we can see, 194 PCs were produced and the first 2 PCs explain only about 25% of the variance. Normally that would not be much, but considering that the initial data has 49388 columns, it is not that bad. Let's visualise the PCA by variables and by observations.
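
As a quick sanity check (a short sketch using the PCA_1 object created above, not part of the original output), we can confirm the number of components and the variance captured by the first two:

ncol(PCA_1$x)                            # number of PCs returned: 194, the number of samples
sum(summary(PCA_1)$importance[2, 1:2])   # proportion of variance in PC1 + PC2, about 0.25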

fviz_pca_var(PCA_1,col.var = 'steelblue' )

fviz_pca_ind(PCA_1,col.ind = df$type)

We cannot say much about the first plot; there are too many variables scattered all over it. Looking at the second plot, however, we can clearly see the two groups. We can distinguish them easily using just 2 dimensions that explain 25% of the variance. There are roughly 5 ambiguous points that the algorithms may struggle with, yet judging from this plot alone I expect classification accuracy above 95%. Let's draw some scree plots now.
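
As a possible variant (not part of the original analysis), factoextra can also draw concentration ellipses around the groups, which would make the separation even more apparent:

fviz_pca_ind(PCA_1, habillage = df$type, addEllipses = TRUE)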

p_1 <- fviz_eig(PCA_1, choice='eigenvalue')
p_2 <- fviz_eig(PCA_1)
grid.arrange(p_1,p_2,ncol=2)

From the first plot we can see that the leading PCs have very high eigenvalues that drop off sharply; from the second plot we can see that after about 6 dimensions the variance explained per component is very small. Let's now plot the cumulative variance.

ggplot() + geom_line(aes(x=1:194,y=summary(PCA_1)$importance[3,])) + 
  labs(y="Cumulative Variance", x="index",title="Variance explained") + theme_bw()

The plot is quite unusual: most of the dimensions explain only a very small percentage of the variance. Only the first handful of PCs, covering up to roughly 45% of the cumulative variance, explain a meaningful share of the data individually.

eig.val<-get_eigenvalue(PCA_1)
eig.val
##           eigenvalue variance.percent cumulative.variance.percent
## Dim.1   8.421804e+03     1.705302e+01                    17.05302
## Dim.2   4.066770e+03     8.234661e+00                    25.28768
## Dim.3   2.191429e+03     4.437348e+00                    29.72503
## Dim.4   1.444310e+03     2.924534e+00                    32.64956
## Dim.5   1.191161e+03     2.411940e+00                    35.06150
## Dim.6   1.094471e+03     2.216156e+00                    37.27766
## Dim.7   8.177902e+02     1.655915e+00                    38.93357
## Dim.8   7.107925e+02     1.439259e+00                    40.37283
## Dim.9   6.241581e+02     1.263836e+00                    41.63667
## Dim.10  5.558430e+02     1.125507e+00                    42.76217
## Dim.11  5.326034e+02     1.078450e+00                    43.84062
## Dim.12  5.001742e+02     1.012785e+00                    44.85341
## Dim.13  4.614303e+02     9.343343e-01                    45.78774
## Dim.14  4.557580e+02     9.228486e-01                    46.71059
## Dim.15  4.522577e+02     9.157610e-01                    47.62635
## Dim.16  4.112057e+02     8.326361e-01                    48.45899
## Dim.17  4.055001e+02     8.210832e-01                    49.28007
## Dim.18  3.924214e+02     7.946004e-01                    50.07467
## Dim.19  3.819912e+02     7.734807e-01                    50.84815
## Dim.20  3.638859e+02     7.368200e-01                    51.58497
## Dim.21  3.545045e+02     7.178240e-01                    52.30280
## Dim.22  3.396883e+02     6.878230e-01                    52.99062
## Dim.23  3.371393e+02     6.826616e-01                    53.67328
## Dim.24  3.195798e+02     6.471060e-01                    54.32039
## Dim.25  3.097296e+02     6.271607e-01                    54.94755
## Dim.26  3.027162e+02     6.129595e-01                    55.56051
## Dim.27  2.943521e+02     5.960233e-01                    56.15653
## Dim.28  2.887350e+02     5.846495e-01                    56.74118
## Dim.29  2.834705e+02     5.739896e-01                    57.31517
## Dim.30  2.783215e+02     5.635636e-01                    57.87873
## Dim.31  2.677882e+02     5.422351e-01                    58.42097
## Dim.32  2.658066e+02     5.382226e-01                    58.95919
## Dim.33  2.600025e+02     5.264701e-01                    59.48566
## Dim.34  2.472805e+02     5.007097e-01                    59.98637
## Dim.35  2.416921e+02     4.893940e-01                    60.47577
## Dim.36  2.356111e+02     4.770807e-01                    60.95285
## Dim.37  2.341826e+02     4.741882e-01                    61.42704
## Dim.38  2.328577e+02     4.715055e-01                    61.89854
## Dim.39  2.282022e+02     4.620787e-01                    62.36062
## Dim.40  2.214661e+02     4.484391e-01                    62.80906
## Dim.41  2.188581e+02     4.431581e-01                    63.25222
## Dim.42  2.148440e+02     4.350301e-01                    63.68725
## Dim.43  2.100750e+02     4.253736e-01                    64.11262
## Dim.44  2.084427e+02     4.220684e-01                    64.53469
## Dim.45  2.049664e+02     4.150294e-01                    64.94972
## Dim.46  2.004003e+02     4.057837e-01                    65.35550
## Dim.47  1.993835e+02     4.037248e-01                    65.75923
## Dim.48  1.933819e+02     3.915723e-01                    66.15080
## Dim.49  1.925993e+02     3.899877e-01                    66.54079
## Dim.50  1.883514e+02     3.813863e-01                    66.92217
## Dim.51  1.867895e+02     3.782237e-01                    67.30040
## Dim.52  1.856085e+02     3.758322e-01                    67.67623
## Dim.53  1.834891e+02     3.715406e-01                    68.04777
## Dim.54  1.829237e+02     3.703958e-01                    68.41817
## Dim.55  1.786453e+02     3.617326e-01                    68.77990
## Dim.56  1.766523e+02     3.576972e-01                    69.13759
## Dim.57  1.758661e+02     3.561051e-01                    69.49370
## Dim.58  1.734272e+02     3.511668e-01                    69.84487
## Dim.59  1.717399e+02     3.477502e-01                    70.19262
## Dim.60  1.698128e+02     3.438480e-01                    70.53646
## Dim.61  1.693709e+02     3.429533e-01                    70.87942
## Dim.62  1.673357e+02     3.388323e-01                    71.21825
## Dim.63  1.653100e+02     3.347305e-01                    71.55298
## Dim.64  1.643799e+02     3.328471e-01                    71.88583
## Dim.65  1.621039e+02     3.282386e-01                    72.21407
## Dim.66  1.590827e+02     3.221210e-01                    72.53619
## Dim.67  1.582083e+02     3.203504e-01                    72.85654
## Dim.68  1.571512e+02     3.182100e-01                    73.17475
## Dim.69  1.557279e+02     3.153280e-01                    73.49008
## Dim.70  1.535861e+02     3.109912e-01                    73.80107
## Dim.71  1.516535e+02     3.070780e-01                    74.10815
## Dim.72  1.507962e+02     3.053420e-01                    74.41349
## Dim.73  1.494694e+02     3.026554e-01                    74.71614
## Dim.74  1.490040e+02     3.017131e-01                    75.01786
## Dim.75  1.484530e+02     3.005974e-01                    75.31845
## Dim.76  1.472397e+02     2.981405e-01                    75.61659
## Dim.77  1.465821e+02     2.968091e-01                    75.91340
## Dim.78  1.438168e+02     2.912097e-01                    76.20461
## Dim.79  1.426705e+02     2.888885e-01                    76.49350
## Dim.80  1.418032e+02     2.871324e-01                    76.78063
## Dim.81  1.408631e+02     2.852288e-01                    77.06586
## Dim.82  1.399033e+02     2.832853e-01                    77.34915
## Dim.83  1.383167e+02     2.800726e-01                    77.62922
## Dim.84  1.374264e+02     2.782700e-01                    77.90749
## Dim.85  1.364756e+02     2.763447e-01                    78.18383
## Dim.86  1.349867e+02     2.733298e-01                    78.45716
## Dim.87  1.337808e+02     2.708881e-01                    78.72805
## Dim.88  1.334810e+02     2.702810e-01                    78.99833
## Dim.89  1.325805e+02     2.684576e-01                    79.26679
## Dim.90  1.319016e+02     2.670830e-01                    79.53387
## Dim.91  1.305184e+02     2.642821e-01                    79.79816
## Dim.92  1.300448e+02     2.633232e-01                    80.06148
## Dim.93  1.296992e+02     2.626235e-01                    80.32410
## Dim.94  1.280853e+02     2.593554e-01                    80.58346
## Dim.95  1.277157e+02     2.586070e-01                    80.84207
## Dim.96  1.271510e+02     2.574637e-01                    81.09953
## Dim.97  1.261501e+02     2.554370e-01                    81.35497
## Dim.98  1.249431e+02     2.529930e-01                    81.60796
## Dim.99  1.244533e+02     2.520012e-01                    81.85996
## Dim.100 1.238950e+02     2.508707e-01                    82.11083
## Dim.101 1.235715e+02     2.502157e-01                    82.36105
## Dim.102 1.228819e+02     2.488192e-01                    82.60987
## Dim.103 1.219674e+02     2.469675e-01                    82.85683
## Dim.104 1.206095e+02     2.442179e-01                    83.10105
## Dim.105 1.198522e+02     2.426846e-01                    83.34374
## Dim.106 1.195632e+02     2.420993e-01                    83.58584
## Dim.107 1.186360e+02     2.402220e-01                    83.82606
## Dim.108 1.165047e+02     2.359063e-01                    84.06196
## Dim.109 1.162746e+02     2.354404e-01                    84.29740
## Dim.110 1.157138e+02     2.343049e-01                    84.53171
## Dim.111 1.154137e+02     2.336972e-01                    84.76541
## Dim.112 1.148742e+02     2.326048e-01                    84.99801
## Dim.113 1.143256e+02     2.314939e-01                    85.22950
## Dim.114 1.134911e+02     2.298042e-01                    85.45931
## Dim.115 1.133064e+02     2.294302e-01                    85.68874
## Dim.116 1.126420e+02     2.280848e-01                    85.91682
## Dim.117 1.118945e+02     2.265712e-01                    86.14340
## Dim.118 1.110687e+02     2.248992e-01                    86.36829
## Dim.119 1.108607e+02     2.244781e-01                    86.59277
## Dim.120 1.097105e+02     2.221489e-01                    86.81492
## Dim.121 1.092095e+02     2.211346e-01                    87.03606
## Dim.122 1.088052e+02     2.203160e-01                    87.25637
## Dim.123 1.084565e+02     2.196097e-01                    87.47598
## Dim.124 1.073927e+02     2.174557e-01                    87.69344
## Dim.125 1.068114e+02     2.162788e-01                    87.90972
## Dim.126 1.064094e+02     2.154646e-01                    88.12518
## Dim.127 1.049689e+02     2.125479e-01                    88.33773
## Dim.128 1.047446e+02     2.120938e-01                    88.54982
## Dim.129 1.045025e+02     2.116034e-01                    88.76143
## Dim.130 1.039470e+02     2.104786e-01                    88.97190
## Dim.131 1.036958e+02     2.099700e-01                    89.18187
## Dim.132 1.027269e+02     2.080082e-01                    89.38988
## Dim.133 1.022269e+02     2.069958e-01                    89.59688
## Dim.134 1.016185e+02     2.057639e-01                    89.80264
## Dim.135 1.009673e+02     2.044452e-01                    90.00709
## Dim.136 1.004836e+02     2.034658e-01                    90.21055
## Dim.137 9.984193e+01     2.021665e-01                    90.41272
## Dim.138 9.918854e+01     2.008434e-01                    90.61356
## Dim.139 9.866646e+01     1.997863e-01                    90.81335
## Dim.140 9.816277e+01     1.987664e-01                    91.01212
## Dim.141 9.786431e+01     1.981621e-01                    91.21028
## Dim.142 9.772763e+01     1.978853e-01                    91.40816
## Dim.143 9.702757e+01     1.964678e-01                    91.60463
## Dim.144 9.691674e+01     1.962433e-01                    91.80087
## Dim.145 9.605310e+01     1.944946e-01                    91.99537
## Dim.146 9.542057e+01     1.932138e-01                    92.18858
## Dim.147 9.514227e+01     1.926503e-01                    92.38123
## Dim.148 9.455442e+01     1.914600e-01                    92.57269
## Dim.149 9.391986e+01     1.901751e-01                    92.76287
## Dim.150 9.326271e+01     1.888444e-01                    92.95171
## Dim.151 9.205872e+01     1.864065e-01                    93.13812
## Dim.152 9.175480e+01     1.857911e-01                    93.32391
## Dim.153 9.170303e+01     1.856863e-01                    93.50960
## Dim.154 9.125065e+01     1.847703e-01                    93.69437
## Dim.155 9.051934e+01     1.832895e-01                    93.87766
## Dim.156 8.941693e+01     1.810572e-01                    94.05871
## Dim.157 8.865029e+01     1.795049e-01                    94.23822
## Dim.158 8.846367e+01     1.791270e-01                    94.41735
## Dim.159 8.824251e+01     1.786792e-01                    94.59602
## Dim.160 8.786016e+01     1.779050e-01                    94.77393
## Dim.161 8.750393e+01     1.771837e-01                    94.95111
## Dim.162 8.649843e+01     1.751477e-01                    95.12626
## Dim.163 8.649364e+01     1.751380e-01                    95.30140
## Dim.164 8.623412e+01     1.746125e-01                    95.47601
## Dim.165 8.536692e+01     1.728565e-01                    95.64887
## Dim.166 8.480570e+01     1.717201e-01                    95.82059
## Dim.167 8.459615e+01     1.712958e-01                    95.99188
## Dim.168 8.427735e+01     1.706503e-01                    96.16253
## Dim.169 8.364465e+01     1.693691e-01                    96.33190
## Dim.170 8.330717e+01     1.686858e-01                    96.50059
## Dim.171 8.288001e+01     1.678209e-01                    96.66841
## Dim.172 8.251160e+01     1.670749e-01                    96.83549
## Dim.173 8.203175e+01     1.661032e-01                    97.00159
## Dim.174 8.135533e+01     1.647336e-01                    97.16632
## Dim.175 8.111856e+01     1.642542e-01                    97.33058
## Dim.176 8.026267e+01     1.625211e-01                    97.49310
## Dim.177 7.982859e+01     1.616422e-01                    97.65474
## Dim.178 7.862510e+01     1.592052e-01                    97.81394
## Dim.179 7.771145e+01     1.573552e-01                    97.97130
## Dim.180 7.726159e+01     1.564443e-01                    98.12774
## Dim.181 7.711649e+01     1.561505e-01                    98.28389
## Dim.182 7.611813e+01     1.541290e-01                    98.43802
## Dim.183 7.555704e+01     1.529928e-01                    98.59102
## Dim.184 7.532035e+01     1.525136e-01                    98.74353
## Dim.185 7.432012e+01     1.504882e-01                    98.89402
## Dim.186 7.305381e+01     1.479241e-01                    99.04194
## Dim.187 7.236339e+01     1.465261e-01                    99.18847
## Dim.188 7.116562e+01     1.441008e-01                    99.33257
## Dim.189 6.900235e+01     1.397205e-01                    99.47229
## Dim.190 6.858803e+01     1.388815e-01                    99.61117
## Dim.191 6.713481e+01     1.359390e-01                    99.74711
## Dim.192 6.337820e+01     1.283323e-01                    99.87544
## Dim.193 6.151393e+01     1.245574e-01                   100.00000
## Dim.194 3.072873e-26     6.222155e-29                   100.00000

I decided to use only those PCs that each explain more than 1% of the variance.
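
A quick way to pull these components out (a short sketch using the eig.val object computed above; the pcs_keep name is mine and is not part of the original code):

pcs_keep <- which(eig.val$variance.percent > 1)
length(pcs_keep)   # 12 components each explain more than 1% of the variance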

Classification

Data preparation - splitting the data into train and test sets

Our data is balanced, so we will take 15 random samples from each class as the test data. The best approach would be k-fold cross-validation, but with 5 types of models it could take too much time; a sketch of how stratified folds could be built is shown after the split below.

set.seed(123)
df %>% dplyr::select(samples,type) %>% filter(type=='adenocarcinoma') %>% pull(samples) %>% 
  sample(15) ->t_1

df %>% dplyr::select(samples,type) %>% filter(type=='normal') %>% pull(samples) %>% 
  sample(15) ->t_2

train <- df %>% filter(!samples %in% c(t_1,t_2))
test <- df %>% filter(samples %in% c(t_1,t_2))

train_X <- train %>% dplyr::select(-samples,-type)
train_Y <- train %>% dplyr::select(type)
test_X <- test %>% dplyr::select(-samples,-type)
test_Y <- test %>% dplyr::select(type)
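
For completeness, here is a minimal sketch of how stratified folds for the cross-validation mentioned above could be assigned (assuming 5 folds; illustrative only, not run in this analysis):

set.seed(123)
folds <- df %>%
  group_by(type) %>%
  mutate(fold = sample(rep(1:5, length.out = n()))) %>%
  ungroup() %>%
  pull(fold)
table(folds, df$type)   # each fold keeps a roughly equal class balance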

Let's also perform PCA on the training data and visualise which samples were taken as test data. As stated previously, I keep only the PCs whose explained variance is greater than 1%, which turns out to be the first 12 PCs.

X_PCA <- prcomp(train_X, center = T, scale. = T)   # PCA fitted on the training data only
test_X_N <- predict(X_PCA, test_X)                 # project the test data onto the training PCs
train_X_N <- X_PCA$x[, 1:12]                       # keep the first 12 PCs (each explains > 1% of variance)
test_X_N <- test_X_N[, 1:12]

plot(train_X_N[,1:2],col='black',cex=1.5)
points(test_X_N[,1:2],pch=21,bg="purple",cex=1.5)
legend('bottomright',legend=c('train','test'),fill=c('black','purple'))

As we can see on the plot, almost all test points lie near the centres of their groups, so most of the algorithms will probably have no problem classifying them. Now I will create data frames to store computation times and accuracy values.

df_acc <- data.frame( train_test = character(), CLARA=numeric(),CLARA_PCA=numeric(),
                     KMEANS=numeric(),KMEANS_PCA=numeric(),
                     HCLUST=numeric(),HCLUST_PCA=numeric(),SVM=numeric(),SVM_PCA=numeric(),
                     RF=numeric(),RF_PCA=numeric())
df_acc[1:2,2:11] <- 0
df_acc$train_test <- c('train','test')

#creating matrix to store time computing 
df_time <- data.frame( CLARA=numeric(),CLARA_PCA=numeric(),
                      KMEANS=numeric(),KMEANS_PCA=numeric(),
                      HCLUST=numeric(),HCLUST_PCA=numeric(),SVM=numeric(),SVM_PCA=numeric(),
                      RF=numeric(),RF_PCA=numeric())
df_time[1,1:10] <- 0
rownames(df_time) <- 'train_time'

Performing classification

The first 3 algorithms are clustering methods: they group objects so that observations in the same group (cluster) are more similar to each other than to those in other groups. The first, CLARA (Clustering Large Applications), is an extension of the k-medoids (PAM) approach designed for data containing a large number of objects; it reduces computing time by working on samples of the data. The next method is k-means, the most popular clustering method, which partitions observations into k clusters so that each observation belongs to the cluster with the nearest mean (centroid). The last clustering method is hclust, hierarchical clustering, which builds a hierarchy of clusters. The other two methods are supervised algorithms. A Support Vector Machine (SVM) uses a linear or non-linear decision boundary that separates the feature space into two or more regions called classes. Random Forest is an ensemble method that trains several decision trees in parallel on bootstrap samples and then aggregates their predictions. Because the clustering methods return arbitrary cluster labels rather than class labels, their accuracy below is computed by matching each true class to the cluster that captures most of its members, as sketched in the helper after this paragraph. Now I am going to run each model on the raw data and on the dimension-reduced data.
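
The scoring used in the clustering sections below is the expression sum(apply(table, 2, max)) / n. Wrapped as a small helper it would look like this (the function name and wrapper are mine and are not part of the original code):

# For a cluster-vs-class cross-tabulation, credit each true class to the
# cluster that contains most of its members, then divide by the sample size.
cluster_accuracy <- function(cross_tab) {
  sum(apply(cross_tab, 2, max)) / sum(cross_tab)
}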

Classification on raw data

Clara

Training

ptm_1 <- proc.time()
clara_train <- eclust(train_X,"clara", hc_metric="euclidean",k=2,graph=F)
t_1 <- proc.time() - ptm_1
table_1 <- table(clara_train$cluster,train_Y$type)
table_1
##    
##     adenocarcinoma normal
##   1              6     82
##   2             76      0
score_clara_train <- sum(apply(table_1, 2, max))/164
df_acc[1,"CLARA"] <- round(score_clara_train*100,2)
df_time[1] <- t_1[3]
paste('Accuracy on training data:', score_clara_train)
## [1] "Accuracy on training data: 0.963414634146341"
paste('Model trained in:' , t_1[3] ,' sec.')
## [1] "Model trained in: 29.97  sec."

Test

clara_test_kcca <- as.kcca(clara_train, train_X)   # convert to a flexclust kcca object so predict() can be used on new data
clara_pred <- predict(clara_test_kcca, test_X)     # assign each test observation to its nearest cluster
table_11 <- table(clara_pred,test_Y$type)
table_11
##           
## clara_pred adenocarcinoma normal
##          1              0     15
##          2             15      0
score_clara_test <- sum(apply(table_11, 2, max))/30
df_acc[2,"CLARA"] <- round(score_clara_test*100,2)
paste('Accuracy on test data:', score_clara_test)
## [1] "Accuracy on test data: 1"

Kmeans

Training

ptm_2 <- proc.time()
kmeans_train <- eclust(train_X,"kmeans", hc_metric="euclidean",k=2,graph=F)
t_2 <- proc.time() - ptm_2
table_2 <- table(kmeans_train$cluster,train_Y$type)
table_2
##    
##     adenocarcinoma normal
##   1             80      0
##   2              2     82
score_kmeans_train <- sum(apply(table_2, 2, max))/164
df_acc[1,"KMEANS"] <- round(score_kmeans_train*100,2)
df_time[3] <- t_2[3]
paste('Accuracy on training data:', score_kmeans_train)
## [1] "Accuracy on training data: 0.98780487804878"
paste('Model trained in:' , t_2[3] ,' sec.')
## [1] "Model trained in: 24.27  sec."

Test

kmeans_test_kcca<-as.kcca(kmeans_train, train_X) 
kmeans_pred<-predict(kmeans_test_kcca, test_X) 
table_21 <- table(kmeans_pred,test_Y$type)
table_21
##            
## kmeans_pred adenocarcinoma normal
##           1             15      0
##           2              0     15
kmeans_clara_test <- sum(apply(table_21, 2, max))/30
df_acc[2,"KMEANS"] <- round(kmeans_clara_test*100,2)
paste('Accuracy on test data:', kmeans_clara_test)
## [1] "Accuracy on test data: 1"

Hclust

Training

ptm_3 <- proc.time()
hclust_train <- eclust(train_X,"hclust", hc_metric="euclidean",k=2,graph=F)
t_3 <- proc.time() - ptm_3
table_3 <- table(hclust_train$cluster,train_Y$type)
table_3
##    
##     adenocarcinoma normal
##   1              1     80
##   2             81      2
score_hclust_train <- sum(apply(table_3, 2, max))/164
df_acc[1,"HCLUST"] <- round(score_hclust_train*100,2)
df_time[5] <- t_3[3]
paste('Accuracy on training data:', score_hclust_train)
## [1] "Accuracy on training data: 0.981707317073171"
paste('Model trained in:' , t_3[3] ,' sec.')
## [1] "Model trained in: 29.19  sec."

Test

groups <- cutree(hclust_train, k = 2)                      # cut the dendrogram into 2 clusters
hclust_pred <- knn(train_X, test_X, cl = groups, k = 1)    # label test points by their nearest training neighbour's cluster
table_31 <- table(hclust_pred,test_Y$type)
table_31
##            
## hclust_pred adenocarcinoma normal
##           1              0     15
##           2             15      0
kmeans_hclust_test <- sum(apply(table_31, 2, max))/30
df_acc[2,"HCLUST"] <- round(kmeans_hclust_test*100,2)
paste('Accuracy on test data:', kmeans_hclust_test)
## [1] "Accuracy on test data: 1"

SVM

Training

ptm_4 <- proc.time()
model4 <- svm(x=train_X,y=factor(train_Y$type), kernel = "linear", scale = FALSE)
t_4 <- proc.time() - ptm_4
table_4 <- table(model4$fitted,train_Y$type)
table_4
##                 
##                  adenocarcinoma normal
##   adenocarcinoma             82      0
##   normal                      0     82
svm_train <- sum(apply(table_4, 2, max))/164
df_acc[1,"SVM"] <- round(svm_train*100,2)
df_time[7] <- t_4[3]
paste('Accuracy on training data:', svm_train)
## [1] "Accuracy on training data: 1"
paste('Model trained in:' , t_4[3] ,' sec.')
## [1] "Model trained in: 4.69  sec."

Test

model4_predict <- predict(model4,test_X)
table_41 <- table(model4_predict,test_Y$type)
table_41
##                 
## model4_predict   adenocarcinoma normal
##   adenocarcinoma             15      0
##   normal                      0     15
svm_test <- sum(apply(table_41, 2, max))/30
df_acc[2,"SVM"] <- round(svm_test*100,2)
paste('Accuracy on test data:', svm_test)
## [1] "Accuracy on test data: 1"

Random Forest

Training

ptm_5 <- proc.time()
model5 <- randomForest(x=train_X,y=factor(train_Y$type))
t_5 <- proc.time() - ptm_5
table_5 <- table(model5$predicted,train_Y$type)
table_5
##                 
##                  adenocarcinoma normal
##   adenocarcinoma             79      1
##   normal                      3     81
rf_train <- sum(apply(table_5, 2, max))/164
df_acc[1,"RF"] <- round(rf_train*100,2)
df_time[9] <- t_5[3]
paste('Accuracy on training data:', rf_train)
## [1] "Accuracy on training data: 0.975609756097561"
paste('Model trained in:' , t_5[3] ,' sec.')
## [1] "Model trained in: 77.95  sec."

Test

model5_predict <- predict(model5,test_X)
table_51 <- table(model5_predict,test_Y$type)
table_51
##                 
## model5_predict   adenocarcinoma normal
##   adenocarcinoma             15      0
##   normal                      0     15
rf_test <- sum(apply(table_51, 2, max))/30
df_acc[2,"RF"] <- round(rf_test*100,2)
paste('Accuracy on test data:', rf_test)
## [1] "Accuracy on test data: 1"

Comments about classification on raw data

As we can see, none of the algorithms had any problem predicting the test data. On the training data the best algorithm is SVM, which classified the whole training set correctly. This is not a surprise, as SVM tends to overfit and we have many columns. CLARA fitted the data worst, with 6 observations misclassified; RF misclassified 4, Hclust 3 and k-means 2. When it comes to computing time SVM is again the winner, taking only a few seconds to run. The clustering methods needed between 20 and 30 seconds and, surprisingly for me, RF was the slowest, taking more than a minute.

Classification on PCA-reduced data

Clara

Training

ptm_6 <- proc.time()
clara_train_pca <- eclust(train_X_N,"clara", hc_metric="euclidean",k=2,graph = F)
t_6 <- proc.time() - ptm_6
table_1_pca <- table(clara_train_pca$cluster,train_Y$type)
table_1_pca
##    
##     adenocarcinoma normal
##   1              3     81
##   2             79      1
score_clara_train_pca <- sum(apply(table_1_pca, 2, max))/164
df_acc[1,"CLARA_PCA"] <- round(score_clara_train_pca*100,2)
df_time[2] <- t_6[3]
paste('Accuracy on training data:', score_clara_train_pca)
## [1] "Accuracy on training data: 0.975609756097561"
paste('Model trained in:' , t_6[3] ,' sec.')
## [1] "Model trained in: 0.00999999999999091  sec."

Test

clara_test_kcca_pca<-as.kcca(clara_train_pca, train_X_N) 
clara_pred_pca<-predict(clara_test_kcca_pca, test_X_N) 
table_11_pca <- table(clara_pred_pca,test_Y$type)
table_11_pca
##               
## clara_pred_pca adenocarcinoma normal
##              1              0     15
##              2             15      0
score_clara_test_pca <- sum(apply(table_11_pca, 2, max))/30
df_acc[2,"CLARA_PCA"] <- round(score_clara_test_pca*100,2)
paste('Accuracy on test data:', score_clara_test_pca)
## [1] "Accuracy on test data: 1"

Kmeans

Training

ptm_7 <- proc.time()
kmeans_train_pca <- eclust(train_X_N,"kmeans", hc_metric="euclidean",k=2,graph=F)
t_7 <- proc.time() - ptm_7
table_2_pca <- table(kmeans_train_pca$cluster,train_Y$type)
table_2_pca
##    
##     adenocarcinoma normal
##   1             78      0
##   2              4     82
score_kmeans_train_pca <- sum(apply(table_2_pca, 2, max))/164
df_acc[1,"KMEANS_PCA"] <- round(score_kmeans_train_pca*100,2)
df_time[4] <- t_7[3]
paste('Accuracy on training data:', score_kmeans_train_pca)
## [1] "Accuracy on training data: 0.975609756097561"
paste('Model trained in:' , t_7[3] ,' sec.')
## [1] "Model trained in: 0  sec."

Test

kmeans_test_kcca_pca<-as.kcca(kmeans_train_pca, train_X_N) 
kmeans_pred_pca<-predict(kmeans_test_kcca_pca, test_X_N) 
table_21_pca <- table(kmeans_pred_pca,test_Y$type)
table_21_pca
##                
## kmeans_pred_pca adenocarcinoma normal
##               1             15      0
##               2              0     15
score_kmeans_test_pca <- sum(apply(table_21_pca, 2, max))/30
df_acc[2,"KMEANS_PCA"] <- round(score_kmeans_test_pca*100,2)
paste('Accuracy on test data:', score_kmeans_test_pca)
## [1] "Accuracy on test data: 1"

Hclust

Training

ptm_8 <- proc.time()
hclust_train_pca <- eclust(train_X_N,"hclust", hc_metric="euclidean",k=2,graph=F)
t_8 <- proc.time() - ptm_8
table_3_pca <- table(hclust_train_pca$cluster,train_Y$type)
table_3_pca
##    
##     adenocarcinoma normal
##   1              2     82
##   2             80      0
score_hclust_train_pca <- sum(apply(table_3_pca, 2, max))/164
df_acc[1,"HCLUST_PCA"] <- round(score_hclust_train_pca*100,2)
df_time[6] <- t_8[3]
paste('Accuracy on training data:', score_hclust_train_pca)
## [1] "Accuracy on training data: 0.98780487804878"
paste('Model trained in:' , t_8[3] ,' sec.')
## [1] "Model trained in: 0.0100000000000193  sec."

Test

groups_N<-cutree(hclust_train_pca,k=2)
hclust_pred_N<-knn(train_X_N, test_X_N,k=1,cl=groups_N) 
table_31_pca <- table(hclust_pred_N,test_Y$type)
table_31_pca
##              
## hclust_pred_N adenocarcinoma normal
##             1              0     15
##             2             15      0
score_hclust_test_PCA <- sum(apply(table_31_pca, 2, max))/30
df_acc[2,"HCLUST_PCA"] <- round(score_hclust_test_PCA*100,2)
paste('Accuracy on test data:', score_hclust_test_PCA)
## [1] "Accuracy on test data: 1"

SVM

Training

ptm_9 <- proc.time()
model4_PCA <- svm(x=train_X_N,y=factor(train_Y$type), kernel = "linear", scale = FALSE)
t_9 <- proc.time() - ptm_9
table_4_PCA <- table(model4_PCA$fitted,train_Y$type)
table_4_PCA
##                 
##                  adenocarcinoma normal
##   adenocarcinoma             82      0
##   normal                      0     82
svm_train_PCA <- sum(apply(table_4_PCA, 2, max))/164
df_acc[1,"SVM_PCA"] <- round(svm_train_PCA*100,2)
df_time[8] <- t_9[3]
paste('Accuracy on training data:', svm_train_PCA)
## [1] "Accuracy on training data: 1"
paste('Model trained in:' , t_9[3] ,' sec.')
## [1] "Model trained in: 0  sec."

Test

model4_predict_PCA <- predict(model4_PCA,test_X_N)
table_41_PCA <- table(model4_predict_PCA,test_Y$type)
table_41_PCA
##                   
## model4_predict_PCA adenocarcinoma normal
##     adenocarcinoma             15      0
##     normal                      0     15
svm_test_PCA <- sum(apply(table_41_PCA, 2, max))/30
df_acc[2,"SVM_PCA"] <- round(svm_test_PCA*100,2)
paste('Accuracy on test data:', svm_test_PCA)
## [1] "Accuracy on test data: 1"

Random Forest

Training

ptm_10 <- proc.time()
model5_PCA <- randomForest(x=train_X_N,y=factor(train_Y$type))
t_10 <- proc.time() - ptm_10
table_5_PCA <- table(model5_PCA$predicted,train_Y$type)
table_5_PCA
##                 
##                  adenocarcinoma normal
##   adenocarcinoma             81      1
##   normal                      1     81
rf_train_PCA <- sum(apply(table_5_PCA, 2, max))/164
df_acc[1,"RF_PCA"] <- round(rf_train_PCA*100,2)
df_time[10] <- t_10[3]
paste('Accuracy on training data:', rf_train_PCA)
## [1] "Accuracy on training data: 0.98780487804878"
paste('Model trained in:' , t_10[3] ,' sec.')
## [1] "Model trained in: 0.0499999999999829  sec."

Test

model5_predict_PCA <- predict(model5_PCA,test_X_N)
table_51_PCA <- table(model5_predict_PCA,test_Y$type)
table_51_PCA
##                   
## model5_predict_PCA adenocarcinoma normal
##     adenocarcinoma             15      0
##     normal                      0     15
rf_test_PCA <- sum(apply(table_51_PCA, 2, max))/30
df_acc[2,"RF_PCA"] <- round(rf_test_PCA*100,2)
paste('Accuracy on test data:', rf_test_PCA)
## [1] "Accuracy on test data: 1"

Comments

At first glance we can see that the computing time dropped drastically. Even the slowest model, Random Forest, finishes in a fraction of a second, and the remaining algorithms run almost instantly (the reported 0-second times are simply below the timer's resolution). When it comes to accuracy, once again all algorithms predict the test data with 100% accuracy. Classification of the training data is also good: CLARA and k-means misclassified 4 observations each, Hclust and Random Forest 2 each, and SVM classified everything correctly.

Comparing classification on raw data vs PCA

First, let's print the table of computation times.

df_time %>% kable()
             CLARA   CLARA_PCA   KMEANS   KMEANS_PCA   HCLUST   HCLUST_PCA   SVM    SVM_PCA   RF      RF_PCA
train_time   29.97   0.01        24.27    0            29.19    0.01         4.69   0         77.95   0.05
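
A quick visual comparison could be produced directly from df_time (a sketch, not part of the original analysis; pivot_longer() comes from tidyr, which is loaded with the tidyverse):

df_time %>%
  pivot_longer(everything(), names_to = "model", values_to = "seconds") %>%
  mutate(data = ifelse(grepl("_PCA$", model), "PCA", "raw"),
         model = sub("_PCA$", "", model)) %>%
  ggplot(aes(x = model, y = seconds, fill = data)) +
  geom_col(position = "dodge") +
  labs(title = "Training time: raw data vs PCA") +
  theme_bw()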

As we can see, there is a huge difference between the running times on the raw data and on the data after PCA. Let's look at the accuracy table now.

df_acc %>% kable()
train_test   CLARA    CLARA_PCA   KMEANS   KMEANS_PCA   HCLUST   HCLUST_PCA   SVM   SVM_PCA   RF       RF_PCA
train        96.34    97.56       98.78    97.56        98.17    98.78        100   100       97.56    98.78
test         100.00   100.00      100.00   100.00       100.00   100.00       100   100       100.00   100.00

We can see that training accuracy even improved with PCA for most models; only k-means performs slightly worse after PCA. This is great news: there was no trade-off between computing time and accuracy.

Summary

The goal of this article was to check how dimension reduction with PCA influences the classification procedure. We can clearly see that the running time of the algorithms drops drastically. Making firm statements about accuracy is trickier, but PCA does not appear to have any negative influence on the predictions. The methodology was not perfect: k-fold cross-validation should be performed, and the data itself is unusual, with far more columns than rows. Overall, SVM appears to be the best classification method on this data compared with the others.

Bibliography

https://stanford.edu/~cpiech/cs221/handouts/kmeans.html

https://www.datanovia.com/en/lessons/clara-in-r-clustering-large-applications/

https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/hclust

https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/

https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

https://stackoverflow.com/questions/42325276/how-to-use-pca-on-test-set-code