Principal Component Analysis (PCA) is a technique for reducing the dimensionality of a dataset. Its main benefit is increased interpretability, at the cost of losing some of the information in the dataset. The purpose of this report is to determine how each attribute of the dataset affects the interpretability of the tumor diagnosis. Along with PCA, the t-distributed stochastic neighbour embedding (t-SNE) method is used. The dataset presents attributes of 569 breast tumors, each diagnosed as either malignant (cancerous) or benign (not cancerous).
The dataset was downloaded from Kaggle: https://www.kaggle.com/yasserh/breast-cancer-diagnosis-best-ml-algorithms/data
The following libraries were used:
library(ggfortify)
library(stats)
library(FactoMineR)
library(factoextra)
library(Rtsne)
library(gridExtra)
library(grid)
library(ggplot2)
library(lattice)
br_df <- read.csv("breast-cancer.csv")
The dataset is checked for missing values.
any(is.na(br_df))
## [1] FALSE
No missing entries are present in the dataset.
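A single TRUE/FALSE answer does not show where missing values would sit if there were any. The sketch below counts them per column on a small made-up data frame; `toy` and its values are hypothetical and only illustrate the idea.

```r
# Hypothetical toy data with one missing value (not taken from the tumor data)
toy <- data.frame(radius  = c(12.1, NA, 14.3),
                  texture = c(17.2, 18.9, 20.4))

# Number of missing entries in each column
na_per_column <- colSums(is.na(toy))

# TRUE if at least one entry is missing anywhere in the data frame
any_missing <- anyNA(toy)

print(na_per_column)
print(any_missing)
```

Applied to the report's data, the same check would be `colSums(is.na(br_df))`.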
summary(br_df)
## id diagnosis radius_mean texture_mean
## Min. : 8670 Length:569 Min. : 6.981 Min. : 9.71
## 1st Qu.: 869218 Class :character 1st Qu.:11.700 1st Qu.:16.17
## Median : 906024 Mode :character Median :13.370 Median :18.84
## Mean : 30371831 Mean :14.127 Mean :19.29
## 3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :911320502 Max. :28.110 Max. :39.28
## perimeter_mean area_mean smoothness_mean compactness_mean
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
## concavity_mean concave.points_mean symmetry_mean fractal_dimension_mean
## Min. :0.00000 Min. :0.00000 Min. :0.1060 Min. :0.04996
## 1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.06154 Median :0.03350 Median :0.1792 Median :0.06154
## Mean :0.08880 Mean :0.04892 Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.42680 Max. :0.20120 Max. :0.3040 Max. :0.09744
## radius_se texture_se perimeter_se area_se
## Min. :0.1115 Min. :0.3602 Min. : 0.757 Min. : 6.802
## 1st Qu.:0.2324 1st Qu.:0.8339 1st Qu.: 1.606 1st Qu.: 17.850
## Median :0.3242 Median :1.1080 Median : 2.287 Median : 24.530
## Mean :0.4052 Mean :1.2169 Mean : 2.866 Mean : 40.337
## 3rd Qu.:0.4789 3rd Qu.:1.4740 3rd Qu.: 3.357 3rd Qu.: 45.190
## Max. :2.8730 Max. :4.8850 Max. :21.980 Max. :542.200
## smoothness_se compactness_se concavity_se concave.points_se
## Min. :0.001713 Min. :0.002252 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.005169 1st Qu.:0.013080 1st Qu.:0.01509 1st Qu.:0.007638
## Median :0.006380 Median :0.020450 Median :0.02589 Median :0.010930
## Mean :0.007041 Mean :0.025478 Mean :0.03189 Mean :0.011796
## 3rd Qu.:0.008146 3rd Qu.:0.032450 3rd Qu.:0.04205 3rd Qu.:0.014710
## Max. :0.031130 Max. :0.135400 Max. :0.39600 Max. :0.052790
## symmetry_se fractal_dimension_se radius_worst texture_worst
## Min. :0.007882 Min. :0.0008948 Min. : 7.93 Min. :12.02
## 1st Qu.:0.015160 1st Qu.:0.0022480 1st Qu.:13.01 1st Qu.:21.08
## Median :0.018730 Median :0.0031870 Median :14.97 Median :25.41
## Mean :0.020542 Mean :0.0037949 Mean :16.27 Mean :25.68
## 3rd Qu.:0.023480 3rd Qu.:0.0045580 3rd Qu.:18.79 3rd Qu.:29.72
## Max. :0.078950 Max. :0.0298400 Max. :36.04 Max. :49.54
## perimeter_worst area_worst smoothness_worst compactness_worst
## Min. : 50.41 Min. : 185.2 Min. :0.07117 Min. :0.02729
## 1st Qu.: 84.11 1st Qu.: 515.3 1st Qu.:0.11660 1st Qu.:0.14720
## Median : 97.66 Median : 686.5 Median :0.13130 Median :0.21190
## Mean :107.26 Mean : 880.6 Mean :0.13237 Mean :0.25427
## 3rd Qu.:125.40 3rd Qu.:1084.0 3rd Qu.:0.14600 3rd Qu.:0.33910
## Max. :251.20 Max. :4254.0 Max. :0.22260 Max. :1.05800
## concavity_worst concave.points_worst symmetry_worst fractal_dimension_worst
## Min. :0.0000 Min. :0.00000 Min. :0.1565 Min. :0.05504
## 1st Qu.:0.1145 1st Qu.:0.06493 1st Qu.:0.2504 1st Qu.:0.07146
## Median :0.2267 Median :0.09993 Median :0.2822 Median :0.08004
## Mean :0.2722 Mean :0.11461 Mean :0.2901 Mean :0.08395
## 3rd Qu.:0.3829 3rd Qu.:0.16140 3rd Qu.:0.3179 3rd Qu.:0.09208
## Max. :1.2520 Max. :0.29100 Max. :0.6638 Max. :0.20750
A subset of the breast cancer data frame is created that excludes the id and diagnosis columns; these two columns are stored separately as labels.
br_df_s <- subset(br_df, select = -c(id,diagnosis))
br_labels <- subset(br_df, select = c(id,diagnosis))
Principal components are computed from the correlation matrix (cor = TRUE) to obtain the eigenvalues.
pca <- princomp(br_df_s, cor = TRUE)
summary(pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 3.6443940 2.3856560 1.67867477 1.40735229 1.28402903
## Proportion of Variance 0.4427203 0.1897118 0.09393163 0.06602135 0.05495768
## Cumulative Proportion 0.4427203 0.6324321 0.72636371 0.79238506 0.84734274
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 1.09879780 0.82171778 0.69037464 0.64567392 0.59219377
## Proportion of Variance 0.04024522 0.02250734 0.01588724 0.01389649 0.01168978
## Cumulative Proportion 0.88758796 0.91009530 0.92598254 0.93987903 0.95156881
## Comp.11 Comp.12 Comp.13 Comp.14
## Standard deviation 0.54213992 0.511039500 0.49128148 0.396244525
## Proportion of Variance 0.00979719 0.008705379 0.00804525 0.005233657
## Cumulative Proportion 0.96136600 0.970071383 0.97811663 0.983350291
## Comp.15 Comp.16 Comp.17 Comp.18
## Standard deviation 0.306814219 0.282600072 0.243719178 0.229387845
## Proportion of Variance 0.003137832 0.002662093 0.001979968 0.001753959
## Cumulative Proportion 0.986488123 0.989150216 0.991130184 0.992884143
## Comp.19 Comp.20 Comp.21 Comp.22
## Standard deviation 0.222435590 0.176520261 0.1731268145 0.1656484305
## Proportion of Variance 0.001649253 0.001038647 0.0009990965 0.0009146468
## Cumulative Proportion 0.994533397 0.995572043 0.9965711397 0.9974857865
## Comp.23 Comp.24 Comp.25 Comp.26
## Standard deviation 0.1560155049 0.1343689213 0.1244237573 0.090430304
## Proportion of Variance 0.0008113613 0.0006018336 0.0005160424 0.000272588
## Cumulative Proportion 0.9982971477 0.9988989813 0.9994150237 0.999687612
## Comp.27 Comp.28 Comp.29 Comp.30
## Standard deviation 0.0830690308 3.986650e-02 0.0273642668 1.153451e-02
## Proportion of Variance 0.0002300155 5.297793e-05 0.0000249601 4.434827e-06
## Cumulative Proportion 0.9999176271 9.999706e-01 0.9999955652 1.000000e+00
The two principal components with the largest eigenvalues explain 63 percent of the variance. The first 7 components together explain 91 percent of the variance.
res.pca <- PCA(br_df_s, scale.unit = TRUE, graph = FALSE)
print(res.pca)
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 569 individuals, described by 30 variables
## *The results are available in the following objects:
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$var" "results for the variables"
## 3 "$var$coord" "coord. for the variables"
## 4 "$var$cor" "correlations variables - dimensions"
## 5 "$var$cos2" "cos2 for the variables"
## 6 "$var$contrib" "contributions of the variables"
## 7 "$ind" "results for the individuals"
## 8 "$ind$coord" "coord. for the individuals"
## 9 "$ind$cos2" "cos2 for the individuals"
## 10 "$ind$contrib" "contributions of the individuals"
## 11 "$call" "summary statistics"
## 12 "$call$centre" "mean of the variables"
## 13 "$call$ecart.type" "standard error of the variables"
## 14 "$call$row.w" "weights for the individuals"
## 15 "$call$col.w" "weights for the variables"
get_eigenvalue(res.pca)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 1.328161e+01 4.427203e+01 44.27203
## Dim.2 5.691355e+00 1.897118e+01 63.24321
## Dim.3 2.817949e+00 9.393163e+00 72.63637
## Dim.4 1.980640e+00 6.602135e+00 79.23851
## Dim.5 1.648731e+00 5.495768e+00 84.73427
## Dim.6 1.207357e+00 4.024522e+00 88.75880
## Dim.7 6.752201e-01 2.250734e+00 91.00953
## Dim.8 4.766171e-01 1.588724e+00 92.59825
## Dim.9 4.168948e-01 1.389649e+00 93.98790
## Dim.10 3.506935e-01 1.168978e+00 95.15688
## Dim.11 2.939157e-01 9.797190e-01 96.13660
## Dim.12 2.611614e-01 8.705379e-01 97.00714
## Dim.13 2.413575e-01 8.045250e-01 97.81166
## Dim.14 1.570097e-01 5.233657e-01 98.33503
## Dim.15 9.413497e-02 3.137832e-01 98.64881
## Dim.16 7.986280e-02 2.662093e-01 98.91502
## Dim.17 5.939904e-02 1.979968e-01 99.11302
## Dim.18 5.261878e-02 1.753959e-01 99.28841
## Dim.19 4.947759e-02 1.649253e-01 99.45334
## Dim.20 3.115940e-02 1.038647e-01 99.55720
## Dim.21 2.997289e-02 9.990965e-02 99.65711
## Dim.22 2.743940e-02 9.146468e-02 99.74858
## Dim.23 2.434084e-02 8.113613e-02 99.82971
## Dim.24 1.805501e-02 6.018336e-02 99.88990
## Dim.25 1.548127e-02 5.160424e-02 99.94150
## Dim.26 8.177640e-03 2.725880e-02 99.96876
## Dim.27 6.900464e-03 2.300155e-02 99.99176
## Dim.28 1.589338e-03 5.297793e-03 99.99706
## Dim.29 7.488031e-04 2.496010e-03 99.99956
## Dim.30 1.330448e-04 4.434827e-04 100.00000
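The cumulative percentages in the table above follow directly from the eigenvalues. Because the analysis uses standardized variables, the eigenvalues sum to the number of variables (30), so each component's share is its eigenvalue divided by 30. A small sketch of that arithmetic, using the first seven eigenvalues copied from the table:

```r
# First seven eigenvalues, copied from the get_eigenvalue() output above
eig <- c(13.28161, 5.691355, 2.817949, 1.980640,
         1.648731, 1.207357, 0.6752201)

# With standardized variables the eigenvalues sum to the number of variables
n_vars <- 30

# Cumulative percentage of variance explained by the first k components
cum_pct <- 100 * cumsum(eig) / n_vars
print(round(cum_pct, 2))
# 44.27 63.24 72.64 79.24 84.73 88.76 91.01
```

The second and seventh values reproduce the 63 and 91 percent figures quoted in the text.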
In the variable correlation plot, positively correlated variables point in similar directions and are placed close to each other, while negatively correlated variables point in opposite directions. With 30 variables drawn on one graph, it is difficult to distinguish the individual variables.
fviz_eig(pca)
fviz_cos2(res.pca, choice = "var", axes = 1:2)
fviz_pca_var(res.pca, col.var = "cos2",
             gradient.cols = c("blue", "yellow", "red"),
             repel = TRUE)
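To make the quantities behind the variable plot more concrete, here is a minimal sketch of how variable coordinates and cos2 values can be derived from a standardized PCA. It uses R's built-in USArrests data and `prcomp` purely as a stand-in (an assumption, so the example runs without the tumor file); the names `var_coord`, `var_cor` and `var_cos2` are illustrative.

```r
# Assumption: USArrests (built into R) stands in for the tumor data
p <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates: loadings scaled by the component standard deviations
var_coord <- sweep(p$rotation, 2, p$sdev, `*`)

# For standardized data these equal the correlations between the
# original variables and the component scores
var_cor <- cor(USArrests, p$x)

# cos2: squared coordinate, the share of a variable captured by a component
var_cos2 <- var_coord^2

print(round(var_coord[, 1:2], 3))
print(round(rowSums(var_cos2), 3))  # sums to 1 across all components
```

A variable with a high cos2 on the first two components lies near the edge of the correlation circle, which is why the colour gradient in the plot is tied to cos2.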
The PCA is plotted with observations distinguished by diagnosis:
pca_res <- prcomp(br_df_s, scale. = TRUE)
autoplot(pca_res, data = br_df, colour = 'diagnosis', label = TRUE, label.size = 3)
pc1 <- fviz_contrib(pca, choice = "var", axes = 1)
pc2 <- fviz_contrib(pca, choice = "var", axes = 2)
grid.arrange(pc1, pc2)
In the first dimension, 16 variables contribute more than the expected average contribution (the dashed reference line, 100/30 ≈ 3.3 percent for 30 variables). In the second dimension, 15 variables exceed this level.
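The reference level in a contribution plot is the contribution each variable would have if all variables contributed equally, that is 100 / p percent for p variables. The sketch below computes contributions by hand on the built-in USArrests data (an assumption, chosen only to keep the example self-contained); `contrib`, `expected` and `above_avg_pc1` are illustrative names.

```r
# Assumption: USArrests (built into R) stands in for the tumor data
p <- prcomp(USArrests, scale. = TRUE)

# Contribution of each variable to each component, in percent.
# Loadings have unit length, so squared loadings in a column sum to 1.
contrib <- 100 * p$rotation^2

# Expected average contribution if all variables contributed equally
expected <- 100 / nrow(p$rotation)

# Variables contributing more than average to the first component
above_avg_pc1 <- names(which(contrib[, "PC1"] > expected))

print(round(contrib[, 1:2], 1))
print(expected)
print(above_avg_pc1)
```

For the 30 tumor attributes the same threshold would be 100 / 30, about 3.3 percent.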
The t-SNE technique represents the structure of a dataset in such a way that points that are close in the input space tend to remain close in the low-dimensional space.
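The notion of "closeness" can be illustrated with a simplified sketch of the input similarities: distances are turned into probabilities with a Gaussian kernel, so near neighbours receive most of the probability mass. This is not the Rtsne implementation; in particular the bandwidth `sigma` is fixed here as an assumption, whereas t-SNE tunes it per point to match the chosen perplexity.

```r
# Three toy points: a and b are close together, c is far away
X <- rbind(a = c(0, 0), b = c(0.5, 0), c = c(5, 5))

# Fixed bandwidth (assumption; real t-SNE chooses one per point)
sigma <- 1

# Squared Euclidean distances between all pairs of points
D2 <- as.matrix(dist(X))^2

# Gaussian affinities, excluding self-similarity
P <- exp(-D2 / (2 * sigma^2))
diag(P) <- 0

# Normalize each row so it becomes a probability distribution over neighbours
P <- P / rowSums(P)

print(round(P, 4))
```

Row `a` puts almost all of its probability on `b`; the embedding is then optimized so that similar neighbour probabilities hold in two dimensions.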
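t-SNE works on distances between observations, so the input must be numeric and should ideally be standardized. The step below coerces every column to numeric; it does not rescale the columns.
br_df_z <- as.data.frame(sapply(br_df_s, as.numeric))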
tsne <- Rtsne(br_df_z, dims = 2, perplexity=30, verbose=TRUE, max_iter = 500)
## Performing PCA
## Read the 569 x 30 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.08 seconds (sparsity = 0.189300)!
## Learning embedding...
## Iteration 50: error is 53.342515 (50 iterations in 0.05 seconds)
## Iteration 100: error is 48.791772 (50 iterations in 0.04 seconds)
## Iteration 150: error is 47.991321 (50 iterations in 0.04 seconds)
## Iteration 200: error is 47.711191 (50 iterations in 0.04 seconds)
## Iteration 250: error is 47.624986 (50 iterations in 0.04 seconds)
## Iteration 300: error is 0.365186 (50 iterations in 0.04 seconds)
## Iteration 350: error is 0.291586 (50 iterations in 0.03 seconds)
## Iteration 400: error is 0.280341 (50 iterations in 0.04 seconds)
## Iteration 450: error is 0.273123 (50 iterations in 0.04 seconds)
## Iteration 500: error is 0.267208 (50 iterations in 0.04 seconds)
## Fitting performed in 0.40 seconds.
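For reference, per-column standardization (z-scores) can be done with base R's `scale()`. The sketch below uses a small made-up data frame; `toy` and its values are hypothetical. Applying `scale(br_df_s)` before `Rtsne` would be the analogous step for the tumor data, but that variant was not run here and its output would differ from the log above.

```r
# Hypothetical toy data with columns on very different scales
toy <- data.frame(area       = c(420, 551, 782, 1001),
                  smoothness = c(0.086, 0.096, 0.105, 0.120))

# Center each column to mean 0 and rescale it to standard deviation 1
toy_z <- as.data.frame(scale(toy))

print(round(colMeans(toy_z), 10))
print(sapply(toy_z, sd))
```

After this step no single attribute dominates the distances merely because it is measured in larger units.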
colors <- rainbow(length(unique(br_df$diagnosis)))
names(colors) <- unique(br_df$diagnosis)
par(mgp = c(2.5, 1, 0))
plot(tsne$Y, type = "n", main = "tSNE", xlab = "tSNE dimension 1", ylab = "tSNE dimension 2", cex.main = 2, cex.lab = 1.5)
text(tsne$Y, labels = br_df$diagnosis, col = colors[br_df$diagnosis])
The red labels represent malignant (cancerous) tumors, and the blue ones represent benign (non-cancerous) tumors. The graph above suggests that the two types are largely distinguishable.
Conclusion
PCA and t-SNE show that the tumors can be distinguished well on the basis of the provided attributes. Only 7 of the 30 principal components are needed to explain 91 percent of the variance. Dimensionality reduction performs well in the analysis of datasets with many variables.