Breast Cancer PCA

Principal Component Analysis (PCA) is a technique for reducing the dimensionality of a dataset. Its main benefit is improved interpretability, at the cost of losing some of the information in the data. The purpose of this report is to determine how the measured attributes contribute to distinguishing tumor types. Alongside PCA, t-distributed stochastic neighbor embedding (t-SNE) is used. The dataset describes 569 breast tumors, each diagnosed as malignant (cancerous) or benign (not cancerous).

The dataset was downloaded from Kaggle: https://www.kaggle.com/yasserh/breast-cancer-diagnosis-best-ml-algorithms/data

The following libraries were used:

library(ggfortify)
library(stats)
library(FactoMineR)
library(factoextra)
library(Rtsne)
library(gridExtra)
library(grid)
library(ggplot2)
library(lattice)

br_df <- read.csv("breast-cancer.csv")

The dataset is checked for missing values.

any(is.na(br_df))

## [1] FALSE

No missing entries are present in the dataset.

summary(br_df)

##        id             diagnosis          radius_mean      texture_mean  
##  Min.   :     8670   Length:569         Min.   : 6.981   Min.   : 9.71  
##  1st Qu.:   869218   Class :character   1st Qu.:11.700   1st Qu.:16.17  
##  Median :   906024   Mode  :character   Median :13.370   Median :18.84  
##  Mean   : 30371831                      Mean   :14.127   Mean   :19.29  
##  3rd Qu.:  8813129                      3rd Qu.:15.780   3rd Qu.:21.80  
##  Max.   :911320502                      Max.   :28.110   Max.   :39.28  
##  perimeter_mean     area_mean      smoothness_mean   compactness_mean 
##  Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
##  1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
##  Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
##  Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
##  3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
##  Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
##  concavity_mean    concave.points_mean symmetry_mean    fractal_dimension_mean
##  Min.   :0.00000   Min.   :0.00000     Min.   :0.1060   Min.   :0.04996       
##  1st Qu.:0.02956   1st Qu.:0.02031     1st Qu.:0.1619   1st Qu.:0.05770       
##  Median :0.06154   Median :0.03350     Median :0.1792   Median :0.06154       
##  Mean   :0.08880   Mean   :0.04892     Mean   :0.1812   Mean   :0.06280       
##  3rd Qu.:0.13070   3rd Qu.:0.07400     3rd Qu.:0.1957   3rd Qu.:0.06612       
##  Max.   :0.42680   Max.   :0.20120     Max.   :0.3040   Max.   :0.09744       
##    radius_se        texture_se      perimeter_se       area_se       
##  Min.   :0.1115   Min.   :0.3602   Min.   : 0.757   Min.   :  6.802  
##  1st Qu.:0.2324   1st Qu.:0.8339   1st Qu.: 1.606   1st Qu.: 17.850  
##  Median :0.3242   Median :1.1080   Median : 2.287   Median : 24.530  
##  Mean   :0.4052   Mean   :1.2169   Mean   : 2.866   Mean   : 40.337  
##  3rd Qu.:0.4789   3rd Qu.:1.4740   3rd Qu.: 3.357   3rd Qu.: 45.190  
##  Max.   :2.8730   Max.   :4.8850   Max.   :21.980   Max.   :542.200  
##  smoothness_se      compactness_se      concavity_se     concave.points_se 
##  Min.   :0.001713   Min.   :0.002252   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.005169   1st Qu.:0.013080   1st Qu.:0.01509   1st Qu.:0.007638  
##  Median :0.006380   Median :0.020450   Median :0.02589   Median :0.010930  
##  Mean   :0.007041   Mean   :0.025478   Mean   :0.03189   Mean   :0.011796  
##  3rd Qu.:0.008146   3rd Qu.:0.032450   3rd Qu.:0.04205   3rd Qu.:0.014710  
##  Max.   :0.031130   Max.   :0.135400   Max.   :0.39600   Max.   :0.052790  
##   symmetry_se       fractal_dimension_se  radius_worst   texture_worst  
##  Min.   :0.007882   Min.   :0.0008948    Min.   : 7.93   Min.   :12.02  
##  1st Qu.:0.015160   1st Qu.:0.0022480    1st Qu.:13.01   1st Qu.:21.08  
##  Median :0.018730   Median :0.0031870    Median :14.97   Median :25.41  
##  Mean   :0.020542   Mean   :0.0037949    Mean   :16.27   Mean   :25.68  
##  3rd Qu.:0.023480   3rd Qu.:0.0045580    3rd Qu.:18.79   3rd Qu.:29.72  
##  Max.   :0.078950   Max.   :0.0298400    Max.   :36.04   Max.   :49.54  
##  perimeter_worst    area_worst     smoothness_worst  compactness_worst
##  Min.   : 50.41   Min.   : 185.2   Min.   :0.07117   Min.   :0.02729  
##  1st Qu.: 84.11   1st Qu.: 515.3   1st Qu.:0.11660   1st Qu.:0.14720  
##  Median : 97.66   Median : 686.5   Median :0.13130   Median :0.21190  
##  Mean   :107.26   Mean   : 880.6   Mean   :0.13237   Mean   :0.25427  
##  3rd Qu.:125.40   3rd Qu.:1084.0   3rd Qu.:0.14600   3rd Qu.:0.33910  
##  Max.   :251.20   Max.   :4254.0   Max.   :0.22260   Max.   :1.05800  
##  concavity_worst  concave.points_worst symmetry_worst   fractal_dimension_worst
##  Min.   :0.0000   Min.   :0.00000      Min.   :0.1565   Min.   :0.05504        
##  1st Qu.:0.1145   1st Qu.:0.06493      1st Qu.:0.2504   1st Qu.:0.07146        
##  Median :0.2267   Median :0.09993      Median :0.2822   Median :0.08004        
##  Mean   :0.2722   Mean   :0.11461      Mean   :0.2901   Mean   :0.08395        
##  3rd Qu.:0.3829   3rd Qu.:0.16140      3rd Qu.:0.3179   3rd Qu.:0.09208        
##  Max.   :1.2520   Max.   :0.29100      Max.   :0.6638   Max.   :0.20750

A subset of the breast cancer data frame is created without the id and diagnosis columns. Those two columns are kept separately as labels.

br_df_s <- subset(br_df, select = -c(id,diagnosis))

br_labels <- subset(br_df, select = c(id,diagnosis))

Principal components are computed from the correlation matrix (cor = TRUE) to obtain the eigenvalues.

pca <- princomp(br_df_s, cor = TRUE)
summary(pca)

## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     3.6443940 2.3856560 1.67867477 1.40735229 1.28402903
## Proportion of Variance 0.4427203 0.1897118 0.09393163 0.06602135 0.05495768
## Cumulative Proportion  0.4427203 0.6324321 0.72636371 0.79238506 0.84734274
##                            Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
## Standard deviation     1.09879780 0.82171778 0.69037464 0.64567392 0.59219377
## Proportion of Variance 0.04024522 0.02250734 0.01588724 0.01389649 0.01168978
## Cumulative Proportion  0.88758796 0.91009530 0.92598254 0.93987903 0.95156881
##                           Comp.11     Comp.12    Comp.13     Comp.14
## Standard deviation     0.54213992 0.511039500 0.49128148 0.396244525
## Proportion of Variance 0.00979719 0.008705379 0.00804525 0.005233657
## Cumulative Proportion  0.96136600 0.970071383 0.97811663 0.983350291
##                            Comp.15     Comp.16     Comp.17     Comp.18
## Standard deviation     0.306814219 0.282600072 0.243719178 0.229387845
## Proportion of Variance 0.003137832 0.002662093 0.001979968 0.001753959
## Cumulative Proportion  0.986488123 0.989150216 0.991130184 0.992884143
##                            Comp.19     Comp.20      Comp.21      Comp.22
## Standard deviation     0.222435590 0.176520261 0.1731268145 0.1656484305
## Proportion of Variance 0.001649253 0.001038647 0.0009990965 0.0009146468
## Cumulative Proportion  0.994533397 0.995572043 0.9965711397 0.9974857865
##                             Comp.23      Comp.24      Comp.25     Comp.26
## Standard deviation     0.1560155049 0.1343689213 0.1244237573 0.090430304
## Proportion of Variance 0.0008113613 0.0006018336 0.0005160424 0.000272588
## Cumulative Proportion  0.9982971477 0.9988989813 0.9994150237 0.999687612
##                             Comp.27      Comp.28      Comp.29      Comp.30
## Standard deviation     0.0830690308 3.986650e-02 0.0273642668 1.153451e-02
## Proportion of Variance 0.0002300155 5.297793e-05 0.0000249601 4.434827e-06
## Cumulative Proportion  0.9999176271 9.999706e-01 0.9999955652 1.000000e+00

The two principal components with the largest eigenvalues explain 63 percent of the variance. The first 7 components together explain 91 percent.
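The percentages can be checked by hand: each eigenvalue is the squared standard deviation of its component, and the proportion of variance is that eigenvalue divided by the sum of all eigenvalues. The sketch below uses the standard deviations copied from the summary(pca) output above; the variable names and the 90 percent threshold are illustrative assumptions, not part of the original analysis.

```r
# Standard deviations copied from the summary(pca) output above
sdev <- c(3.6443940, 2.3856560, 1.67867477, 1.40735229, 1.28402903,
          1.09879780, 0.82171778, 0.69037464, 0.64567392, 0.59219377,
          0.54213992, 0.511039500, 0.49128148, 0.396244525, 0.306814219,
          0.282600072, 0.243719178, 0.229387845, 0.222435590, 0.176520261,
          0.1731268145, 0.1656484305, 0.1560155049, 0.1343689213, 0.1244237573,
          0.090430304, 0.0830690308, 3.986650e-02, 0.0273642668, 1.153451e-02)

# Eigenvalues are the squared standard deviations
eigenvalues <- sdev^2

# Share of the total variance carried by each component
prop_var <- eigenvalues / sum(eigenvalues)
cum_var <- cumsum(prop_var)

# Smallest number of components reaching the (assumed) 90 percent threshold
n_components <- which(cum_var >= 0.90)[1]

round(cum_var[c(2, 7)], 4)
n_components
```

Because the PCA is based on the correlation matrix of 30 variables, the eigenvalues sum to 30, which is why the first eigenvalue of about 13.28 corresponds to roughly 44 percent of the variance.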

res.pca <- PCA(br_df_s, scale.unit = TRUE, graph = FALSE)

print(res.pca)

## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 569 individuals, described by 30 variables
## *The results are available in the following objects:
## 
##    name               description                          
## 1  "$eig"             "eigenvalues"                        
## 2  "$var"             "results for the variables"          
## 3  "$var$coord"       "coord. for the variables"           
## 4  "$var$cor"         "correlations variables - dimensions"
## 5  "$var$cos2"        "cos2 for the variables"             
## 6  "$var$contrib"     "contributions of the variables"     
## 7  "$ind"             "results for the individuals"        
## 8  "$ind$coord"       "coord. for the individuals"         
## 9  "$ind$cos2"        "cos2 for the individuals"           
## 10 "$ind$contrib"     "contributions of the individuals"   
## 11 "$call"            "summary statistics"                 
## 12 "$call$centre"     "mean of the variables"              
## 13 "$call$ecart.type" "standard error of the variables"    
## 14 "$call$row.w"      "weights for the individuals"        
## 15 "$call$col.w"      "weights for the variables"

get_eigenvalue(res.pca)

##          eigenvalue variance.percent cumulative.variance.percent
## Dim.1  1.328161e+01     4.427203e+01                    44.27203
## Dim.2  5.691355e+00     1.897118e+01                    63.24321
## Dim.3  2.817949e+00     9.393163e+00                    72.63637
## Dim.4  1.980640e+00     6.602135e+00                    79.23851
## Dim.5  1.648731e+00     5.495768e+00                    84.73427
## Dim.6  1.207357e+00     4.024522e+00                    88.75880
## Dim.7  6.752201e-01     2.250734e+00                    91.00953
## Dim.8  4.766171e-01     1.588724e+00                    92.59825
## Dim.9  4.168948e-01     1.389649e+00                    93.98790
## Dim.10 3.506935e-01     1.168978e+00                    95.15688
## Dim.11 2.939157e-01     9.797190e-01                    96.13660
## Dim.12 2.611614e-01     8.705379e-01                    97.00714
## Dim.13 2.413575e-01     8.045250e-01                    97.81166
## Dim.14 1.570097e-01     5.233657e-01                    98.33503
## Dim.15 9.413497e-02     3.137832e-01                    98.64881
## Dim.16 7.986280e-02     2.662093e-01                    98.91502
## Dim.17 5.939904e-02     1.979968e-01                    99.11302
## Dim.18 5.261878e-02     1.753959e-01                    99.28841
## Dim.19 4.947759e-02     1.649253e-01                    99.45334
## Dim.20 3.115940e-02     1.038647e-01                    99.55720
## Dim.21 2.997289e-02     9.990965e-02                    99.65711
## Dim.22 2.743940e-02     9.146468e-02                    99.74858
## Dim.23 2.434084e-02     8.113613e-02                    99.82971
## Dim.24 1.805501e-02     6.018336e-02                    99.88990
## Dim.25 1.548127e-02     5.160424e-02                    99.94150
## Dim.26 8.177640e-03     2.725880e-02                    99.96876
## Dim.27 6.900464e-03     2.300155e-02                    99.99176
## Dim.28 1.589338e-03     5.297793e-03                    99.99706
## Dim.29 7.488031e-04     2.496010e-03                    99.99956
## Dim.30 1.330448e-04     4.434827e-04                   100.00000

In the variable correlation plot, positively correlated variables are placed close to each other, while negatively correlated variables point in opposite directions. With 30 variables on one graph, it is difficult to tell them apart.

fviz_eig(pca)

fviz_cos2(res.pca, choice = "var", axes = 1:2)

fviz_pca_var(res.pca, col.var = "cos2",
             gradient.cols = c("blue", "yellow", "red"), 
             repel = TRUE)
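The reason correlated variables end up close together on this plot is that each variable's coordinates are its correlations with the principal components. The following is a minimal sketch of that relationship using base R only. It runs on the built-in USArrests data as a stand-in (an assumption made so the example is self-contained), not on the breast cancer data.

```r
# Stand-in data: built-in USArrests instead of br_df_s
X <- USArrests

# PCA on standardized variables, equivalent to using the correlation matrix
p <- prcomp(X, scale. = TRUE)

# Variable coordinates: loadings scaled by the component standard deviations
coords <- sweep(p$rotation, 2, p$sdev, `*`)

# Correlations between the original variables and the component scores
cors <- cor(X, p$x)

# The two matrices agree up to numerical precision
all.equal(coords, cors, check.attributes = FALSE)
```

Two variables that correlate strongly with the same components therefore get similar coordinates, and variables with opposite-signed correlations land on opposite sides of the origin.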

The PCA is recomputed and the observations are colored by diagnosis:

pca_res <- prcomp(br_df_s, scale. = TRUE)

autoplot(pca_res, data = br_df, colour = 'diagnosis', label = TRUE, label.size = 3)

pc1 <- fviz_contrib(pca, choice = "var", axes = 1)
pc2 <- fviz_contrib(pca, choice = "var", axes = 2)
grid.arrange(pc1, pc2)

In the first dimension, 16 variables contribute more than the expected average contribution. In the second dimension, 15 variables exceed that level.
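The reference line in the contribution plots is the contribution each variable would have if all variables contributed equally, that is 100 divided by the number of variables. A minimal sketch of that calculation in base R follows. It uses the built-in USArrests data as a stand-in (an assumption for self-containment), and the object names are illustrative.

```r
# Stand-in data: built-in USArrests instead of br_df_s
p <- prcomp(USArrests, scale. = TRUE)

# Contribution (in percent) of each variable to each component:
# squared loadings, since every loading vector has unit length
contrib <- 100 * p$rotation^2

# Expected contribution if all variables contributed equally
expected <- 100 / ncol(USArrests)

# Variables above the reference level on the first component
above_pc1 <- rownames(contrib)[contrib[, 1] > expected]

colSums(contrib)  # each column sums to 100
above_pc1
```

For the 30 breast cancer variables the same reference level would be 100 / 30, or about 3.33 percent.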

t-SNE represents the structure of a dataset so that points that are close in the input space tend to remain close in the low-dimensional space.

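t-SNE is based on distances between points, so the variables are usually standardized first. The step below only converts every column to numeric; it does not rescale the data.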

br_df_z <- as.data.frame(sapply(br_df_s, as.numeric))
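The conversion above changes only the column types. If z-score standardization is wanted, base R's scale() centers each column to mean 0 and rescales it to standard deviation 1. The sketch below demonstrates this on the built-in USArrests data as a stand-in (an assumption for self-containment). Applying it to br_df_s would change the t-SNE result, so the log shown in this report would no longer match.

```r
# Stand-in data: built-in USArrests instead of br_df_s
raw <- USArrests

# scale() returns a matrix; convert back to a data frame
z <- as.data.frame(scale(raw))

# After standardization every column has mean ~0 and standard deviation 1
round(colMeans(z), 10)
sapply(z, sd)

# Hypothetical use on the report's data (not run here):
# br_df_z <- as.data.frame(scale(br_df_s))
```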

tsne <- Rtsne(br_df_z, dims = 2, perplexity=30, verbose=TRUE, max_iter = 500)

## Performing PCA
## Read the 569 x 30 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.08 seconds (sparsity = 0.189300)!
## Learning embedding...
## Iteration 50: error is 53.342515 (50 iterations in 0.05 seconds)
## Iteration 100: error is 48.791772 (50 iterations in 0.04 seconds)
## Iteration 150: error is 47.991321 (50 iterations in 0.04 seconds)
## Iteration 200: error is 47.711191 (50 iterations in 0.04 seconds)
## Iteration 250: error is 47.624986 (50 iterations in 0.04 seconds)
## Iteration 300: error is 0.365186 (50 iterations in 0.04 seconds)
## Iteration 350: error is 0.291586 (50 iterations in 0.03 seconds)
## Iteration 400: error is 0.280341 (50 iterations in 0.04 seconds)
## Iteration 450: error is 0.273123 (50 iterations in 0.04 seconds)
## Iteration 500: error is 0.267208 (50 iterations in 0.04 seconds)
## Fitting performed in 0.40 seconds.

colors <- rainbow(length(unique(br_df$diagnosis)))
names(colors) <- unique(br_df$diagnosis)
par(mgp = c(2.5, 1, 0))
plot(tsne$Y, type = "n", main = "tSNE", xlab = "tSNE dimension 1", ylab = "tSNE dimension 2", cex.main = 2, cex.lab = 1.5)
text(tsne$Y, labels = br_df$diagnosis, col = colors[br_df$diagnosis])

The red points represent malignant (cancerous) tumors and the blue ones represent benign (non-cancerous) tumors. The graph above suggests that the two types are largely distinguishable.

Conclusion

PCA and t-SNE both indicate that the tumors can be distinguished well using the provided attributes. Only 7 principal components, rather than all 30 variables, are needed to explain 91 percent of the variance. Dimensionality reduction works well for analyzing large datasets.

Rafal Misiorski

28 02 2022