Tumor Classification via Unsupervised Learning Methods

Problem

Construct a classifier that predicts whether a tumor is malignant (M) or benign (B), using unsupervised learning methods and the data in the accompanying file data.csv. It is known that benign observations outnumber malignant ones.

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

Data

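data.csv contains data on 569 breast masses. Apart from id, each quantity listed below appears in three columns, with the suffixes _mean, _se and _worst, which gives 30 predictors in total. The quantities are: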

1	id	observation identification label
2	radius	mean of distances from center to points on the perimeter
3	texture	standard deviation of gray-scale values
4	perimeter	perimeter of the mass
5	area	area of the mass
6	smoothness	local variation in radius lengths
7	compactness	\(\dfrac{\text{perimeter}^2}{\text{area}} - 1\)
8	concavity	severity of concave portions of the contour
9	concave points	number of concave portions of the contour
10	symmetry	symmetry of the mass
11	fractal dimension	“coastline approximation” - 1

One possible solution

data <- read.csv("data.csv") #loading the data
id <- data$id # saving the id column, since we need it for the result data frame
data <- data[, -c(1)] #removing the id column from our data
summary(data)

##   radius_mean      texture_mean   perimeter_mean     area_mean     
##  Min.   : 6.981   Min.   : 9.71   Min.   : 43.79   Min.   : 143.5  
##  1st Qu.:11.700   1st Qu.:16.17   1st Qu.: 75.17   1st Qu.: 420.3  
##  Median :13.370   Median :18.84   Median : 86.24   Median : 551.1  
##  Mean   :14.127   Mean   :19.29   Mean   : 91.97   Mean   : 654.9  
##  3rd Qu.:15.780   3rd Qu.:21.80   3rd Qu.:104.10   3rd Qu.: 782.7  
##  Max.   :28.110   Max.   :39.28   Max.   :188.50   Max.   :2501.0  
##  smoothness_mean   compactness_mean  concavity_mean    concave.points_mean
##  Min.   :0.05263   Min.   :0.01938   Min.   :0.00000   Min.   :0.00000    
##  1st Qu.:0.08637   1st Qu.:0.06492   1st Qu.:0.02956   1st Qu.:0.02031    
##  Median :0.09587   Median :0.09263   Median :0.06154   Median :0.03350    
##  Mean   :0.09636   Mean   :0.10434   Mean   :0.08880   Mean   :0.04892    
##  3rd Qu.:0.10530   3rd Qu.:0.13040   3rd Qu.:0.13070   3rd Qu.:0.07400    
##  Max.   :0.16340   Max.   :0.34540   Max.   :0.42680   Max.   :0.20120    
##  symmetry_mean    fractal_dimension_mean   radius_se        texture_se    
##  Min.   :0.1060   Min.   :0.04996        Min.   :0.1115   Min.   :0.3602  
##  1st Qu.:0.1619   1st Qu.:0.05770        1st Qu.:0.2324   1st Qu.:0.8339  
##  Median :0.1792   Median :0.06154        Median :0.3242   Median :1.1080  
##  Mean   :0.1812   Mean   :0.06280        Mean   :0.4052   Mean   :1.2169  
##  3rd Qu.:0.1957   3rd Qu.:0.06612        3rd Qu.:0.4789   3rd Qu.:1.4740  
##  Max.   :0.3040   Max.   :0.09744        Max.   :2.8730   Max.   :4.8850  
##   perimeter_se       area_se        smoothness_se      compactness_se    
##  Min.   : 0.757   Min.   :  6.802   Min.   :0.001713   Min.   :0.002252  
##  1st Qu.: 1.606   1st Qu.: 17.850   1st Qu.:0.005169   1st Qu.:0.013080  
##  Median : 2.287   Median : 24.530   Median :0.006380   Median :0.020450  
##  Mean   : 2.866   Mean   : 40.337   Mean   :0.007041   Mean   :0.025478  
##  3rd Qu.: 3.357   3rd Qu.: 45.190   3rd Qu.:0.008146   3rd Qu.:0.032450  
##  Max.   :21.980   Max.   :542.200   Max.   :0.031130   Max.   :0.135400  
##   concavity_se     concave.points_se   symmetry_se       fractal_dimension_se
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.007882   Min.   :0.0008948   
##  1st Qu.:0.01509   1st Qu.:0.007638   1st Qu.:0.015160   1st Qu.:0.0022480   
##  Median :0.02589   Median :0.010930   Median :0.018730   Median :0.0031870   
##  Mean   :0.03189   Mean   :0.011796   Mean   :0.020542   Mean   :0.0037949   
##  3rd Qu.:0.04205   3rd Qu.:0.014710   3rd Qu.:0.023480   3rd Qu.:0.0045580   
##  Max.   :0.39600   Max.   :0.052790   Max.   :0.078950   Max.   :0.0298400   
##   radius_worst   texture_worst   perimeter_worst    area_worst    
##  Min.   : 7.93   Min.   :12.02   Min.   : 50.41   Min.   : 185.2  
##  1st Qu.:13.01   1st Qu.:21.08   1st Qu.: 84.11   1st Qu.: 515.3  
##  Median :14.97   Median :25.41   Median : 97.66   Median : 686.5  
##  Mean   :16.27   Mean   :25.68   Mean   :107.26   Mean   : 880.6  
##  3rd Qu.:18.79   3rd Qu.:29.72   3rd Qu.:125.40   3rd Qu.:1084.0  
##  Max.   :36.04   Max.   :49.54   Max.   :251.20   Max.   :4254.0  
##  smoothness_worst  compactness_worst concavity_worst  concave.points_worst
##  Min.   :0.07117   Min.   :0.02729   Min.   :0.0000   Min.   :0.00000     
##  1st Qu.:0.11660   1st Qu.:0.14720   1st Qu.:0.1145   1st Qu.:0.06493     
##  Median :0.13130   Median :0.21190   Median :0.2267   Median :0.09993     
##  Mean   :0.13237   Mean   :0.25427   Mean   :0.2722   Mean   :0.11461     
##  3rd Qu.:0.14600   3rd Qu.:0.33910   3rd Qu.:0.3829   3rd Qu.:0.16140     
##  Max.   :0.22260   Max.   :1.05800   Max.   :1.2520   Max.   :0.29100     
##  symmetry_worst   fractal_dimension_worst
##  Min.   :0.1565   Min.   :0.05504        
##  1st Qu.:0.2504   1st Qu.:0.07146        
##  Median :0.2822   Median :0.08004        
##  Mean   :0.2901   Mean   :0.08395        
##  3rd Qu.:0.3179   3rd Qu.:0.09208        
##  Max.   :0.6638   Max.   :0.20750

set.seed(123)
# our data need to be scaled, especially for the PCA algorithm
data.scaled <- as.data.frame(scale(data))
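
To see what scale does, here is a minimal sketch on two columns of the built-in mtcars data (a stand-in chosen for illustration, not the tumor data). After scaling, every column has mean 0 and standard deviation 1, so a column measured in large units, such as area_mean above, can no longer dominate the analysis merely because of its units.

```r
# Toy illustration on built-in data (mtcars), not on data.csv
toy <- mtcars[, c("mpg", "disp")]        # two columns on very different scales
print(sapply(toy, sd))                   # raw standard deviations differ widely

toy.scaled <- as.data.frame(scale(toy))  # center each column, divide by its sd
col.means <- colMeans(toy.scaled)
col.sds <- sapply(toy.scaled, sd)
print(round(col.means, 10))              # numerically zero
print(col.sds)                           # one for every column
```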

Principal Component Analysis (PCA)

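PCA is an algorithm for dimension reduction. It returns a linear transformation of our data such that as much of the variance as possible is preserved in fewer predictors (the principal components). Because the components are driven by variance, PCA should be performed on scaled data; otherwise the predictors with the largest units dominate.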
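
As a rough sketch of what happens under the hood, PCA on scaled data amounts to an eigendecomposition of the covariance matrix: the eigenvectors give the new directions and the eigenvalues give the variance along each of them. The example uses the built-in iris measurements as stand-in data (an assumption for illustration), and all variable names are made up for this sketch.

```r
# Stand-in data: the four numeric columns of the built-in iris data set
X <- scale(iris[, 1:4])              # center and scale each column

e <- eigen(cov(X))                   # eigendecomposition of the covariance matrix
scores <- X %*% e$vectors            # data expressed in the new coordinates
prop.var <- e$values / sum(e$values) # proportion of variance per component
print(round(prop.var, 3))

# Cross-check against R's built-in PCA on the same scaled data
pc.check <- prcomp(X)
print(all.equal(pc.check$sdev^2, e$values))
```

Note that princomp divides by n rather than n - 1 when computing the covariance matrix, so its standard deviations are slightly smaller than those of prcomp, while the proportions of variance are the same.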

library(factoextra) # used for pretty plots

## Loading required package: ggplot2


pc <- princomp(data.scaled) # pca model
summary(pc)

## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     3.6411901 2.3835587 1.67719901 1.40611506 1.28290021
## Proportion of Variance 0.4427203 0.1897118 0.09393163 0.06602135 0.05495768
## Cumulative Proportion  0.4427203 0.6324321 0.72636371 0.79238506 0.84734274
##                            Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
## Standard deviation     1.09783183 0.82099539 0.68976771 0.64510630 0.59167316
## Proportion of Variance 0.04024522 0.02250734 0.01588724 0.01389649 0.01168978
## Cumulative Proportion  0.88758796 0.91009530 0.92598254 0.93987903 0.95156881
##                           Comp.11     Comp.12    Comp.13     Comp.14
## Standard deviation     0.54166332 0.510590234 0.49084959 0.395896178
## Proportion of Variance 0.00979719 0.008705379 0.00804525 0.005233657
## Cumulative Proportion  0.96136600 0.970071383 0.97811663 0.983350291
##                            Comp.15     Comp.16     Comp.17     Comp.18
## Standard deviation     0.306544492 0.282351633 0.243504919 0.229186185
## Proportion of Variance 0.003137832 0.002662093 0.001979968 0.001753959
## Cumulative Proportion  0.986488123 0.989150216 0.991130184 0.992884143
##                            Comp.19     Comp.20      Comp.21      Comp.22
## Standard deviation     0.222240042 0.176365078 0.1729746151 0.1655028055
## Proportion of Variance 0.001649253 0.001038647 0.0009990965 0.0009146468
## Cumulative Proportion  0.994533397 0.995572043 0.9965711397 0.9974857865
##                             Comp.23      Comp.24      Comp.25     Comp.26
## Standard deviation     0.1558783484 0.1342507947 0.1243143737 0.090350805
## Proportion of Variance 0.0008113613 0.0006018336 0.0005160424 0.000272588
## Cumulative Proportion  0.9982971477 0.9988989813 0.9994150237 0.999687612
##                             Comp.27      Comp.28      Comp.29      Comp.30
## Standard deviation     0.0829960030 3.983145e-02 0.0273402103 1.152437e-02
## Proportion of Variance 0.0002300155 5.297793e-05 0.0000249601 4.434827e-06
## Cumulative Proportion  0.9999176271 9.999706e-01 0.9999955652 1.000000e+00

fviz_eig(pc) # scree plot: percentage of variance explained by each component

data.PCA <- data.frame(pc$scores[, 1:4]) # keep the scores on the first 4 components

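In the summary and plot of the PCA model we can see that the first 4 principal components explain about 79% of the variance (the cumulative proportion at Comp.4 is 0.792; a fifth component would raise it to about 85%). That is the reason why we will continue to work on data.PCA, which reduces 30 predictors to 4.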
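
The cumulative proportions in the summary can be reproduced from the component standard deviations: square them to get variances, divide by the total, and accumulate. The sketch below hard-codes the standard deviations printed by summary(pc) so that it runs on its own; with the fitted model at hand, pc$sdev should hold the same values.

```r
# Standard deviations of the 30 components, copied from the summary(pc) output
sdev <- c(3.6411901, 2.3835587, 1.67719901, 1.40611506, 1.28290021,
          1.09783183, 0.82099539, 0.68976771, 0.64510630, 0.59167316,
          0.54166332, 0.510590234, 0.49084959, 0.395896178,
          0.306544492, 0.282351633, 0.243504919, 0.229186185,
          0.222240042, 0.176365078, 0.1729746151, 0.1655028055,
          0.1558783484, 0.1342507947, 0.1243143737, 0.090350805,
          0.0829960030, 3.983145e-02, 0.0273402103, 1.152437e-02)

var.explained <- sdev^2 / sum(sdev^2)  # proportion of variance per component
cum.var <- cumsum(var.explained)       # cumulative proportion
print(round(cum.var[1:5], 4))

# Smallest number of components whose cumulative proportion reaches 80%
n.comp.80 <- which(cum.var >= 0.80)[1]
print(n.comp.80)
```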

K-means

We can compute k-means in R with the kmeans function. Here, we will group the data into two clusters (centers = 2).

First, we check that k = 2, the number of clusters we need (a mass is either B or M), is also supported by the data, using the silhouette method. The silhouette width of an observation measures how well it fits its own cluster compared with the nearest other cluster: values close to 1 indicate a good fit, values near 0 an observation lying between two clusters, and negative values a likely misassignment. Unlike the within-cluster distance, which generally keeps shrinking as k grows, the average silhouette width can be compared across different values of k. The silhouette function in the cluster package computes the widths. The code below does this for 2 to 10 clusters, and the results show that 2 clusters maximize the average silhouette width.

Furthermore, the kmeans function has an nstart option that tries multiple random initial configurations and returns the best one. E.g., nstart = 20 generates 20 initial configurations. This is often recommended, because a single k-means run can converge to a poor local optimum.
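
A minimal sketch of nstart on simulated data (two made-up, well-separated groups, not the tumor data): kmeans is run from 20 random starts and keeps the solution with the smallest total within-cluster sum of squares, which is available as tot.withinss.

```r
set.seed(1)
# Simulated stand-in data: two groups of 50 points around (0, 0) and (10, 10)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 10), ncol = 2))

fit <- kmeans(toy, centers = 2, nstart = 20)  # best of 20 random starts
print(fit$size)          # number of observations in each cluster
print(fit$tot.withinss)  # objective value of the best start
```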

# library(cluster)
# 
# avg.sil <- rep(NA_real_, 10) # avg.sil[k]: average silhouette width for k clusters
# d <- dist(data.scaled) # pairwise distances, computed once
# for (k in 2:10) {
#   model <- kmeans(data.scaled, centers = k, nstart = 20)
#   sil <- silhouette(model$cluster, dist = d)
#   avg.sil[k] <- mean(sil[, 3]) # column 3 holds the silhouette widths
# }
# plot(avg.sil, type = 'b')
# abline(v = which.max(avg.sil), lty = 2)

# the commented code above can be replaced by a single call to the
# fviz_nbclust function from the factoextra library; we pass the scaled
# data so that the result matches the commented code
fviz_nbclust(data.scaled, kmeans, method = "silhouette")
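
For reference, the silhouette width of observation i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is its mean distance to the other members of its own cluster and b(i) is the smallest mean distance to the members of any other cluster. The base-R function below is a sketch of that definition on a tiny made-up example; silhouette_width is a hypothetical helper written for this illustration, not part of the cluster package.

```r
# Hypothetical helper: silhouette width of every observation, from the definition
silhouette_width <- function(x, cluster) {
  d <- as.matrix(dist(x))                      # pairwise Euclidean distances
  sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)               # convention for singleton clusters
    a <- sum(d[i, own]) / (sum(own) - 1)       # mean distance within own cluster
    others <- setdiff(unique(cluster), cluster[i])
    b <- min(sapply(others, function(k) mean(d[i, cluster == k])))
    (b - a) / max(a, b)
  })
}

x <- c(1, 2, 3, 10, 11, 12)                    # made-up one-dimensional data
good <- silhouette_width(x, c(1, 1, 1, 2, 2, 2))  # labels follow the two groups
bad  <- silhouette_width(x, c(1, 2, 1, 2, 1, 2))  # labels ignore the groups
print(mean(good))
print(mean(bad))
```

The labeling that follows the two visible groups gets the higher average width, which is the quantity the silhouette method maximizes over k.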

After computing our k-means model, we relabel the clusters, since we want the output in the form M/B rather than 1/2. k-means numbers its clusters arbitrarily, so here we use the information that there are more benign observations: the larger cluster is labeled B.

model.kmeans.PCA <- kmeans(data.PCA, centers = 2, nstart = 20) # fit the k-means model

# model.kmeans.PCA$size is a vector of length 2 holding the cluster sizes:
# the first entry counts the observations labeled 1, the second those labeled 2.
# The larger cluster is taken to be benign (on a tie, cluster 1).
benign <- which.max(model.kmeans.PCA$size)
model.kmeans.PCA$cluster <- ifelse(model.kmeans.PCA$cluster == benign, "B", "M")
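
The same rule (the larger cluster is benign) can be wrapped in a small helper that works for any numbering of the two clusters. label_by_size is a hypothetical name introduced here, and the cluster vector is made up for the example.

```r
# Hypothetical helper: give labels[1] to the larger cluster, labels[2] to the other
label_by_size <- function(cluster, labels = c("B", "M")) {
  sizes <- table(cluster)                    # observations per cluster
  larger <- names(sizes)[which.max(sizes)]   # name of the larger cluster
  ifelse(as.character(cluster) == larger, labels[1], labels[2])
}

toy.cluster <- c(2, 1, 2, 2, 1, 2, 2)        # made-up k-means assignment
toy.labels <- label_by_size(toy.cluster)
print(table(toy.labels))
```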

Finally, we write the results to a .csv file.

# data.frame keeps id numeric; cbind would coerce both columns to character
result <- data.frame(id = id, prediction = model.kmeans.PCA$cluster)
write.csv(result, "result.csv", row.names = FALSE) # omit the row-number column
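
A quick way to confirm that the file has the intended shape is to read it back. The sketch below uses a temporary file and made-up ids and predictions; note that write.csv adds a column of row names unless row.names = FALSE is passed.

```r
# Made-up result table standing in for the real predictions
toy.result <- data.frame(id = c(101, 102, 103),
                         prediction = c("B", "M", "B"))

path <- tempfile(fileext = ".csv")              # temporary file, removed with the session
write.csv(toy.result, path, row.names = FALSE)  # write without row names

check <- read.csv(path)                         # read the file back
print(names(check))
print(table(check$prediction))
```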

Visualize clusters

We can visualize our results with fviz_pca_ind from the factoextra library. It plots the observations in the plane of the first two principal components of the PCA model pc and fills the points according to the predicted class, which gives a nice illustration of the clusters. The first two components are used because they explain the majority of the variance (about 63%). A related function, fviz_cluster, performs the PCA step itself when the data have more than two variables.

fviz_pca_ind(pc,
             geom.ind = "point",
             pointshape = 21,
             pointsize = 2,
             fill.ind = model.kmeans.PCA$cluster,
             palette = c("#2E9FDF", "#00AFBB"),
             addEllipses = TRUE,
             legend.title = "Prediction") + 
  ggtitle("2D PCA-plot from 30 feature dataset") +
  theme(plot.title = element_text(hjust = 0.5))