Features of the Breast Cancer Dataset are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23-34].

The dataset can be downloaded from Kaggle.

Attribute Information:

  1) ID number
  2) Diagnosis (M = malignant, B = benign)
  3-32) Ten real-valued features are computed for each cell nucleus:

  a) radius (mean of distances from center to points on the perimeter)
  b) texture (standard deviation of gray-scale values)
  c) perimeter
  d) area
  e) smoothness (local variation in radius lengths)
  f) compactness (perimeter^2 / area - 1.0)
  g) concavity (severity of concave portions of the contour)
  h) concave points (number of concave portions of the contour)
  i) symmetry
  j) fractal dimension (“coastline approximation” - 1)

The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, column 3 is Mean Radius, column 13 is Radius Standard Error, and column 23 is Worst Radius.

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant.

For our diagnoses, Malignant is represented as 1 and Benign is represented as 0.
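A minimal sketch of this recoding, assuming the raw labels are stored as the characters "M" and "B". The `raw_labels` vector is a made-up toy example, not the actual data.

```r
# Hypothetical toy labels standing in for the dataset's diagnosis column
raw_labels <- c("M", "B", "B", "M", "B")

# The logical comparison is converted to 1 (malignant) / 0 (benign)
diagnosis <- as.numeric(raw_labels == "M")

print(diagnosis)
table(diagnosis)
```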

Examine column means

##             radius_mean            texture_mean          perimeter_mean 
##            1.412729e+01            1.928965e+01            9.196903e+01 
##               area_mean         smoothness_mean        compactness_mean 
##            6.548891e+02            9.636028e-02            1.043410e-01 
##          concavity_mean     concave.points_mean           symmetry_mean 
##            8.879932e-02            4.891915e-02            1.811619e-01 
##  fractal_dimension_mean               radius_se              texture_se 
##            6.279761e-02            4.051721e-01            1.216853e+00 
##            perimeter_se                 area_se           smoothness_se 
##            2.866059e+00            4.033708e+01            7.040979e-03 
##          compactness_se            concavity_se       concave.points_se 
##            2.547814e-02            3.189372e-02            1.179614e-02 
##             symmetry_se    fractal_dimension_se            radius_worst 
##            2.054230e-02            3.794904e-03            1.626919e+01 
##           texture_worst         perimeter_worst              area_worst 
##            2.567722e+01            1.072612e+02            8.805831e+02 
##        smoothness_worst       compactness_worst         concavity_worst 
##            1.323686e-01            2.542650e-01            2.721885e-01 
##    concave.points_worst          symmetry_worst fractal_dimension_worst 
##            1.146062e-01            2.900756e-01            8.394582e-02

Examine column standard deviations

##             radius_mean            texture_mean          perimeter_mean 
##            3.524049e+00            4.301036e+00            2.429898e+01 
##               area_mean         smoothness_mean        compactness_mean 
##            3.519141e+02            1.406413e-02            5.281276e-02 
##          concavity_mean     concave.points_mean           symmetry_mean 
##            7.971981e-02            3.880284e-02            2.741428e-02 
##  fractal_dimension_mean               radius_se              texture_se 
##            7.060363e-03            2.773127e-01            5.516484e-01 
##            perimeter_se                 area_se           smoothness_se 
##            2.021855e+00            4.549101e+01            3.002518e-03 
##          compactness_se            concavity_se       concave.points_se 
##            1.790818e-02            3.018606e-02            6.170285e-03 
##             symmetry_se    fractal_dimension_se            radius_worst 
##            8.266372e-03            2.646071e-03            4.833242e+00 
##           texture_worst         perimeter_worst              area_worst 
##            6.146258e+00            3.360254e+01            5.693570e+02 
##        smoothness_worst       compactness_worst         concavity_worst 
##            2.283243e-02            1.573365e-01            2.086243e-01 
##    concave.points_worst          symmetry_worst fractal_dimension_worst 
##            6.573234e-02            6.186747e-02            1.806127e-02
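Summaries like the two above can be produced with `colMeans()` and `apply(..., 2, sd)`. The sketch below uses a small made-up matrix (`toy`) rather than the real data; with the `wisc.data` matrix used later in this analysis, the calls would presumably take the same form.

```r
# Hypothetical toy matrix: 4 observations, 3 features on very different scales
toy <- cbind(
  radius = c(14, 20, 11, 17),
  area = c(600, 1200, 380, 900),
  smoothness = c(0.09, 0.11, 0.08, 0.10)
)

col_means <- colMeans(toy)     # mean of each column
col_sds <- apply(toy, 2, sd)   # sample standard deviation of each column

print(col_means)
print(col_sds)
```

When the means and standard deviations differ by orders of magnitude, as they do here between `area` and `smoothness`, features on larger scales would dominate any distance- or variance-based method unless the data are scaled first.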

We will perform Principal Component Analysis (PCA) on the scaled data, since the column means and standard deviations vary significantly across features.

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6
## Standard deviation     3.6444 2.3857 1.67867 1.40735 1.28403 1.09880
## Proportion of Variance 0.4427 0.1897 0.09393 0.06602 0.05496 0.04025
## Cumulative Proportion  0.4427 0.6324 0.72636 0.79239 0.84734 0.88759
##                            PC7     PC8    PC9    PC10   PC11    PC12
## Standard deviation     0.82172 0.69037 0.6457 0.59219 0.5421 0.51104
## Proportion of Variance 0.02251 0.01589 0.0139 0.01169 0.0098 0.00871
## Cumulative Proportion  0.91010 0.92598 0.9399 0.95157 0.9614 0.97007
##                           PC13    PC14    PC15    PC16    PC17    PC18
## Standard deviation     0.49128 0.39624 0.30681 0.28260 0.24372 0.22939
## Proportion of Variance 0.00805 0.00523 0.00314 0.00266 0.00198 0.00175
## Cumulative Proportion  0.97812 0.98335 0.98649 0.98915 0.99113 0.99288
##                           PC19    PC20   PC21    PC22    PC23   PC24
## Standard deviation     0.22244 0.17652 0.1731 0.16565 0.15602 0.1344
## Proportion of Variance 0.00165 0.00104 0.0010 0.00091 0.00081 0.0006
## Cumulative Proportion  0.99453 0.99557 0.9966 0.99749 0.99830 0.9989
##                           PC25    PC26    PC27    PC28    PC29    PC30
## Standard deviation     0.12442 0.09043 0.08307 0.03987 0.02736 0.01153
## Proportion of Variance 0.00052 0.00027 0.00023 0.00005 0.00002 0.00000
## Cumulative Proportion  0.99942 0.99969 0.99992 0.99997 1.00000 1.00000
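A summary of this kind is what `summary()` prints for a `prcomp` object. The sketch below runs a scaled PCA on made-up data (`toy`); it is an illustration under that assumption, not the original call.

```r
set.seed(1)
# Hypothetical toy data: 50 observations of 4 features on very different scales
toy <- cbind(
  a = rnorm(50, mean = 10, sd = 1),
  b = rnorm(50, mean = 500, sd = 100),
  c = rnorm(50, mean = 0.1, sd = 0.01),
  d = rnorm(50, mean = 50, sd = 5)
)

# Center and scale each column before the decomposition
toy_pr <- prcomp(toy, center = TRUE, scale. = TRUE)

summary(toy_pr)

# With scaling, the component variances add up to the number of features
total_var <- sum(toy_pr$sdev^2)
print(total_var)
```

The same property shows up in the summary above: the squared standard deviations of the 30 components add up to roughly 30, which is what scaled input would give.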

Let’s create a biplot for the transformed data

As we can see from our scree plot, at least 5 principal components are required to explain 80% of the variance in the data.
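The 80% threshold can be checked directly from the standard deviations printed in the summary. The sketch below copies those values by hand; the variable names are illustrative.

```r
# Standard deviations of the 30 principal components, copied from the summary
pc_sdev <- c(
  3.6444, 2.3857, 1.67867, 1.40735, 1.28403, 1.09880,
  0.82172, 0.69037, 0.6457, 0.59219, 0.5421, 0.51104,
  0.49128, 0.39624, 0.30681, 0.28260, 0.24372, 0.22939,
  0.22244, 0.17652, 0.1731, 0.16565, 0.15602, 0.1344,
  0.12442, 0.09043, 0.08307, 0.03987, 0.02736, 0.01153
)

pve <- pc_sdev^2 / sum(pc_sdev^2)   # proportion of variance explained
cum_pve <- cumsum(pve)              # cumulative proportion

# Smallest number of components whose cumulative proportion reaches 80%
n_needed <- which(cum_pve >= 0.80)[1]

print(round(cum_pve[1:6], 4))
print(n_needed)
```

Four components reach only about 79% of the variance, so the fifth is the first to cross the threshold.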

Components of the loading vectors of each feature for the first 3 principal components:

##                                 PC1          PC2          PC3
## radius_mean             -0.21890244  0.233857132 -0.008531243
## texture_mean            -0.10372458  0.059706088  0.064549903
## perimeter_mean          -0.22753729  0.215181361 -0.009314220
## area_mean               -0.22099499  0.231076711  0.028699526
## smoothness_mean         -0.14258969 -0.186113023 -0.104291904
## compactness_mean        -0.23928535 -0.151891610 -0.074091571
## concavity_mean          -0.25840048 -0.060165363  0.002733838
## concave.points_mean     -0.26085376  0.034767500 -0.025563541
## symmetry_mean           -0.13816696 -0.190348770 -0.040239936
## fractal_dimension_mean  -0.06436335 -0.366575471 -0.022574090
## radius_se               -0.20597878  0.105552152  0.268481387
## texture_se              -0.01742803 -0.089979682  0.374633665
## perimeter_se            -0.21132592  0.089457234  0.266645367
## area_se                 -0.20286964  0.152292628  0.216006528
## smoothness_se           -0.01453145 -0.204430453  0.308838979
## compactness_se          -0.17039345 -0.232715896  0.154779718
## concavity_se            -0.15358979 -0.197207283  0.176463743
## concave.points_se       -0.18341740 -0.130321560  0.224657567
## symmetry_se             -0.04249842 -0.183848000  0.288584292
## fractal_dimension_se    -0.10256832 -0.280092027  0.211503764
## radius_worst            -0.22799663  0.219866379 -0.047506990
## texture_worst           -0.10446933  0.045467298 -0.042297823
## perimeter_worst         -0.23663968  0.199878428 -0.048546508
## area_worst              -0.22487053  0.219351858 -0.011902318
## smoothness_worst        -0.12795256 -0.172304352 -0.259797613
## compactness_worst       -0.21009588 -0.143593173 -0.236075625
## concavity_worst         -0.22876753 -0.097964114 -0.173057335
## concave.points_worst    -0.25088597  0.008257235 -0.170344076
## symmetry_worst          -0.12290456 -0.141883349 -0.271312642
## fractal_dimension_worst -0.13178394 -0.275339469 -0.232791313
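Each loading vector has unit length, and the features with the largest absolute loadings contribute most to that component. The sketch below checks both for PC1, using the values copied from the table above. The sign of a loading vector is arbitrary, so only magnitudes are compared.

```r
# PC1 loadings copied from the table above
pc1 <- c(
  radius_mean = -0.21890244, texture_mean = -0.10372458,
  perimeter_mean = -0.22753729, area_mean = -0.22099499,
  smoothness_mean = -0.14258969, compactness_mean = -0.23928535,
  concavity_mean = -0.25840048, concave.points_mean = -0.26085376,
  symmetry_mean = -0.13816696, fractal_dimension_mean = -0.06436335,
  radius_se = -0.20597878, texture_se = -0.01742803,
  perimeter_se = -0.21132592, area_se = -0.20286964,
  smoothness_se = -0.01453145, compactness_se = -0.17039345,
  concavity_se = -0.15358979, concave.points_se = -0.18341740,
  symmetry_se = -0.04249842, fractal_dimension_se = -0.10256832,
  radius_worst = -0.22799663, texture_worst = -0.10446933,
  perimeter_worst = -0.23663968, area_worst = -0.22487053,
  smoothness_worst = -0.12795256, compactness_worst = -0.21009588,
  concavity_worst = -0.22876753, concave.points_worst = -0.25088597,
  symmetry_worst = -0.12290456, fractal_dimension_worst = -0.13178394
)

# Squared loadings of one component should sum to (approximately) 1
squared_length <- sum(pc1^2)
print(squared_length)

# Three features with the largest absolute loading on PC1
top3 <- head(sort(abs(pc1), decreasing = TRUE), 3)
print(top3)
```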

We will cut the tree at height 20, so that it has 4 clusters.
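The sketch below shows the two ways `cutree()` can cut a dendrogram: by number of clusters (`k`) or by height (`h`). It runs on made-up data, and the complete linkage used here is an assumption; the code that built the original tree is not shown in this document.

```r
set.seed(42)
# Hypothetical toy data: two well-separated groups of 10 points in 2 dimensions
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.5), ncol = 2)
)

# Scale, compute Euclidean distances, and build the tree (complete linkage)
toy_hclust <- hclust(dist(scale(toy)), method = "complete")

# Cut by number of clusters ...
clusters_k <- cutree(toy_hclust, k = 2)

# ... or by height: any height between the last two merges yields 2 clusters
cut_height <- mean(rev(toy_hclust$height)[1:2])
clusters_h <- cutree(toy_hclust, h = cut_height)

table(clusters_k, clusters_h)
```

Cutting at a height and cutting to a cluster count are equivalent whenever the chosen height falls between the merge heights that bracket that count, which is the situation described above for height 20 and 4 clusters.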

Now let’s compare cluster membership to actual diagnoses

##                     diagnosis
## wisc.hclust.clusters   0   1
##                    1  12 165
##                    2   2   5
##                    3 343  40
##                    4   0   2

# Create a k-means model on wisc.data: wisc.km
wisc.km <- kmeans(scale(wisc.data), centers = 2, nstart = 20)

Let’s compare k-means to actual diagnoses

##    diagnosis
##       0   1
##   1  14 175
##   2 343  37
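To put numbers on this comparison, the counts in the table can be turned into simple rates. The sketch below assumes cluster 1 is read as "malignant" and cluster 2 as "benign", which is the reading the counts suggest. These rates are an illustrative summary, not something the original analysis reports.

```r
# Counts copied from the table above (rows: k-means cluster, columns: diagnosis)
km_table <- matrix(
  c(14, 175,
    343, 37),
  nrow = 2, byrow = TRUE,
  dimnames = list(cluster = c("1", "2"), diagnosis = c("0", "1"))
)

# Assumption: cluster 1 = malignant, cluster 2 = benign
sensitivity <- km_table["1", "1"] / sum(km_table[, "1"])  # malignant cases in cluster 1
specificity <- km_table["2", "0"] / sum(km_table[, "0"])  # benign cases in cluster 2
accuracy <- (km_table["1", "1"] + km_table["2", "0"]) / sum(km_table)

round(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy), 3)
```

Under that reading, k-means groups roughly 83% of malignant cases and 96% of benign cases with their own class, about 91% of all 569 observations.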

Let’s compare k-means to hierarchical clustering

##                     
## wisc.hclust.clusters   1   2
##                    1 160  17
##                    2   7   0
##                    3  20 363
##                    4   2   0

The table suggests that clusters 1, 2, and 4 from the hierarchical clustering model correspond to cluster 1 from the k-means model, and that cluster 3 corresponds to cluster 2.

# Create a hierarchical clustering model: wisc.pr.hclust
wisc.pr.hclust <- hclust(dist(wisc.pr$x[, 1:7]), method = "complete")

# Cut model into 4 clusters: wisc.pr.hclust.clusters
wisc.pr.hclust.clusters <- cutree(wisc.pr.hclust, k = 4)

Compare the PCA-based hierarchical clustering model to actual diagnoses

##          wisc.pr.hclust.clusters
## diagnosis   1   2   3   4
##         0   5 350   2   0
##         1 113  97   0   2

Compare actual diagnoses to the original hierarchical clustering and k-means models

##          wisc.hclust.clusters
## diagnosis   1   2   3   4
##         0  12   2 343   0
##         1 165   5  40   2
##          
## diagnosis   1   2
##         0  14 343
##         1 175  37
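One possible way to put the three models on a single scale is cluster purity: assign each cluster to its majority diagnosis and take the share of observations that agree. This metric is an illustrative choice, not part of the original analysis. The sketch below computes it from the counts in the tables above.

```r
# Contingency tables copied from the outputs above
# (rows: diagnosis 0/1, columns: clusters)
hclust_tab <- rbind(c(12, 2, 343, 0), c(165, 5, 40, 2))
kmeans_tab <- rbind(c(14, 343), c(175, 37))
pr_hclust_tab <- rbind(c(5, 350, 2, 0), c(113, 97, 0, 2))

# Purity: sum of each cluster's majority count, divided by all observations
purity <- function(tab) sum(apply(tab, 2, max)) / sum(tab)

purities <- c(
  hclust = purity(hclust_tab),
  kmeans = purity(kmeans_tab),
  pca_hclust = purity(pr_hclust_tab)
)
round(purities, 3)
```

By this measure k-means (about 0.91) and the original hierarchical model (about 0.91) are close, while the hierarchical model built on the first 7 principal components is lower (about 0.82), mainly because its cluster 2 mixes 97 malignant cases with 350 benign ones. Purity tends to rise with the number of clusters, so the 4-cluster and 2-cluster models are not strictly comparable.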