The goal of this chapter is to guide you through a complete analysis using the unsupervised learning techniques covered in the first three chapters. You’ll extend what you’ve learned by using PCA as a preprocessing step for clustering, working with data that consist of measurements of cell nuclei from human breast masses.
Data come from the paper by Bennett and Mangasarian, “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”.
Each observation is a human breast mass that was or was not malignant.
Ten features are measured for each cell nucleus. Each feature is summarized across the cells in a mass in three ways (mean, standard error, and worst value), which gives the 30 feature columns in the file. The data also include the diagnosis (target), which could be used for supervised learning but is not used during the unsupervised analysis.
Overall steps:
1. Download and prepare the data
2. Exploratory data analysis
3. Perform PCA and interpret the results
4. Complete two types of clustering
5. Understand and compare the two types of clustering
6. Combine PCA and clustering
library(readr)
library(dplyr)
library(ggplot2)
library(stringr)
Unlike prior chapters, where we prepared the data for you for unsupervised learning, the goal of this chapter is to step you through a more realistic and complete workflow.
Recall from the video that the first step is to download and prepare the data.
Instructions
Use the read.csv() function to download the CSV (comma-separated values) file containing the data from the URL provided. Assign the result to wisc.df.
Use as.matrix() to convert the features of the data (in columns 3 through 32) to a matrix. Store this in a variable called wisc.data.
Assign the row names of wisc.data the values currently contained in the id column of wisc.df. While not strictly required, this will help you keep track of the different observations throughout the modeling process.
Finally, set a vector called diagnosis to be 1 if a diagnosis is malignant (“M”) and 0 otherwise. Note that R coerces TRUE to 1 and FALSE to 0.
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1903/datasets/WisconsinCancer.csv"
# Download the data: wisc.df
wisc.df <- read.csv(url)
str(wisc.df)
'data.frame': 569 obs. of 33 variables:
$ id : int 842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
$ diagnosis : Factor w/ 2 levels "B","M": 2 2 2 2 2 2 2 2 2 2 ...
$ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
$ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
$ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
$ area_mean : num 1001 1326 1203 386 1297 ...
$ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
$ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
$ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
$ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
$ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
$ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
$ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
$ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
$ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
$ area_se : num 153.4 74.1 94 27.2 94.4 ...
$ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
$ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
$ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
$ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
$ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
$ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
$ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
$ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
$ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
$ area_worst : num 2019 1956 1709 568 1575 ...
$ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
$ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
$ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
$ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
$ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
$ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
$ X : logi NA NA NA NA NA NA ...
# Convert the features of the data: wisc.data
wisc.data <- as.matrix(wisc.df[, 3:32])
# Set the row names of wisc.data
row.names(wisc.data) <- wisc.df$id
str(wisc.data)
num [1:569, 1:30] 18 20.6 19.7 11.4 20.3 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:569] "842302" "842517" "84300903" "84348301" ...
..$ : chr [1:30] "radius_mean" "texture_mean" "perimeter_mean" "area_mean" ...
# Create diagnosis vector
diagnosis <- as.numeric(wisc.df$diagnosis =="M")
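The coercion of TRUE to 1 and FALSE to 0 mentioned in the instructions can be seen on a small vector. The following is a minimal sketch; the names labels, is_malignant, and diagnosis_toy and the toy values are illustrative assumptions, not part of the exercise.

```r
# Toy labels standing in for wisc.df$diagnosis (hypothetical values)
labels <- c("M", "B", "B", "M", "M")

# The comparison yields a logical vector; as.numeric() maps TRUE -> 1, FALSE -> 0
is_malignant <- labels == "M"
diagnosis_toy <- as.numeric(is_malignant)

print(diagnosis_toy)       # 1 0 0 1 1
print(sum(diagnosis_toy))  # 3, the number of "M" labels
```

Because the result is numeric, sum() counts the malignant cases directly, which is what the later exploration step relies on.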
The first step of any data analysis, unsupervised or supervised, is to familiarize yourself with the data.
The variables you created before, wisc.data and diagnosis, are still available in your workspace. Explore the data to answer the following questions:
How many observations are in this dataset?
How many variables/features in the data are suffixed with _mean?
How many of the observations have a malignant diagnosis?
Instructions
Possible Answers
569, 5, 112
30, 10, 212
569, 10, 212 [ans]
30, 5, 112
# 1. Count how many observations are in wisc.data
nrow(wisc.data)
[1] 569
# 2. Count the column names of wisc.data that contain '_mean'
library(stringr)
cn <- colnames(wisc.data)
sum(str_count(cn, "_mean"))
[1] 10
# 3. Count the malignant observations in the diagnosis vector (1 = malignant, 0 otherwise)
sum(diagnosis)
[1] 212
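The suffix count can also be done in base R, and the same idea shows how the feature columns split into groups. This is a sketch on a small illustrative subset of names; cn_toy, n_mean, and suffix_counts are hypothetical names introduced here.

```r
# A few column names in the same style as wisc.data (illustrative subset)
cn_toy <- c("radius_mean", "texture_mean", "radius_se",
            "texture_se", "radius_worst", "texture_worst")

# Anchor the pattern with $ so only names ending in "_mean" match
n_mean <- sum(grepl("_mean$", cn_toy))

# Strip everything up to the last underscore and tabulate the suffixes
suffix_counts <- table(sub(".*_", "", cn_toy))

print(n_mean)
print(suffix_counts)
```

Applied to colnames(wisc.data), the same two lines should report how many columns carry each of the _mean, _se, and _worst suffixes.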
With the data prepared, the next step is to perform PCA on wisc.data. First check the column means and standard deviations to decide whether the features need to be scaled, then run prcomp() and inspect the summary.
# Check column means and standard deviations
round(colMeans(wisc.data), 2)
radius_mean texture_mean perimeter_mean area_mean
14.13 19.29 91.97 654.89
smoothness_mean compactness_mean concavity_mean concave.points_mean
0.10 0.10 0.09 0.05
symmetry_mean fractal_dimension_mean radius_se texture_se
0.18 0.06 0.41 1.22
perimeter_se area_se smoothness_se compactness_se
2.87 40.34 0.01 0.03
concavity_se concave.points_se symmetry_se fractal_dimension_se
0.03 0.01 0.02 0.00
radius_worst texture_worst perimeter_worst area_worst
16.27 25.68 107.26 880.58
smoothness_worst compactness_worst concavity_worst concave.points_worst
0.13 0.25 0.27 0.11
symmetry_worst fractal_dimension_worst
0.29 0.08
round(apply(wisc.data, 2, sd), 2)
radius_mean texture_mean perimeter_mean area_mean
3.52 4.30 24.30 351.91
smoothness_mean compactness_mean concavity_mean concave.points_mean
0.01 0.05 0.08 0.04
symmetry_mean fractal_dimension_mean radius_se texture_se
0.03 0.01 0.28 0.55
perimeter_se area_se smoothness_se compactness_se
2.02 45.49 0.00 0.02
concavity_se concave.points_se symmetry_se fractal_dimension_se
0.03 0.01 0.01 0.00
radius_worst texture_worst perimeter_worst area_worst
4.83 6.15 33.60 569.36
smoothness_worst compactness_worst concavity_worst concave.points_worst
0.02 0.16 0.21 0.07
symmetry_worst fractal_dimension_worst
0.06 0.02
# Execute PCA, scaling if appropriate: wisc.pr
wisc.pr <- prcomp(wisc.data, scale. = TRUE, center = TRUE)
# Look at summary of results
summary(wisc.pr)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
Standard deviation 3.6444 2.3857 1.67867 1.40735 1.28403 1.09880 0.82172 0.69037 0.6457 0.59219
Proportion of Variance 0.4427 0.1897 0.09393 0.06602 0.05496 0.04025 0.02251 0.01589 0.0139 0.01169
Cumulative Proportion 0.4427 0.6324 0.72636 0.79239 0.84734 0.88759 0.91010 0.92598 0.9399 0.95157
PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19
Standard deviation 0.5421 0.51104 0.49128 0.39624 0.30681 0.28260 0.24372 0.22939 0.22244
Proportion of Variance 0.0098 0.00871 0.00805 0.00523 0.00314 0.00266 0.00198 0.00175 0.00165
Cumulative Proportion 0.9614 0.97007 0.97812 0.98335 0.98649 0.98915 0.99113 0.99288 0.99453
PC20 PC21 PC22 PC23 PC24 PC25 PC26 PC27 PC28
Standard deviation 0.17652 0.1731 0.16565 0.15602 0.1344 0.12442 0.09043 0.08307 0.03987
Proportion of Variance 0.00104 0.0010 0.00091 0.00081 0.0006 0.00052 0.00027 0.00023 0.00005
Cumulative Proportion 0.99557 0.9966 0.99749 0.99830 0.9989 0.99942 0.99969 0.99992 0.99997
PC29 PC30
Standard deviation 0.02736 0.01153
Proportion of Variance 0.00002 0.00000
Cumulative Proportion 1.00000 1.00000
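The column means and standard deviations above differ by several orders of magnitude, which is why scaling is used here. The sketch below illustrates the effect on simulated data; the matrix toy and its three columns are assumptions made for illustration and are not the Wisconsin measurements.

```r
set.seed(1)
# Simulated features on very different scales (hypothetical, not the Wisconsin data)
toy <- cbind(area       = rnorm(100, mean = 650, sd = 350),
             smoothness = rnorm(100, mean = 0.1, sd = 0.01),
             texture    = rnorm(100, mean = 19,  sd = 4))

pr_raw    <- prcomp(toy, center = TRUE, scale. = FALSE)
pr_scaled <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each component
pve_raw    <- pr_raw$sdev^2 / sum(pr_raw$sdev^2)
pve_scaled <- pr_scaled$sdev^2 / sum(pr_scaled$sdev^2)

print(round(pve_raw, 3))     # dominated by the large-variance column
print(round(pve_scaled, 3))  # spread more evenly across components
```

Without scaling, the first component mostly tracks whichever feature has the largest raw variance. With scaling, each feature contributes unit variance, so the total variance equals the number of features.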
Now you’ll use some visualizations to better understand your PCA model. You were introduced to one of these visualizations, the biplot, in an earlier chapter.
You’ll run into some common challenges with using biplots on real-world data containing a non-trivial number of observations and variables, then you’ll look at some alternative visualizations. You are encouraged to experiment with additional visualizations before moving on to the next exercise.
Instructions
The variables you created before, wisc.data, diagnosis, and wisc.pr, are still available.
Create a biplot of the wisc.pr data. What stands out to you about this plot? Is it easy or difficult to understand? Why?
Execute the code to scatter plot each observation by principal components 1 and 2, coloring the points by the diagnosis.
Repeat the same for principal components 1 and 3. What do you notice about these plots?
# Create a biplot of wisc.pr
biplot(wisc.pr)
# Scatter plot observations by components 1 and 2
plot(wisc.pr$x[, c(1, 2)], col = (diagnosis + 1),
xlab = "PC1", ylab = "PC2")
# Repeat for components 1 and 3
plot(wisc.pr$x[, c(1, 3)], col = (diagnosis + 1),
xlab = "PC1", ylab = "PC3")
# Do additional data exploration of your choosing below (optional)
plot(wisc.pr$x[, c(2, 3)], col = (diagnosis + 1),
xlab = "PC2", ylab = "PC3")
In this exercise, you will produce scree plots showing the proportion of variance explained as the number of principal components increases. The data from PCA must be prepared for these plots, because base R’s screeplot() draws the raw component variances rather than the proportion of variance explained.
As you look at these plots, ask yourself if there’s an elbow in the amount of variance explained that might lead you to pick a natural number of principal components. If an obvious elbow does not exist, as is typical in real-world datasets, consider how else you might determine the number of principal components to retain based on the scree plot.
Instructions
The variables you created before, wisc.data, diagnosis, and wisc.pr, are still available.
Calculate the variance of each principal component by squaring the sdev component of wisc.pr. Save the result as an object called pr.var.
Calculate the proportion of variance explained by each principal component by dividing each component’s variance by the total variance across all principal components. Assign this to a variable called pve.
Create a plot of variance explained for each principal component.
Using the cumsum() function, create a plot of cumulative proportion of variance explained.
# Set up 1 x 2 plotting grid
par(mfrow = c(1, 2))
# Calculate the variance of each component
pr.var <- wisc.pr$sdev^2
# Proportion of variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
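A quick way to sanity-check a hand-computed pve is to compare it with the importance table that summary() returns for a prcomp object. The sketch below uses the built-in USArrests data as a stand-in, and the names pr_toy, pve_toy, and imp are illustrative.

```r
# PCA on a built-in dataset, used here only as a stand-in for wisc.data
pr_toy <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of variance explained, computed by hand
pve_toy <- pr_toy$sdev^2 / sum(pr_toy$sdev^2)

# summary() stores the same quantities (rounded) in its importance matrix
imp <- summary(pr_toy)$importance

print(round(pve_toy, 5))
print(imp["Proportion of Variance", ])
```

The two printed rows should agree up to the rounding applied by summary().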
This exercise will check your understanding of the PCA results, in particular the loadings and variance explained. The loadings, represented as vectors, explain the mapping from the original features to the principal components. The principal components are naturally ordered from the most variance explained to the least variance explained.
The variables you created before—wisc.data, diagnosis, wisc.pr, and pve—are still available.
For the first principal component, what is the component of the loading vector for the feature concave.points_mean? What is the minimum number of principal components required to explain 80% of the variance of the data?
Instructions
Possible Answers
-0.26085376, 5 [ans]
-0.25088597, 2
0.034767500, 4
0.26085376, 5
# For the 1st part: inspect the rotation (loadings) matrix and read the ["concave.points_mean", "PC1"] entry = -0.26085376
wisc.pr$rotation
PC1 PC2 PC3 PC4 PC5
radius_mean -0.21890244 0.233857132 -0.008531243 0.041408962 -0.037786354
texture_mean -0.10372458 0.059706088 0.064549903 -0.603050001 0.049468850
perimeter_mean -0.22753729 0.215181361 -0.009314220 0.041983099 -0.037374663
area_mean -0.22099499 0.231076711 0.028699526 0.053433795 -0.010331251
smoothness_mean -0.14258969 -0.186113023 -0.104291904 0.159382765 0.365088528
compactness_mean -0.23928535 -0.151891610 -0.074091571 0.031794581 -0.011703971
concavity_mean -0.25840048 -0.060165363 0.002733838 0.019122753 -0.086375412
concave.points_mean -0.26085376 0.034767500 -0.025563541 0.065335944 0.043861025
symmetry_mean -0.13816696 -0.190348770 -0.040239936 0.067124984 0.305941428
fractal_dimension_mean -0.06436335 -0.366575471 -0.022574090 0.048586765 0.044424360
radius_se -0.20597878 0.105552152 0.268481387 0.097941242 0.154456496
texture_se -0.01742803 -0.089979682 0.374633665 -0.359855528 0.191650506
perimeter_se -0.21132592 0.089457234 0.266645367 0.088992415 0.120990220
area_se -0.20286964 0.152292628 0.216006528 0.108205039 0.127574432
smoothness_se -0.01453145 -0.204430453 0.308838979 0.044664180 0.232065676
compactness_se -0.17039345 -0.232715896 0.154779718 -0.027469363 -0.279968156
concavity_se -0.15358979 -0.197207283 0.176463743 0.001316880 -0.353982091
concave.points_se -0.18341740 -0.130321560 0.224657567 0.074067335 -0.195548089
symmetry_se -0.04249842 -0.183848000 0.288584292 0.044073351 0.252868765
fractal_dimension_se -0.10256832 -0.280092027 0.211503764 0.015304750 -0.263297438
radius_worst -0.22799663 0.219866379 -0.047506990 0.015417240 0.004406592
texture_worst -0.10446933 0.045467298 -0.042297823 -0.632807885 0.092883400
perimeter_worst -0.23663968 0.199878428 -0.048546508 0.013802794 -0.007454151
area_worst -0.22487053 0.219351858 -0.011902318 0.025894749 0.027390903
smoothness_worst -0.12795256 -0.172304352 -0.259797613 0.017652216 0.324435445
compactness_worst -0.21009588 -0.143593173 -0.236075625 -0.091328415 -0.121804107
concavity_worst -0.22876753 -0.097964114 -0.173057335 -0.073951180 -0.188518727
concave.points_worst -0.25088597 0.008257235 -0.170344076 0.006006996 -0.043332069
symmetry_worst -0.12290456 -0.141883349 -0.271312642 -0.036250695 0.244558663
fractal_dimension_worst -0.13178394 -0.275339469 -0.232791313 -0.077053470 -0.094423351
PC6 PC7 PC8 PC9 PC10
radius_mean 0.0187407904 -0.1240883403 0.007452296 -0.223109764 0.095486443
texture_mean -0.0321788366 0.0113995382 -0.130674825 0.112699390 0.240934066
perimeter_mean 0.0173084449 -0.1144770573 0.018687258 -0.223739213 0.086385615
area_mean -0.0018877480 -0.0516534275 -0.034673604 -0.195586014 0.074956489
smoothness_mean -0.2863744966 -0.1406689928 0.288974575 0.006424722 -0.069292681
compactness_mean -0.0141309489 0.0309184960 0.151396350 -0.167841425 0.012936200
concavity_mean -0.0093441809 -0.1075204434 0.072827285 0.040591006 -0.135602298
concave.points_mean -0.0520499505 -0.1504822142 0.152322414 -0.111971106 0.008054528
symmetry_mean 0.3564584607 -0.0938911345 0.231530989 0.256040084 0.572069479
fractal_dimension_mean -0.1194306679 0.2957600240 0.177121441 -0.123740789 0.081103207
radius_se -0.0256032561 0.3124900373 -0.022539967 0.249985002 -0.049547594
texture_se -0.0287473145 -0.0907553556 0.475413139 -0.246645397 -0.289142742
perimeter_se 0.0018107150 0.3146403902 0.011896690 0.227154024 -0.114508236
area_se -0.0428639079 0.3466790028 -0.085805135 0.229160015 -0.091927889
smoothness_se -0.3429173935 -0.2440240556 -0.573410232 -0.141924890 0.160884609
compactness_se 0.0691975186 0.0234635340 -0.117460157 -0.145322810 0.043504866
concavity_se 0.0563432386 -0.2088237897 -0.060566501 0.358107079 -0.141276243
concave.points_se -0.0312244482 -0.3696459369 0.108319309 0.272519886 0.086240847
symmetry_se 0.4902456426 -0.0803822539 -0.220149279 -0.304077200 -0.316529830
fractal_dimension_se -0.0531952674 0.1913949726 -0.011168188 -0.213722716 0.367541918
radius_worst -0.0002906849 -0.0097099360 -0.042619416 -0.112141463 0.077361643
texture_worst -0.0500080613 0.0098707439 -0.036251636 0.103341204 0.029550941
perimeter_worst 0.0085009872 -0.0004457267 -0.030558534 -0.109614364 0.050508334
area_worst -0.0251643821 0.0678316595 -0.079394246 -0.080732461 0.069921152
smoothness_worst -0.3692553703 -0.1088308865 -0.205852191 0.112315904 -0.128304659
compactness_worst 0.0477057929 0.1404729381 -0.084019659 -0.100677822 -0.172133632
concavity_worst 0.0283792555 -0.0604880561 -0.072467871 0.161908621 -0.311638520
concave.points_worst -0.0308734498 -0.1679666187 0.036170795 0.060488462 -0.076648291
symmetry_worst 0.4989267845 -0.0184906298 -0.228225053 0.064637806 -0.029563075
fractal_dimension_worst -0.0802235245 0.3746576261 -0.048360667 -0.134174175 0.012609579
PC11 PC12 PC13 PC14 PC15 PC16
radius_mean -0.04147149 0.051067457 0.01196721 0.059506135 -0.051118775 -0.15058388
texture_mean 0.30224340 0.254896423 0.20346133 -0.021560100 -0.107922421 -0.15784196
perimeter_mean -0.01678264 0.038926106 0.04410950 0.048513812 -0.039902936 -0.11445396
area_mean -0.11016964 0.065437508 0.06737574 0.010830829 0.013966907 -0.13244803
smoothness_mean 0.13702184 0.316727211 0.04557360 0.445064860 -0.118143364 -0.20461325
compactness_mean 0.30800963 -0.104017044 0.22928130 0.008101057 0.230899962 0.17017837
concavity_mean -0.12419024 0.065653480 0.38709081 -0.189358699 -0.128283732 0.26947021
concave.points_mean 0.07244603 0.042589267 0.13213810 -0.244794768 -0.217099194 0.38046410
symmetry_mean -0.16305408 -0.288865504 0.18993367 0.030738856 -0.073961707 -0.16466159
fractal_dimension_mean 0.03804827 0.236358988 0.10623908 -0.377078865 0.517975705 -0.04079279
radius_se 0.02535702 -0.016687915 -0.06819523 0.010347413 -0.110050711 0.05890572
texture_se -0.34494446 -0.306160423 -0.16822238 -0.010849347 0.032752721 -0.03450040
perimeter_se 0.16731877 -0.101446828 -0.03784399 -0.045523718 -0.008268089 0.02651665
area_se -0.05161946 -0.017679218 0.05606493 0.083570718 -0.046024366 0.04115323
smoothness_se -0.08420621 -0.294710053 0.15044143 -0.201152530 0.018559465 -0.05803906
compactness_se 0.20688568 -0.263456509 0.01004017 0.491755932 0.168209315 0.18983090
concavity_se -0.34951794 0.251146975 0.15878319 0.134586924 0.250471408 -0.12542065
concave.points_se 0.34237591 -0.006458751 -0.49402674 -0.199666719 0.062079344 -0.19881035
symmetry_se 0.18784404 0.320571348 0.01033274 -0.046864383 -0.113383199 -0.15771150
fractal_dimension_se -0.25062479 0.276165974 -0.24045832 0.145652466 -0.353232211 0.26855388
radius_worst -0.10506733 0.039679665 -0.13789053 0.023101281 0.166567074 -0.08156057
texture_worst -0.01315727 0.079797450 -0.08014543 0.053430792 0.101115399 0.18555785
perimeter_worst -0.05107628 -0.008987738 -0.09696571 0.012219382 0.182755198 -0.05485705
area_worst -0.18459894 0.048088657 -0.10116061 -0.006685465 0.314993600 -0.09065339
smoothness_worst -0.14389035 0.056514866 -0.20513034 0.162235443 0.046125866 0.14555166
compactness_worst 0.19742047 -0.371662503 0.01227931 0.166470250 -0.049956014 -0.15373486
concavity_worst -0.18501676 -0.087034532 0.21798433 -0.066798931 -0.204835886 -0.21502195
concave.points_worst 0.11777205 -0.068125354 -0.25438749 -0.276418891 -0.169499607 0.17814174
symmetry_worst -0.15756025 0.044033503 -0.25653491 0.005355574 0.139888394 0.25789401
fractal_dimension_worst -0.11828355 -0.034731693 -0.17281424 -0.212104110 -0.256173195 -0.40555649
PC17 PC18 PC19 PC20 PC21
radius_mean 0.202924255 0.1467123385 0.22538466 -0.049698664 -0.0685700057
texture_mean -0.038706119 -0.0411029851 0.02978864 -0.244134993 0.4483694667
perimeter_mean 0.194821310 0.1583174548 0.23959528 -0.017665012 -0.0697690429
area_mean 0.255705763 0.2661681046 -0.02732219 -0.090143762 -0.0184432785
smoothness_mean 0.167929914 -0.3522268017 -0.16456584 0.017100960 -0.1194917473
compactness_mean -0.020307708 0.0077941384 0.28422236 0.488686329 0.1926213963
concavity_mean -0.001598353 -0.0269681105 0.00226636 -0.033387086 0.0055717533
concave.points_mean 0.034509509 -0.0828277367 -0.15497236 -0.235407606 -0.0094238187
symmetry_mean -0.191737848 0.1733977905 -0.05881116 0.026069156 -0.0869384844
fractal_dimension_mean 0.050225246 0.0878673570 -0.05815705 -0.175637222 -0.0762718362
radius_se -0.139396866 -0.2362165319 0.17588331 -0.090800503 0.0863867747
texture_se 0.043963016 -0.0098586620 0.03600985 -0.071659988 0.2170719674
perimeter_se -0.024635639 -0.0259288003 0.36570154 -0.177250625 -0.3049501584
area_se 0.334418173 0.3049069032 -0.41657231 0.274201148 0.1925877857
smoothness_se 0.139595006 -0.2312599432 -0.01326009 0.090061477 -0.0720987261
compactness_se -0.008246477 0.1004742346 -0.24244818 -0.461098220 -0.1403865724
concavity_se 0.084616716 -0.0001954852 0.12638102 0.066946174 0.0630479298
concave.points_se 0.108132263 0.0460549116 -0.01216430 0.068868294 0.0343753236
symmetry_se -0.274059129 0.1870147640 -0.08903929 0.107385289 -0.0976995265
fractal_dimension_se -0.122733398 -0.0598230982 0.08660084 0.222345297 0.0628432814
radius_worst -0.240049982 -0.2161013526 0.01366130 -0.005626909 0.0072938995
texture_worst 0.069365185 0.0583984505 -0.07586693 0.300599798 -0.5944401434
perimeter_worst -0.234164147 -0.1885435919 0.09081325 0.011003858 -0.0920235990
area_worst -0.273399584 -0.1420648558 -0.41004720 0.060047387 0.1467901315
smoothness_worst -0.278030197 0.5015516751 0.23451384 -0.129723903 0.1648492374
compactness_worst -0.004037123 -0.0735745143 0.02020070 0.229280589 0.1813748671
concavity_worst -0.191313419 -0.1039079796 -0.04578612 -0.046482792 -0.1321005945
concave.points_worst -0.075485316 0.0758138963 -0.26022962 0.033022340 0.0008860815
symmetry_worst 0.430658116 -0.2787138431 0.11725053 -0.116759236 0.1627085487
fractal_dimension_worst 0.159394300 0.0235647497 -0.01149448 -0.104991974 -0.0923439434
PC22 PC23 PC24 PC25 PC26 PC27
radius_mean -0.07292890 -0.0985526942 -0.18257944 -0.01922650 -0.129476396 -0.131526670
texture_mean -0.09480063 -0.0005549975 0.09878679 0.08474593 -0.024556664 -0.017357309
perimeter_mean -0.07516048 -0.0402447050 -0.11664888 0.02701541 -0.125255946 -0.115415423
area_mean -0.09756578 0.0077772734 0.06984834 -0.21004078 0.362727403 0.466612477
smoothness_mean -0.06382295 -0.0206657211 0.06869742 0.02895489 -0.037003686 0.069689923
compactness_mean 0.09807756 0.0523603957 -0.10413552 0.39662323 0.262808474 0.097748705
concavity_mean 0.18521200 0.3248703785 0.04474106 -0.09697732 -0.548876170 0.364808397
concave.points_mean 0.31185243 -0.0514087968 0.08402770 -0.18645160 0.387643377 -0.454699351
symmetry_mean 0.01840673 -0.0512005770 0.01933947 -0.02458369 -0.016044038 -0.015164835
fractal_dimension_mean -0.28786888 -0.0846898562 -0.13326055 -0.20722186 -0.097404839 -0.101244946
radius_se 0.15027468 -0.2641253170 -0.55870157 -0.17493043 0.049977080 0.212982901
texture_se -0.04845693 -0.0008738805 0.02426730 0.05698648 -0.011237242 -0.010092889
perimeter_se -0.15935280 0.0900742110 0.51675039 0.07292764 0.103653282 0.041691553
area_se -0.06423262 0.0982150746 -0.02246072 0.13185041 -0.155304589 -0.313358657
smoothness_se -0.05054490 -0.0598177179 0.01563119 0.03121070 -0.007717557 -0.009052154
compactness_se 0.04528769 0.0091038710 -0.12177779 0.17316455 -0.049727632 0.046536088
concavity_se 0.20521269 -0.3875423290 0.18820504 0.01593998 0.091454968 -0.084224797
concave.points_se 0.07254538 0.3517550738 -0.10966898 -0.12954655 -0.017941919 -0.011165509
symmetry_se 0.08465443 -0.0423628949 0.00322620 -0.01951493 -0.017267849 -0.019975983
fractal_dimension_se -0.24470508 0.0857810992 0.07519442 -0.08417120 0.035488974 -0.012036564
radius_worst 0.09629821 -0.0556767923 -0.15683037 0.07070972 -0.197054744 -0.178666740
texture_worst 0.11111202 -0.0089228997 -0.11848460 -0.11818972 0.036469433 0.021410694
perimeter_worst -0.01722163 0.0633448296 0.23711317 0.11803403 -0.244103670 -0.241031046
area_worst 0.09695982 0.1908896250 0.14406303 -0.03828995 0.231359525 0.237162466
smoothness_worst 0.06825409 0.0936901494 -0.01099014 -0.04796476 0.012602464 -0.040853568
compactness_worst -0.02967641 -0.1479209247 0.18674995 -0.62438494 -0.100463424 -0.070505414
concavity_worst -0.46042619 0.2864331353 -0.28885257 0.11577034 0.266853781 -0.142905801
concave.points_worst -0.29984056 -0.5675277966 0.10734024 0.26319634 -0.133574507 0.230901389
symmetry_worst -0.09714484 0.1213434508 -0.01438181 0.04529962 0.028184296 0.022790444
fractal_dimension_worst 0.46947115 0.0076253382 0.03782545 0.28013348 0.004520482 0.059985998
PC28 PC29 PC30
radius_mean 2.111940e-01 2.114605e-01 0.7024140910
texture_mean -6.581146e-05 -1.053393e-02 0.0002736610
perimeter_mean 8.433827e-02 3.838261e-01 -0.6898969685
area_mean -2.725083e-01 -4.227949e-01 -0.0329473482
smoothness_mean 1.479269e-03 -3.434667e-03 -0.0048474577
compactness_mean -5.462767e-03 -4.101677e-02 0.0446741863
concavity_mean 4.553864e-02 -1.001479e-02 0.0251386661
concave.points_mean -8.883097e-03 -4.206949e-03 -0.0010772653
symmetry_mean 1.433026e-03 -7.569862e-03 -0.0012803794
fractal_dimension_mean -6.311687e-03 7.301433e-03 -0.0047556848
radius_se -1.922239e-01 1.184421e-01 -0.0087110937
texture_se -5.622611e-03 -8.776279e-03 -0.0010710392
perimeter_se 2.631919e-01 -6.100219e-03 0.0137293906
area_se -4.206811e-02 -8.592591e-02 0.0011053260
smoothness_se 9.792963e-03 1.776386e-03 -0.0016082109
compactness_se -1.539555e-02 3.158134e-03 0.0019156224
concavity_se 5.820978e-03 1.607852e-02 -0.0089265265
concave.points_se -2.900930e-02 -2.393779e-02 -0.0021601973
symmetry_se -7.636526e-03 -5.223292e-03 0.0003293898
fractal_dimension_se 1.975646e-02 -8.341912e-03 0.0017989568
radius_worst 4.126396e-01 -6.357249e-01 -0.1356430561
texture_worst -3.902509e-04 1.723549e-02 0.0010205360
perimeter_worst -7.286809e-01 2.292180e-02 0.0797438536
area_worst 2.389603e-01 4.449359e-01 0.0397422838
smoothness_worst -1.535248e-03 7.385492e-03 0.0045832773
compactness_worst 4.869182e-02 3.566904e-06 -0.0128415624
concavity_worst -1.764090e-02 -1.267572e-02 0.0004021392
concave.points_worst 2.247567e-02 3.524045e-02 -0.0022884418
symmetry_worst 4.920481e-03 1.340423e-02 0.0003954435
fractal_dimension_worst -2.356214e-02 1.147766e-02 0.0018942925
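Rather than scanning the full rotation matrix, a single loading can be pulled out by row and column name. The sketch below shows the indexing pattern on the built-in USArrests data, which stands in for wisc.data; pr_toy, loading, and col_norms are illustrative names.

```r
# PCA on a built-in dataset, used only to demonstrate the indexing pattern
pr_toy <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Index the rotation matrix by feature name (row) and component name (column)
loading <- pr_toy$rotation["Murder", "PC1"]
print(loading)

# Each loading vector (column) has unit length
col_norms <- colSums(pr_toy$rotation^2)
print(round(col_norms, 6))
```

For the exercise, the analogous call would be wisc.pr$rotation["concave.points_mean", "PC1"]. Note that the sign of an entire loading vector is a convention of the solver, so only the relative signs within a column carry meaning.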
# For the 2nd part: read the cumulative proportion of variance. The 0.8 line (80% of total variance) is first reached at PC5 (cumulative 0.847; PC4 reaches only 0.792), so 5 components are needed
# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
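The threshold can also be read off numerically instead of from the plot. The sketch below hard-codes the cumulative proportions for PC1 to PC10 from the summary(wisc.pr) output shown earlier; the helper n_components is a hypothetical name introduced for illustration.

```r
# Cumulative proportions for PC1-PC10, copied from the summary(wisc.pr) output
cum_pve <- c(0.4427, 0.6324, 0.72636, 0.79239, 0.84734,
             0.88759, 0.91010, 0.92598, 0.9399, 0.95157)

# Smallest number of components whose cumulative share reaches a threshold
n_components <- function(cum, threshold) which(cum >= threshold)[1]

print(n_components(cum_pve, 0.80))  # 5
print(n_components(cum_pve, 0.90))  # 7
print(n_components(cum_pve, 0.95))  # 10
```

With the full model in the workspace, the same helper could be applied to cumsum(pve) directly.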
The goal of this exercise is to do hierarchical clustering of the observations. Recall from Chapter 2 that this type of clustering does not assume in advance the number of natural groups that exist in the data.
As part of the preparation for hierarchical clustering, the distances between all pairs of observations are computed. Furthermore, there are different ways to link clusters together, with single, complete, and average being the most common linkage methods.
Instructions
The variables you created before, wisc.data, diagnosis, wisc.pr, and pve, are available in your workspace.
Scale the wisc.data data and assign the result to data.scaled.
Calculate the (Euclidean) distances between all pairs of observations in the new scaled dataset and assign the result to data.dist.
Create a hierarchical clustering model using complete linkage. Manually specify the method argument to hclust() and assign the results to wisc.hclust.
# Scale the wisc.data data: data.scaled
data.scaled <- scale(wisc.data)
# Calculate the (Euclidean) distances: data.dist
data.dist <- dist(data.scaled)
# Create a hierarchical clustering model: wisc.hclust
wisc.hclust <- hclust(data.dist, method = "complete")
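To see how the choice of linkage changes the result, the three common methods can be fitted on the same distance matrix and compared. This is a sketch on simulated two-group data; toy, d, linkages, fits, and sizes are illustrative names, and the data are not the Wisconsin measurements.

```r
set.seed(42)
# Two simulated groups in two dimensions (hypothetical data)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))
d <- dist(scale(toy))

# Fit one tree per linkage method on the same distances
linkages <- c("single", "complete", "average")
fits <- lapply(linkages, function(m) hclust(d, method = m))
names(fits) <- linkages

# Cluster sizes when each tree is cut into two groups
sizes <- lapply(fits, function(h) table(cutree(h, k = 2)))
print(sizes)
```

Single linkage tends to chain observations together and can produce very unbalanced clusters, which is one reason complete linkage is used in this exercise.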
Let’s use the hierarchical clustering model you just created to determine a height (or distance between clusters) where a certain number of clusters exists. The variables you created before—wisc.data, diagnosis, wisc.pr, pve, and wisc.hclust—are all available in your workspace.
Using the plot() function, what is the height at which the clustering model has 4 clusters?
Instructions
Possible Answers
20 [ans: at that height, a horizontal line intersects the dendrogram 4 times]
4
10
24
unique(cutree(wisc.hclust, h=20))
[1] 1 2 3 4
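The height can also be found from the merge heights stored in the model rather than by reading the dendrogram. The sketch below works on simulated data with complete linkage; toy, hc, lower, upper, and h_mid are illustrative names.

```r
set.seed(7)
toy <- matrix(rnorm(60), ncol = 3)            # 20 simulated observations
hc  <- hclust(dist(toy), method = "complete")

n <- nrow(toy)
k <- 4
# hc$height holds the n - 1 merge heights in increasing order.
# Exactly k clusters remain after n - k merges, i.e. for cut heights in
# [hc$height[n - k], hc$height[n - k + 1]).
lower <- hc$height[n - k]
upper <- hc$height[n - k + 1]
h_mid <- (lower + upper) / 2

print(c(lower = lower, upper = upper))
print(length(unique(cutree(hc, h = h_mid))))  # number of clusters at h_mid
```

Any height inside that interval yields the same k clusters, so a quiz answer such as a round number is one of many valid cut heights.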
In this exercise, you will compare the outputs from your hierarchical clustering model to the actual diagnoses. Normally when performing unsupervised learning like this, a target variable isn’t available. We do have it with this dataset, however, so it can be used to check the performance of the clustering model.
When performing supervised learning—that is, when you’re trying to predict some target variable of interest and that target variable is available in the original data—using clustering to create new features may or may not improve the performance of the final model. This exercise will help you determine if, in this case, hierarchical clustering provides a promising new feature.
Instructions
wisc.data, diagnosis, wisc.pr, pve, and wisc.hclust are available in your workspace.
Use cutree() to cut the tree so that it has 4 clusters. Assign the output to the variable wisc.hclust.clusters.
Use the table() function to compare the cluster membership to the actual diagnoses.
# Cut tree so that it has 4 clusters: wisc.hclust.clusters
wisc.hclust.clusters <- cutree(wisc.hclust, k = 4)
# Compare cluster membership to actual diagnoses
table(wisc.hclust.clusters, diagnosis)
diagnosis
wisc.hclust.clusters 0 1
1 12 165
2 2 5
3 343 40
4 0 2
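Raw counts are easier to judge as within-cluster shares. The sketch below rebuilds the table above from its printed counts and summarizes it; tab, row_share, and purity are illustrative names, and purity (the fraction of observations in their cluster's majority class) is a summary introduced here rather than part of the exercise.

```r
# Counts copied from the table above: rows are clusters 1-4,
# columns are diagnosis 0 (benign) and 1 (malignant)
tab <- matrix(c( 12, 165,
                  2,   5,
                343,  40,
                  0,   2),
              ncol = 2, byrow = TRUE,
              dimnames = list(cluster = as.character(1:4),
                              diagnosis = c("0", "1")))

# Share of each diagnosis within every cluster (rows sum to 1)
row_share <- prop.table(tab, margin = 1)
print(round(row_share, 3))

# Fraction of observations that fall in their cluster's majority class
purity <- sum(apply(tab, 1, max)) / sum(tab)
print(round(purity, 3))  # 0.905
```

Clusters 1 and 3 hold most of the observations and are each dominated by one diagnosis, which is what makes cluster membership a plausible feature for a supervised model.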
As you now know, there are two main types of clustering: hierarchical and k-means.
In this exercise, you will create a k-means clustering model on the Wisconsin breast cancer data and compare the results to the actual diagnoses and the results of your hierarchical clustering model. Take some time to see how each clustering model performs in terms of separating the two diagnoses and how the clustering models compare to each other.
Instructions
wisc.data, diagnosis, and wisc.hclust.clusters are still available.
Create a k-means model on wisc.data, assigning the result to wisc.km. Be sure to create 2 clusters, corresponding to the actual number of diagnoses. Also, remember to scale the data and repeat the algorithm 20 times to find a well-performing model.
Use the table() function to compare the cluster membership of the k-means model to the actual diagnoses contained in the diagnosis vector. How well does k-means separate the two diagnoses?
Use the table() function to compare the cluster membership of the k-means model to the hierarchical clustering model. Recall the cluster membership of the hierarchical clustering model is contained in wisc.hclust.clusters.
# Create a k-means model on wisc.data: wisc.km
wisc.km<-kmeans(scale(wisc.data), centers = 2, nstart = 20)
# Compare k-means to actual diagnoses
table(wisc.km$cluster, diagnosis)
diagnosis
0 1
1 343 37
2 14 175
sum(apply(table(wisc.km$cluster, diagnosis), 1, min))
[1] 51
# Compare k-means to hierarchical clustering
table(wisc.hclust.clusters, wisc.km$cluster)
wisc.hclust.clusters 1 2
1 17 160
2 0 7
3 363 20
4 0 2
sum(apply(table(wisc.hclust.clusters, wisc.km$cluster), 1, min))
[1] 37
Remarks: Looking at the second table, clusters 1, 2, and 4 from the hierarchical clustering model correspond to cluster 2 from the k-means model, and cluster 3 corresponds to cluster 1. The k-means labels are arbitrary and can swap between runs, so read the correspondence from the table rather than from the label numbers.
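Because cluster labels are arbitrary, the correspondence between two clusterings is safer to derive from the cross-table than to assume. The sketch below rebuilds the hierarchical versus k-means table from its printed counts and picks, for each hierarchical cluster, the k-means cluster that holds most of its members; tab and best_match are illustrative names.

```r
# Counts copied from the hierarchical (rows) vs k-means (columns) table above
tab <- matrix(c( 17, 160,
                  0,   7,
                363,  20,
                  0,   2),
              ncol = 2, byrow = TRUE,
              dimnames = list(hclust = as.character(1:4),
                              kmeans = c("1", "2")))

# For each hierarchical cluster, the k-means cluster holding most of its members
best_match <- apply(tab, 1, which.max)
print(best_match)  # 2 2 1 2
```

A rerun of kmeans() may swap labels 1 and 2, in which case the mapping flips while the grouping itself stays the same.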
In this final exercise, you will put together several steps you used earlier and, in doing so, you will experience some of the creativity that is typical in unsupervised learning.
Recall from earlier exercises that the PCA model required significantly fewer features to describe 80% and 95% of the variability of the data. In addition to normalizing data and potentially avoiding overfitting, PCA also decorrelates the variables, sometimes improving the performance of other modeling techniques.
Let’s see if PCA improves or degrades the performance of hierarchical clustering.
Instructions
wisc.pr, diagnosis, wisc.hclust.clusters, and wisc.km are still available in your workspace.
Using the minimum number of principal components required to describe at least 90% of the variability in the data, create a hierarchical clustering model with complete linkage. Assign the results to wisc.pr.hclust.
Cut this hierarchical clustering model into 4 clusters and assign the results to wisc.pr.hclust.clusters.
Using table(), compare the results from your new hierarchical clustering model with the actual diagnoses. How well does the newly created model with four clusters separate out the two diagnoses?
How well do the k-means and hierarchical clustering models you created in previous exercises do in terms of separating the diagnoses? Again, use the table() function to compare the output of each model with the vector containing the actual diagnoses.
# Create a hierarchical clustering model: wisc.pr.hclust
wisc.pr.hclust <- hclust(dist(wisc.pr$x[, 1:7]), method = "complete")
# Cut model into 4 clusters: wisc.pr.hclust.clusters
wisc.pr.hclust.clusters<-cutree(wisc.pr.hclust, k=4)
# Compare to actual diagnoses
table(wisc.pr.hclust.clusters, diagnosis)
diagnosis
wisc.pr.hclust.clusters 0 1
1 5 113
2 350 97
3 2 0
4 0 2
table(wisc.hclust.clusters, diagnosis)
diagnosis
wisc.hclust.clusters 0 1
1 12 165
2 2 5
3 343 40
4 0 2
# Compare k-means to the actual diagnoses and to the PCA-based hierarchical clusters
table(wisc.km$cluster, diagnosis)
diagnosis
0 1
1 343 37
2 14 175
table(wisc.km$cluster,wisc.pr.hclust.clusters)
wisc.pr.hclust.clusters
1 2 3 4
1 3 377 0 0
2 115 70 2 2
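One rough way to put the three models side by side is to count, for each, the observations that fall outside their cluster's majority diagnosis. The sketch below uses the counts printed in the tables above; mk, tabs, and minority are illustrative names, and the minority count is a simple summary chosen here, not a metric prescribed by the exercise.

```r
# Helper: build a clusters x diagnosis matrix from counts listed row by row
mk <- function(x) matrix(x, ncol = 2, byrow = TRUE)

# Counts copied from the three cluster-vs-diagnosis tables above
tabs <- list(
  hclust     = mk(c(12, 165, 2, 5, 343, 40, 0, 2)),
  kmeans     = mk(c(343, 37, 14, 175)),
  pca_hclust = mk(c(5, 113, 350, 97, 2, 0, 0, 2))
)

# Observations outside the majority diagnosis of their cluster
minority <- sapply(tabs, function(tab) sum(apply(tab, 1, min)))
print(minority)  # hclust 54, kmeans 51, pca_hclust 102
```

By this measure the PCA-based hierarchical model separates the diagnoses less cleanly than the other two in this run. The comparison is only indicative, since models with more clusters can lower the count mechanically.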