library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(class)
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(assertive)
##
## Attaching package: 'assertive'
## The following objects are masked from 'package:purrr':
##
## is_atomic, is_character, is_double, is_empty, is_formula,
## is_function, is_integer, is_list, is_logical, is_null, is_numeric,
## is_vector
## The following object is masked from 'package:tibble':
##
## has_rownames
library(fst)
library(broom)
library(naivebayes)
## naivebayes 0.9.7 loaded
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(rpart)
library(rpart.plot)
library(sigr)
library(WVPlots)
## Loading required package: wrapr
##
## Attaching package: 'wrapr'
## The following object is masked from 'package:dplyr':
##
## coalesce
## The following objects are masked from 'package:tidyr':
##
## pack, unpack
## The following object is masked from 'package:tibble':
##
## view
library(vtreat)
library(stringr)
pokemon_raw <- read.csv("https://assets.datacamp.com/production/course_6430/datasets/Pokemon.csv")
#write.csv(Pokemon, "C:/Users/4/Documents/R/pokemon_raw.csv", row.names = FALSE)
wisc.df <- read.csv("https://assets.datacamp.com/production/course_6430/datasets/WisconsinCancer.csv")
Three main goals of unsupervised learning:
1. find patterns in the features of the data
2. visualization of high-dimensional data
3. pre-processing step for supervised learning
k-means is an algorithm used to find homogeneous subgroups in a population, and it comes with base R via kmeans(). You can run kmeans() many times to estimate the number of subgroups when it is not known a priori.
x <- matrix(rnorm(100, 1, 3), nrow = 50, ncol = 2)
head(x)
## [,1] [,2]
## [1,] -1.070823 3.9289233
## [2,] 3.090331 -0.4667977
## [3,] 1.101040 2.4800939
## [4,] -1.057556 3.3155996
## [5,] 1.311827 -0.5061106
## [6,] -4.429845 -2.3364146
# Create the k-means model: km.out
km.out <- kmeans(x, centers=3, nstart=20)
# Inspect the result
summary(km.out)
## Length Class Mode
## cluster 50 -none- numeric
## centers 6 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
# Print the cluster membership component of the model
km.out$cluster
## [1] 2 1 2 2 1 3 3 2 1 3 2 3 2 1 2 2 1 3 1 2 1 2 2 2 1 2 3 2 1 1 2 3 2 1 1 1 3 2
## [39] 3 2 2 1 2 1 2 2 2 1 1 3
# Print the km.out object
print(km.out)
## K-means clustering with 3 clusters of sizes 17, 23, 10
##
## Cluster means:
## [,1] [,2]
## 1 1.982061 -2.0621069
## 2 2.223571 3.2652442
## 3 -3.231963 -0.3378415
##
## Clustering vector:
## [1] 2 1 2 2 1 3 3 2 1 3 2 3 2 1 2 2 1 3 1 2 1 2 2 2 1 2 3 2 1 1 2 3 2 1 1 1 3 2
## [39] 3 2 2 1 2 1 2 2 2 1 1 3
##
## Within cluster sum of squares by cluster:
## [1] 93.06422 148.86446 94.29089
## (between_SS / total_SS = 60.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Scatter plot of x
plot(x, col=km.out$cluster, xlab="", ylab="")
Figure 1.1 K-means with 3 Clusters
In the video, you saw how kmeans() randomly initializes the centers of clusters. This random initialization can result in assigning observations to different cluster labels. Also, the random initialization can result in finding different local minima for the k-means algorithm. This exercise demonstrates both results.
# Set up 2 x 3 plotting grid
par(mfrow = c(2, 3))
# Set seed
set.seed(1)
for(i in 1:6) {
# Run kmeans() on x with three clusters and one start
km.out <- kmeans(x, centers=3, nstart=1)
# Plot clusters
plot(x, col = km.out$cluster,
main = km.out$tot.withinss,
xlab = "", ylab = "")
}
Figure 1.2 Six k-means runs with random initializations
The k-means algorithm assumes the number of clusters as part of the input. If you know the number of clusters in advance (e.g. due to certain business constraints) this makes setting the number of clusters easy. However, as you saw in the video, if you do not know the number of clusters and need to determine it, you will need to run the algorithm multiple times, each time with a different number of clusters. From this, you can observe how a measure of model quality changes with the number of clusters.
In this exercise, you will run kmeans() multiple times to see how model quality changes as the number of clusters changes. Plots displaying this information help to determine the number of clusters and are often referred to as scree plots.
The ideal plot will have an elbow where the quality measure improves more slowly as the number of clusters increases. This indicates that the quality of the model is no longer improving substantially as the model complexity (i.e. number of clusters) increases. In other words, the elbow indicates the number of clusters inherent in the data.
# Initialize total within sum of squares error: wss
wss <- 0
# For 1 to 15 cluster centers
for (i in 1:15) {
km.out <- kmeans(x, centers = i, nstart = 20)
# Save total within sum of squares to wss variable
wss[i] <- km.out$tot.withinss
}
# Plot total within sum of squares vs. number of clusters
plot(1:15, wss, type = "b",
xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
Figure 1.3 The total within sum of squares vs. number of clusters
# Set k equal to the number of clusters corresponding to the elbow location
k <- 2 # 3 is probably OK, too
Dealing with real data is often more challenging than dealing with synthetic data. Synthetic data helps with learning new concepts and techniques, but the next few exercises will deal with data that is closer to the type of real data you might find in your professional or academic pursuits.
The first challenge with the Pokemon data is that there is no pre-determined number of clusters. You will determine the appropriate number of clusters, keeping in mind that with real data the elbow in the scree plot might be less sharp than with synthetic data. Use your judgement when determining the number of clusters.
The second part of this exercise includes plotting the outcomes of the clustering on two dimensions, or features, of the data. These features were chosen somewhat arbitrarily for this exercise. Think about how you would use plotting and clustering to communicate interesting groups of Pokemon to other people.
An additional note: this exercise utilizes the iter.max argument to kmeans(). As you’ve seen, kmeans() is an iterative algorithm, repeating over and over until some stopping criterion is reached. The default maximum number of iterations for kmeans() is 10, which is not enough for the algorithm to converge on this data, so we’ll set iter.max to 50 to overcome this issue. To see what happens when kmeans() does not converge, try running the example with a lower number of iterations (e.g. 3); a quick sketch of this appears after the next chunk. This is another example of the kinds of issues you may encounter with real data.
pokemon <- pokemon_raw %>% select(6:11)
head(pokemon)
## HitPoints Attack Defense SpecialAttack SpecialDefense Speed
## 1 45 49 49 65 65 45
## 2 60 62 63 80 80 60
## 3 80 82 83 100 100 80
## 4 80 100 123 122 120 80
## 5 39 52 43 60 50 65
## 6 58 64 58 80 65 80
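A minimal sketch of the non-convergence behavior described above, using the pokemon data just created (expect kmeans() to emit “did not converge in 3 iterations” warnings for some of the starts; exact behavior varies by run):
# With iter.max = 3, some of the 20 random starts may stop before
# converging, and kmeans() warns about it
km.quick <- kmeans(pokemon, centers = 4, nstart = 20, iter.max = 3)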
# Initialize total within sum of squares error: wss
wss <- 0
# Loop over 1 to 15 possible clusters
for (i in 1:15) {
# Fit the model: km.out
km.out <- kmeans(pokemon, centers = i, nstart = 20, iter.max = 50)
# Save the within cluster sum of squares
wss[i] <- km.out$tot.withinss
}
# Produce a scree plot
plot(1:15, wss, type = "b",
xlab = "Number of Clusters",
ylab = "Within groups sum of squares")
Figure 1.4 Scree plot
# Select number of clusters
k <- 4
# Build model with k clusters: km.out
km.pokemon <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)
# View the resulting model
km.pokemon
## K-means clustering with 4 clusters of sizes 283, 114, 115, 288
##
## Cluster means:
## HitPoints Attack Defense SpecialAttack SpecialDefense Speed
## 1 50.29682 54.03180 51.62898 47.90459 49.15548 49.74912
## 2 89.20175 121.09649 92.73684 120.45614 97.67544 100.44737
## 3 71.30435 92.91304 121.42609 63.89565 88.23478 52.36522
## 4 79.18056 81.31944 69.19097 82.01042 77.53125 80.10417
##
## Clustering vector:
## [1] 1 4 4 2 1 4 4 2 2 1 4 4 2 1 1 4 1 1 4 4 1 1 4 2 1 4 1 4 1 4 1 4 1 3 1 1 4
## [38] 1 1 4 1 4 1 4 1 4 1 4 1 4 4 1 3 1 4 1 4 1 4 1 4 1 4 1 2 1 1 4 1 4 4 2 1 4
## [75] 3 1 4 4 1 4 1 3 3 4 4 1 3 3 1 4 1 1 4 1 4 1 4 1 3 1 4 4 2 3 1 4 1 3 1 4 1
## [112] 4 1 3 4 3 1 1 3 1 3 4 4 4 2 1 4 1 4 1 4 4 4 4 4 4 3 2 4 1 4 2 4 1 1 4 4 2
## [149] 4 1 3 1 3 4 2 4 2 2 2 1 4 2 2 2 2 2 1 4 4 1 4 4 1 4 4 1 4 1 4 1 4 1 1 4 1
## [186] 4 1 1 1 1 4 1 4 1 1 4 2 3 1 4 3 4 1 1 4 1 1 4 4 1 3 4 3 4 4 4 1 4 4 1 3 4
## [223] 3 3 3 1 4 4 3 3 3 4 3 4 1 4 1 3 1 4 1 1 4 1 4 3 1 4 2 4 1 3 4 4 1 1 3 1 1
## [260] 1 4 4 2 2 3 1 4 2 2 2 2 2 1 4 4 2 1 4 2 2 1 4 4 2 1 4 1 4 1 1 4 1 1 1 1 4
## [297] 1 1 4 1 4 1 4 1 1 4 2 1 4 1 4 1 4 2 1 4 1 1 1 4 1 4 1 3 1 1 1 3 1 3 1 3 3
## [334] 3 1 4 4 1 4 2 4 4 4 4 4 1 4 1 4 2 4 4 1 4 2 3 1 4 1 1 1 4 1 4 1 4 2 4 4 4
## [371] 4 1 4 1 4 1 3 1 3 1 3 1 4 4 3 1 4 2 1 3 4 4 4 2 1 1 4 2 1 4 4 1 3 3 3 1 1
## [408] 3 2 2 1 3 3 2 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 1 3 3 1 4 4 1 4 4 1 1 4
## [445] 1 4 1 1 1 1 4 1 4 1 4 3 3 1 3 3 3 4 1 3 4 1 4 1 4 1 4 4 1 4 1 4 2 4 4 1 4
## [482] 1 1 4 1 3 1 1 1 4 3 1 4 2 2 4 1 2 2 1 3 1 3 1 4 4 1 4 1 1 4 2 4 3 3 3 3 2
## [519] 2 4 4 3 4 3 4 4 4 2 3 3 4 4 4 4 4 4 4 3 2 2 2 2 2 2 2 2 3 4 2 2 2 2 2 2 1
## [556] 4 4 1 4 4 1 4 4 1 4 1 1 3 1 4 1 4 1 4 1 4 1 4 1 1 4 1 4 1 3 3 1 4 1 4 4 3
## [593] 1 3 3 1 4 4 3 4 1 3 4 1 1 4 1 4 1 4 4 1 1 4 1 4 4 4 1 3 1 3 4 1 3 3 3 4 2
## [630] 1 4 1 4 1 4 1 4 4 1 1 4 1 4 1 4 4 1 4 4 1 3 1 4 1 4 4 1 4 1 3 1 3 3 1 4 4
## [667] 1 4 1 1 4 1 4 2 1 4 4 1 4 4 1 4 3 1 3 1 3 3 1 4 1 3 4 3 1 4 2 1 4 2 2 2 2
## [704] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 3 3 1 4 4 1 4 4 1 4 1 1 4 1 1 4 1 4 1 1 4
## [741] 1 4 1 4 4 1 4 4 1 3 2 3 1 4 1 4 1 4 1 3 1 3 1 4 1 4 1 3 1 4 4 4 4 3 1 4 2
## [778] 4 1 4 1 1 1 1 3 3 3 3 1 3 1 4 2 2 2 3 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 513476.6 408271.9 455965.7 871531.3
## (between_SS / total_SS = 47.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Plot of Defense vs. Speed by cluster membership
plot(pokemon[, c("Defense", "Speed")],
col = km.pokemon$cluster,
main = paste("k-means clustering of Pokemon with", k, "clusters"),
xlab = "Defense", ylab = "Speed")
Figure 1.5 Defense vs. Speed by cluster membership
# Create hierarchical clustering model: hclust.out
hclust.out <- hclust(dist(x))
# Inspect the result
summary(hclust.out)
## Length Class Mode
## merge 98 -none- numeric
## height 49 -none- numeric
## order 50 -none- numeric
## labels 0 -none- NULL
## method 1 -none- character
## call 2 -none- call
## dist.method 1 -none- character
plot(hclust.out)
abline(h = 7, col = "red")
# Cut by height
cutree(hclust.out, h = 7)
## [1] 1 2 3 1 2 4 2 3 2 4 5 4 3 2 3 3 2 1 6 3 2 3 3 3 2 3 1 5 6 2 3 2 3 2 6 6 4 5
## [39] 1 3 5 2 5 2 3 3 3 6 2 1
# Cut by number of clusters
cutree(hclust.out, k = 3)
## [1] 1 2 1 1 2 3 2 1 2 3 1 3 1 2 1 1 2 1 2 1 2 1 1 1 2 1 1 1 2 2 1 2 1 2 2 2 3 1
## [39] 1 1 1 2 1 2 1 1 1 2 2 1
2. Linkage rules of thumb: * complete and average linkage produce more balanced trees and are more commonly used * single linkage fuses observations one at a time and produces more unbalanced trees * centroid linkage can create inversions, where clusters are put below single values; it’s not used often
3. Practical matters (see the scaling sketch below): * data needs to be scaled so that features have the same mean and standard deviation * normalized features have a mean of zero and a standard deviation of one
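A minimal check of that normalization, using the synthetic matrix x from earlier:
# scale() centers each column to mean 0 and rescales it to sd 1
x.scaled <- scale(x)
round(colMeans(x.scaled), 10) # approximately 0 for each column
apply(x.scaled, 2, sd)        # exactly 1 for each column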
# Cluster using complete linkage: hclust.complete
hclust.complete <- hclust(dist(x), method = "complete")
# Cluster using average linkage: hclust.average
hclust.average <- hclust(dist(x), method = "average")
# Cluster using single linkage: hclust.single
hclust.single <- hclust(dist(x), method = "single")
# Plot dendrogram of hclust.complete
plot(hclust.complete, main = "Complete")
# Plot dendrogram of hclust.average
plot(hclust.average, main = "Average")
# Plot dendrogram of hclust.single
plot(hclust.single, main = "Single")
Recall from the video that clustering real data may require scaling the features if they have different distributions. So far in this chapter, you have been working with synthetic data that did not need scaling.
# View column standard deviations
apply(pokemon, 2, sd)
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 25.53467 32.45737 31.18350 32.72229 27.82892
## Speed
## 29.06047
# Scale the data
pokemon.scaled <- scale(pokemon)
# Create hierarchical clustering model: hclust.pokemon
hclust.pokemon <- hclust(dist(pokemon.scaled), method = "complete")
# Apply cutree() to hclust.pokemon: cut.pokemon
cut.pokemon <- cutree(hclust.pokemon, k = 3)
# Compare methods
table(km.pokemon$cluster, cut.pokemon)
## cut.pokemon
## 1 2 3
## 1 283 0 0
## 2 114 0 0
## 3 114 0 1
## 4 277 11 0
Introduction to PCA
Example: Principal components with the iris dataset * center and scale - for each point, subtract the column mean and divide by the column standard deviation * the summary of the model shows the proportion of variance explained by each principal component * the rotation matrix holds the loadings: the weight each original variable contributes to each principal component (verified in the check further below)
# Perform scaled PCA: pr.out
pr.out <- prcomp(x = pokemon, scale. = TRUE, center = TRUE)
summary(pr.out)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.6466 1.0457 0.8825 0.8489 0.65463 0.51681
## Proportion of Variance 0.4519 0.1822 0.1298 0.1201 0.07142 0.04451
## Cumulative Proportion 0.4519 0.6342 0.7640 0.8841 0.95549 1.00000
# Inspect model output
pr.out
## Standard deviations (1, .., p=6):
## [1] 1.6466450 1.0457158 0.8824654 0.8489201 0.6546298 0.5168055
##
## Rotation (n x k) = (6 x 6):
## PC1 PC2 PC3 PC4 PC5
## HitPoints 0.3898858 0.08483455 -0.47192614 0.7176913 -0.21999056
## Attack 0.4392537 -0.01182493 -0.59415339 -0.4058359 0.19025457
## Defense 0.3637473 0.62878867 0.06933913 -0.4192373 -0.05903197
## SpecialAttack 0.4571623 -0.30541446 0.30561186 0.1475166 0.73534497
## SpecialDefense 0.4485704 0.23909670 0.56559403 0.1854448 -0.30019970
## Speed 0.3354405 -0.66846305 0.07851327 -0.2971625 -0.53016082
## PC6
## HitPoints 0.2336690
## Attack -0.5029896
## Defense 0.5368986
## SpecialAttack 0.2045304
## SpecialDefense -0.5451707
## Speed 0.2551400
pr.out$rotation
## PC1 PC2 PC3 PC4 PC5
## HitPoints 0.3898858 0.08483455 -0.47192614 0.7176913 -0.21999056
## Attack 0.4392537 -0.01182493 -0.59415339 -0.4058359 0.19025457
## Defense 0.3637473 0.62878867 0.06933913 -0.4192373 -0.05903197
## SpecialAttack 0.4571623 -0.30541446 0.30561186 0.1475166 0.73534497
## SpecialDefense 0.4485704 0.23909670 0.56559403 0.1854448 -0.30019970
## Speed 0.3354405 -0.66846305 0.07851327 -0.2971625 -0.53016082
## PC6
## HitPoints 0.2336690
## Attack -0.5029896
## Defense 0.5368986
## SpecialAttack 0.2045304
## SpecialDefense -0.5451707
## Speed 0.2551400
pr.out$scale
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 25.53467 32.45737 31.18350 32.72229 27.82892
## Speed
## 29.06047
pr.out$center
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 69.25875 79.00125 73.84250 72.82000 71.90250
## Speed
## 68.27750
head(pr.out$x,10)
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] -1.5554017 -0.02146869 0.6660872 0.18406113 0.403554706 -0.30281472
## [2,] -0.3626397 -0.05023711 0.6674959 0.26908616 0.225647228 -0.19436515
## [3,] 1.2793512 -0.06268101 0.6235239 0.33118417 0.001544326 -0.06813435
## [4,] 2.6192779 0.70382271 0.9949151 -0.19919599 0.309976280 0.08732570
## [5,] -1.7571845 -0.70573748 0.4111965 -0.26843393 0.168771440 -0.06932488
## [6,] -0.4353607 -0.74735577 0.4059051 -0.04938257 0.061009227 0.13969565
## [7,] 1.3323690 -0.84380142 0.4459891 0.05328867 0.039156957 0.32218086
## [8,] 2.6332254 -0.39114750 -0.1265620 -0.87086758 0.718241514 0.30875625
## [9,] 2.7851485 -1.06001436 1.1565731 0.22853461 0.956385206 -0.26293575
## [10,] -1.6463371 0.47561588 0.5726312 -0.10048350 0.086209248 -0.11271708
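As a check on the rotation note above: the scores in pr.out$x are just the centered and scaled data multiplied by the rotation (loadings) matrix.
# Reconstruct the PCA scores by hand from the loadings
manual.scores <- scale(pokemon) %*% pr.out$rotation
all.equal(manual.scores, pr.out$x, check.attributes = FALSE)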
# Variability of each principal component: pr.var
pr.var <- pr.out$sdev^2
# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
pve
## [1] 0.45190665 0.18225358 0.12979086 0.12011089 0.07142337 0.04451466
# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
Importance of scaling:
* the mtcars dataset has variables with very different means and standard deviations
* look at the results of the biplots with and without scaling (see the sketch after the next paragraph)
* without scaling, disp and hp overwhelm the other features simply because they have much higher variance in their units of measure
Sometimes scaling is appropriate when the variances of the variables are substantially different. This is commonly the case when variables have different units of measurement, for example, degrees Fahrenheit (temperature) and miles (distance). Making the decision to use scaling is an important step in performing a principal component analysis.
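A minimal sketch of that mtcars comparison (mtcars ships with base R; in the unscaled biplot, the disp and hp arrows dominate):
# Biplots with and without scaling to unit variance
biplot(prcomp(mtcars, scale. = FALSE))
biplot(prcomp(mtcars, scale. = TRUE))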
# Mean of each variable
colMeans(pokemon)
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 69.25875 79.00125 73.84250 72.82000 71.90250
## Speed
## 68.27750
# Standard deviation of each variable
apply(pokemon, 2, sd)
## HitPoints Attack Defense SpecialAttack SpecialDefense
## 25.53467 32.45737 31.18350 32.72229 27.82892
## Speed
## 29.06047
# PCA model with scaling: pr.with.scaling
pr.with.scaling <- prcomp(pokemon, scale. = TRUE)
# PCA model without scaling: pr.without.scaling
pr.without.scaling <- prcomp(pokemon, scale. = FALSE)
# Create biplots of both for comparison
biplot(pr.with.scaling)
biplot(pr.without.scaling)
Preparing the data
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1903/datasets/WisconsinCancer.csv"
# Download the data: wisc.df
wisc.df <- read.csv(url)
# Convert the features of the data: wisc.data
wisc.data <- as.matrix(wisc.df[, 3:32])
# Set the row names of wisc.data
row.names(wisc.data) <- wisc.df$id
# Create diagnosis vector
diagnosis <- as.numeric(wisc.df$diagnosis == "M")
head(wisc.data)
## radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 842302 17.99 10.38 122.80 1001.0 0.11840
## 842517 20.57 17.77 132.90 1326.0 0.08474
## 84300903 19.69 21.25 130.00 1203.0 0.10960
## 84348301 11.42 20.38 77.58 386.1 0.14250
## 84358402 20.29 14.34 135.10 1297.0 0.10030
## 843786 12.45 15.70 82.57 477.1 0.12780
## compactness_mean concavity_mean concave.points_mean symmetry_mean
## 842302 0.27760 0.3001 0.14710 0.2419
## 842517 0.07864 0.0869 0.07017 0.1812
## 84300903 0.15990 0.1974 0.12790 0.2069
## 84348301 0.28390 0.2414 0.10520 0.2597
## 84358402 0.13280 0.1980 0.10430 0.1809
## 843786 0.17000 0.1578 0.08089 0.2087
## fractal_dimension_mean radius_se texture_se perimeter_se area_se
## 842302 0.07871 1.0950 0.9053 8.589 153.40
## 842517 0.05667 0.5435 0.7339 3.398 74.08
## 84300903 0.05999 0.7456 0.7869 4.585 94.03
## 84348301 0.09744 0.4956 1.1560 3.445 27.23
## 84358402 0.05883 0.7572 0.7813 5.438 94.44
## 843786 0.07613 0.3345 0.8902 2.217 27.19
## smoothness_se compactness_se concavity_se concave.points_se
## 842302 0.006399 0.04904 0.05373 0.01587
## 842517 0.005225 0.01308 0.01860 0.01340
## 84300903 0.006150 0.04006 0.03832 0.02058
## 84348301 0.009110 0.07458 0.05661 0.01867
## 84358402 0.011490 0.02461 0.05688 0.01885
## 843786 0.007510 0.03345 0.03672 0.01137
## symmetry_se fractal_dimension_se radius_worst texture_worst
## 842302 0.03003 0.006193 25.38 17.33
## 842517 0.01389 0.003532 24.99 23.41
## 84300903 0.02250 0.004571 23.57 25.53
## 84348301 0.05963 0.009208 14.91 26.50
## 84358402 0.01756 0.005115 22.54 16.67
## 843786 0.02165 0.005082 15.47 23.75
## perimeter_worst area_worst smoothness_worst compactness_worst
## 842302 184.60 2019.0 0.1622 0.6656
## 842517 158.80 1956.0 0.1238 0.1866
## 84300903 152.50 1709.0 0.1444 0.4245
## 84348301 98.87 567.7 0.2098 0.8663
## 84358402 152.20 1575.0 0.1374 0.2050
## 843786 103.40 741.6 0.1791 0.5249
## concavity_worst concave.points_worst symmetry_worst
## 842302 0.7119 0.2654 0.4601
## 842517 0.2416 0.1860 0.2750
## 84300903 0.4504 0.2430 0.3613
## 84348301 0.6869 0.2575 0.6638
## 84358402 0.4000 0.1625 0.2364
## 843786 0.5355 0.1741 0.3985
## fractal_dimension_worst
## 842302 0.11890
## 842517 0.08902
## 84300903 0.08758
## 84348301 0.17300
## 84358402 0.07678
## 843786 0.12440
The next step in your analysis is to perform PCA on wisc.data.
# Check column means and standard deviations
colMeans(wisc.data)
## radius_mean texture_mean perimeter_mean
## 1.412729e+01 1.928965e+01 9.196903e+01
## area_mean smoothness_mean compactness_mean
## 6.548891e+02 9.636028e-02 1.043410e-01
## concavity_mean concave.points_mean symmetry_mean
## 8.879932e-02 4.891915e-02 1.811619e-01
## fractal_dimension_mean radius_se texture_se
## 6.279761e-02 4.051721e-01 1.216853e+00
## perimeter_se area_se smoothness_se
## 2.866059e+00 4.033708e+01 7.040979e-03
## compactness_se concavity_se concave.points_se
## 2.547814e-02 3.189372e-02 1.179614e-02
## symmetry_se fractal_dimension_se radius_worst
## 2.054230e-02 3.794904e-03 1.626919e+01
## texture_worst perimeter_worst area_worst
## 2.567722e+01 1.072612e+02 8.805831e+02
## smoothness_worst compactness_worst concavity_worst
## 1.323686e-01 2.542650e-01 2.721885e-01
## concave.points_worst symmetry_worst fractal_dimension_worst
## 1.146062e-01 2.900756e-01 8.394582e-02
apply(wisc.data, 2, sd)
## radius_mean texture_mean perimeter_mean
## 3.524049e+00 4.301036e+00 2.429898e+01
## area_mean smoothness_mean compactness_mean
## 3.519141e+02 1.406413e-02 5.281276e-02
## concavity_mean concave.points_mean symmetry_mean
## 7.971981e-02 3.880284e-02 2.741428e-02
## fractal_dimension_mean radius_se texture_se
## 7.060363e-03 2.773127e-01 5.516484e-01
## perimeter_se area_se smoothness_se
## 2.021855e+00 4.549101e+01 3.002518e-03
## compactness_se concavity_se concave.points_se
## 1.790818e-02 3.018606e-02 6.170285e-03
## symmetry_se fractal_dimension_se radius_worst
## 8.266372e-03 2.646071e-03 4.833242e+00
## texture_worst perimeter_worst area_worst
## 6.146258e+00 3.360254e+01 5.693570e+02
## smoothness_worst compactness_worst concavity_worst
## 2.283243e-02 1.573365e-01 2.086243e-01
## concave.points_worst symmetry_worst fractal_dimension_worst
## 6.573234e-02 6.186747e-02 1.806127e-02
# Execute PCA, scaling if appropriate: wisc.pr
wisc.pr <- prcomp(wisc.data, scale. = TRUE, center = TRUE)
# Look at summary of results
summary(wisc.pr)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 3.6444 2.3857 1.67867 1.40735 1.28403 1.09880 0.82172
## Proportion of Variance 0.4427 0.1897 0.09393 0.06602 0.05496 0.04025 0.02251
## Cumulative Proportion 0.4427 0.6324 0.72636 0.79239 0.84734 0.88759 0.91010
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.69037 0.6457 0.59219 0.5421 0.51104 0.49128 0.39624
## Proportion of Variance 0.01589 0.0139 0.01169 0.0098 0.00871 0.00805 0.00523
## Cumulative Proportion 0.92598 0.9399 0.95157 0.9614 0.97007 0.97812 0.98335
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 0.30681 0.28260 0.24372 0.22939 0.22244 0.17652 0.1731
## Proportion of Variance 0.00314 0.00266 0.00198 0.00175 0.00165 0.00104 0.0010
## Cumulative Proportion 0.98649 0.98915 0.99113 0.99288 0.99453 0.99557 0.9966
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 0.16565 0.15602 0.1344 0.12442 0.09043 0.08307 0.03987
## Proportion of Variance 0.00091 0.00081 0.0006 0.00052 0.00027 0.00023 0.00005
## Cumulative Proportion 0.99749 0.99830 0.9989 0.99942 0.99969 0.99992 0.99997
## PC29 PC30
## Standard deviation 0.02736 0.01153
## Proportion of Variance 0.00002 0.00000
## Cumulative Proportion 1.00000 1.00000
We can see from the charts that the two diagnosis groups overlap less in the PC1 vs. PC2 plot than in the PC1 vs. PC3 plot. This is expected, since PC1 and PC2 explain the most variance (all principal components are orthogonal to each other and explain different portions of the variance). The groups overlap more in the PC2 vs. PC3 plot than in either plot involving PC1.
par(mfrow=c(2,2))
# Create a biplot of wisc.pr
biplot(wisc.pr)
# Scatter plot observations by components 1 and 2
plot(wisc.pr$x[, c(1, 2)], col = (diagnosis + 1),
xlab = "PC1", ylab = "PC2")
# Repeat for components 1 and 3
plot(wisc.pr$x[, c(1, 3)], col = (diagnosis + 1),
xlab = "PC1", ylab = "PC3")
# Do additional data exploration of your choosing below (optional)
plot(wisc.pr$x[, c(2, 3)], col = (diagnosis + 1),
xlab = "PC2", ylab = "PC3")
# Set up 1 x 2 plotting grid
par(mfrow = c(1, 2))
# Calculate variability of each component
pr.var <- wisc.pr$sdev^2
# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
As part of the preparation for hierarchical clustering, the distances between all pairs of observations are computed. Furthermore, there are different ways to link clusters together, with single, complete, and average being the most common linkage methods.
wisc.pr$rotation[1:10,1:2]
## PC1 PC2
## radius_mean -0.21890244 0.23385713
## texture_mean -0.10372458 0.05970609
## perimeter_mean -0.22753729 0.21518136
## area_mean -0.22099499 0.23107671
## smoothness_mean -0.14258969 -0.18611302
## compactness_mean -0.23928535 -0.15189161
## concavity_mean -0.25840048 -0.06016536
## concave.points_mean -0.26085376 0.03476750
## symmetry_mean -0.13816696 -0.19034877
## fractal_dimension_mean -0.06436335 -0.36657547
# Scale the wisc.data data: data.scaled
data.scaled <- scale(wisc.data)
head(data.scaled)
## radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 842302 1.0960995 -2.0715123 1.2688173 0.9835095 1.5670875
## 842517 1.8282120 -0.3533215 1.6844726 1.9070303 -0.8262354
## 84300903 1.5784992 0.4557859 1.5651260 1.5575132 0.9413821
## 84348301 -0.7682333 0.2535091 -0.5921661 -0.7637917 3.2806668
## 84358402 1.7487579 -1.1508038 1.7750113 1.8246238 0.2801253
## 843786 -0.4759559 -0.8346009 -0.3868077 -0.5052059 2.2354545
## compactness_mean concavity_mean concave.points_mean symmetry_mean
## 842302 3.2806281 2.65054179 2.5302489 2.215565542
## 842517 -0.4866435 -0.02382489 0.5476623 0.001391139
## 84300903 1.0519999 1.36227979 2.0354398 0.938858720
## 84348301 3.3999174 1.91421287 1.4504311 2.864862154
## 84358402 0.5388663 1.36980615 1.4272370 -0.009552062
## 843786 1.2432416 0.86554001 0.8239307 1.004517928
## fractal_dimension_mean radius_se texture_se perimeter_se area_se
## 842302 2.2537638 2.4875451 -0.5647681 2.8305403 2.4853907
## 842517 -0.8678888 0.4988157 -0.8754733 0.2630955 0.7417493
## 84300903 -0.3976580 1.2275958 -0.7793976 0.8501802 1.1802975
## 84348301 4.9066020 0.3260865 -0.1103120 0.2863415 -0.2881246
## 84358402 -0.5619555 1.2694258 -0.7895490 1.2720701 1.1893103
## 843786 1.8883435 -0.2548461 -0.5921406 -0.3210217 -0.2890039
## smoothness_se compactness_se concavity_se concave.points_se
## 842302 -0.2138135 1.31570389 0.7233897 0.66023900
## 842517 -0.6048187 -0.69231710 -0.4403926 0.25993335
## 84300903 -0.2967439 0.81425704 0.2128891 1.42357487
## 84348301 0.6890953 2.74186785 0.8187979 1.11402678
## 84358402 1.4817634 -0.04847723 0.8277425 1.14319885
## 843786 0.1562093 0.44515196 0.1598845 -0.06906279
## symmetry_se fractal_dimension_se radius_worst texture_worst
## 842302 1.1477468 0.90628565 1.8850310 -1.35809849
## 842517 -0.8047423 -0.09935632 1.8043398 -0.36887865
## 84300903 0.2368272 0.29330133 1.5105411 -0.02395331
## 84348301 4.7285198 2.04571087 -0.2812170 0.13386631
## 84358402 -0.3607748 0.49888916 1.2974336 -1.46548091
## 843786 0.1340009 0.48641784 -0.1653528 -0.31356043
## perimeter_worst area_worst smoothness_worst compactness_worst
## 842302 2.3015755 1.9994782 1.3065367 2.6143647
## 842517 1.5337764 1.8888270 -0.3752817 -0.4300658
## 84300903 1.3462906 1.4550043 0.5269438 1.0819801
## 84348301 -0.2497196 -0.5495377 3.3912907 3.8899747
## 84358402 1.3373627 1.2196511 0.2203623 -0.3131190
## 843786 -0.1149083 -0.2441054 2.0467119 1.7201029
## concavity_worst concave.points_worst symmetry_worst
## 842302 2.1076718 2.2940576 2.7482041
## 842517 -0.1466200 1.0861286 -0.2436753
## 84300903 0.8542223 1.9532817 1.1512420
## 84348301 1.9878392 2.1738732 6.0407261
## 84358402 0.6126397 0.7286181 -0.8675896
## 843786 1.2621327 0.9050914 1.7525273
## fractal_dimension_worst
## 842302 1.9353117
## 842517 0.2809428
## 84300903 0.2012142
## 84348301 4.9306719
## 84358402 -0.3967505
## 843786 2.2398308
# Calculate the (Euclidean) distances: data.dist
data.dist <- dist(data.scaled)
# Create a hierarchical clustering model: wisc.hclust
wisc.hclust <- hclust(data.dist, method = "complete")
plot(wisc.hclust, col = "blue")
# Cut tree so that it has 4 clusters: wisc.hclust.clusters
wisc.hclust.clusters <- cutree(wisc.hclust, k = 4)
# Compare cluster membership to actual diagnoses
table(wisc.hclust.clusters, diagnosis)
## diagnosis
## wisc.hclust.clusters 0 1
## 1 12 165
## 2 2 5
## 3 343 40
## 4 0 2
# count out of place observations based on cluster
# basically just summing the row mins here
sum(apply(table(wisc.hclust.clusters, diagnosis), 1, min))
## [1] 54
# Looks like 54 observations whose cluster assignment disagrees with the
# majority diagnosis of their cluster
As you now know, there are two main types of clustering: hierarchical and k-means.
# Create a k-means model on wisc.data: wisc.km
wisc.km <- kmeans(scale(wisc.data), centers = 2, nstart = 20)
# Compare k-means to actual diagnoses
table(wisc.km$cluster, diagnosis)
## diagnosis
## 0 1
## 1 14 175
## 2 343 37
sum(apply(table(wisc.km$cluster, diagnosis), 1, min))
## [1] 51
# Compare k-means to hierarchical clustering
table(wisc.hclust.clusters, wisc.km$cluster)
##
## wisc.hclust.clusters 1 2
## 1 160 17
## 2 7 0
## 3 20 363
## 4 2 0
sum(apply(table(wisc.hclust.clusters, wisc.km$cluster), 1, min))
## [1] 37
Recall from earlier exercises that the PCA model required significantly fewer features to describe 80% and 95% of the variability of the data. In addition to normalizing the data and potentially avoiding overfitting, PCA also decorrelates the variables, sometimes improving the performance of other modeling techniques.
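A quick check of that claim, using the pve vector computed in the scree-plot chunk above (per the PCA summary, 5 PCs reach 80% cumulative variance and 10 PCs reach 95%):
# First principal component at which cumulative variance crosses each threshold
min(which(cumsum(pve) >= 0.80))
min(which(cumsum(pve) >= 0.95))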
# Create a hierarchical clustering model: wisc.pr.hclust
wisc.pr.hclust <- hclust(dist(wisc.pr$x[, 1:7]), method = "complete")
# Cut model into 4 clusters: wisc.pr.hclust.clusters
wisc.pr.hclust.clusters <- cutree(wisc.pr.hclust, k = 4)
# Compare to actual diagnoses
t <- table(diagnosis, wisc.pr.hclust.clusters)
t
## wisc.pr.hclust.clusters
## diagnosis 1 2 3 4
## 0 5 350 2 0
## 1 113 97 0 2
sum(apply(t, 1, min))
## [1] 0
# Compare to k-means and hierarchical
t <- table(diagnosis, wisc.hclust.clusters)
t
## wisc.hclust.clusters
## diagnosis 1 2 3 4
## 0 12 2 343 0
## 1 165 5 40 2
sum(apply(t, 1, min))
## [1] 2
t <- table(diagnosis, wisc.km$cluster)
t
##
## diagnosis 1 2
## 0 14 343
## 1 175 37
sum(apply(t, 1, min))
## [1] 51
I think it might be worth adding the k-means clusters to a model. Maybe. I guess I could just try it with and without and see which is best at predicting, now that I know how to do that. : )
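A minimal sketch of that idea, assuming logistic regression on the first 7 PCs as the base model (the variable names here are hypothetical, and a proper comparison would use held-out data rather than in-sample fit):
# Add the k-means cluster label as a factor feature and compare fits
model.df <- as.data.frame(wisc.pr$x[, 1:7])
model.df$diagnosis <- diagnosis
model.df$cluster <- factor(wisc.km$cluster)
# (glm may warn about fitted probabilities near 0/1 here)
fit.without <- glm(diagnosis ~ . - cluster, data = model.df, family = binomial)
fit.with <- glm(diagnosis ~ ., data = model.df, family = binomial)
# Lower AIC suggests the better in-sample trade-off
AIC(fit.without, fit.with)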