Libraries

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(class)
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(assertive)
## 
## Attaching package: 'assertive'
## The following objects are masked from 'package:purrr':
## 
##     is_atomic, is_character, is_double, is_empty, is_formula,
##     is_function, is_integer, is_list, is_logical, is_null, is_numeric,
##     is_vector
## The following object is masked from 'package:tibble':
## 
##     has_rownames
library(fst)
library(broom)
library(naivebayes)
## naivebayes 0.9.7 loaded
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(rpart)
library(rpart.plot)
library(sigr)
library(WVPlots)
## Loading required package: wrapr
## 
## Attaching package: 'wrapr'
## The following object is masked from 'package:dplyr':
## 
##     coalesce
## The following objects are masked from 'package:tidyr':
## 
##     pack, unpack
## The following object is masked from 'package:tibble':
## 
##     view
library(vtreat)
library(stringr)

Required Data

pokemon_raw <- read.csv("https://assets.datacamp.com/production/course_6430/datasets/Pokemon.csv")

#write.csv(pokemon_raw, "C:/Users/4/Documents/R/pokemon_raw.csv", row.names = FALSE)

wisc.df <- read.csv("https://assets.datacamp.com/production/course_6430/datasets/WisconsinCancer.csv")

Course Description

Three main types of machine learning:

  1. unsupervised learning
    • goal is to find structure in unlabeled data
    • unlabeled data is data with no targets
  2. supervised learning
    • regression or classification
    • goal is to predict the amount or the label
  3. reinforcement learning
    • a computer learns by feedback from operating in a real or synthetic environment
Two major goals:
  1. find homogeneous subgroups within a population
  • these subgroups are called clusters
  • example: segmenting a market of consumers based on demographic features and purchasing history
  • example: find similar movies based on features of each movie and reviews of the movies
  2. find patterns in the features of the data
  • dimensionality reduction - a method to decrease the number of features to describe an observation while maintaining the maximum information content under the constraints of lower dimensionality
Dimensionality Reduction
  1. find patterns in the features of the data

  2. visualization of high dimensional data

    • it's hard to produce (and consume) good visualizations beyond 3 or 4 dimensions

  3. pre-processing step for supervised learning

Chapter 1 => Unsupervised learning in R

  1. k-means is an algorithm used to find homogeneous subgroups in a population

  2. K-means comes in base R

    • need the data
    • number of centers or groups
    • number of runs: kmeans() starts by randomly assigning points to groups, so it can get stuck in local minima; running it multiple times helps you find the global minimum
  3. You can run kmeans() many times to estimate the number of subgroups when it is not known a priori

x <- matrix(rnorm(100, 1, 3), nrow = 50, ncol = 2)
head(x)
##           [,1]       [,2]
## [1,] -1.070823  3.9289233
## [2,]  3.090331 -0.4667977
## [3,]  1.101040  2.4800939
## [4,] -1.057556  3.3155996
## [5,]  1.311827 -0.5061106
## [6,] -4.429845 -2.3364146
# Create the k-means model: km.out
km.out <- kmeans(x, centers=3, nstart=20)

# Inspect the result
summary(km.out)
##              Length Class  Mode   
## cluster      50     -none- numeric
## centers       6     -none- numeric
## totss         1     -none- numeric
## withinss      3     -none- numeric
## tot.withinss  1     -none- numeric
## betweenss     1     -none- numeric
## size          3     -none- numeric
## iter          1     -none- numeric
## ifault        1     -none- numeric
# Print the cluster membership component of the model
km.out$cluster
##  [1] 2 1 2 2 1 3 3 2 1 3 2 3 2 1 2 2 1 3 1 2 1 2 2 2 1 2 3 2 1 1 2 3 2 1 1 1 3 2
## [39] 3 2 2 1 2 1 2 2 2 1 1 3
# Print the km.out object
print(km.out)
## K-means clustering with 3 clusters of sizes 17, 23, 10
## 
## Cluster means:
##        [,1]       [,2]
## 1  1.982061 -2.0621069
## 2  2.223571  3.2652442
## 3 -3.231963 -0.3378415
## 
## Clustering vector:
##  [1] 2 1 2 2 1 3 3 2 1 3 2 3 2 1 2 2 1 3 1 2 1 2 2 2 1 2 3 2 1 1 2 3 2 1 1 1 3 2
## [39] 3 2 2 1 2 1 2 2 2 1 1 3
## 
## Within cluster sum of squares by cluster:
## [1]  93.06422 148.86446  94.29089
##  (between_SS / total_SS =  60.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# Scatter plot of x
plot(x, col=km.out$cluster, xlab="", ylab="")
Figure 1.1 K-means with 3 Clusters

How kmeans() works and practical matters

  1. Process of k-means (a hand-rolled sketch follows this list):
    • randomly assign all points to a cluster
    • calculate the center of each cluster
    • reassign each point to the cluster with the nearest center
    • recalculate the centers from the reassigned points
    • repeat until no points change cluster
  2. model selection:
    • best outcome is based on total within cluster sum of squares
    • run many times to get global optimum
    • R will automatically take the run with the lowest total withinss
  3. determining number of clusters
    • scree plot
    • look for the elbow
    • find where adding a new cluster does not change the best tot.withinss much
    • there usually is no clear elbow in real world data
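
To make the process above concrete, here is a minimal hand-rolled sketch of the k-means loop. This is for intuition only (manual_kmeans is a throwaway name of mine and empty clusters are not handled); in practice you would just call kmeans(). It reuses the synthetic matrix x from above.

# Minimal sketch of the k-means loop described above (intuition only)
manual_kmeans <- function(x, k, max_iter = 100) {
  # randomly assign all points to a cluster
  cluster <- sample(seq_len(k), nrow(x), replace = TRUE)
  for (iter in seq_len(max_iter)) {
    # calculate the center of each cluster (empty clusters not handled)
    centers <- t(sapply(seq_len(k), function(j) colMeans(x[cluster == j, , drop = FALSE])))
    # reassign each point to the cluster with the nearest center
    d <- as.matrix(dist(rbind(centers, x)))[-seq_len(k), seq_len(k)]
    new_cluster <- apply(d, 1, which.min)
    # stop when no point changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers, iterations = iter)
}

manual_kmeans(x, 3)$centers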

Handling random algorithms

In the video, you saw how kmeans() randomly initializes the centers of clusters. This random initialization can result in assigning observations to different cluster labels. It can also cause the algorithm to land in different local minima. This exercise will demonstrate both results.

# Set up 2 x 3 plotting grid
par(mfrow = c(2, 3))

# Set seed
set.seed(1)

for(i in 1:6) {
  # Run kmeans() on x with three clusters and one start
  km.out <- kmeans(x, centers=3, nstart=1)
  
  # Plot clusters
  plot(x, col = km.out$cluster, 
       main = km.out$tot.withinss, 
       xlab = "", ylab = "")
}
Figure 1.2 Clusters

Selecting number of clusters

The k-means algorithm assumes the number of clusters as part of the input. If you know the number of clusters in advance (e.g. due to certain business constraints) this makes setting the number of clusters easy. However, as you saw in the video, if you do not know the number of clusters and need to determine it, you will need to run the algorithm multiple times, each time with a different number of clusters. From this, you can observe how a measure of model quality changes with the number of clusters.

In this exercise, you will run kmeans() multiple times to see how model quality changes as the number of clusters changes. Plots displaying this information help to determine the number of clusters and are often referred to as scree plots.

The ideal plot will have an elbow where the quality measure improves more slowly as the number of clusters increases. This indicates that the quality of the model is no longer improving substantially as the model complexity (i.e. number of clusters) increases. In other words, the elbow indicates the number of clusters inherent in the data.

# Initialize total within sum of squares error: wss
wss <- 0

# For 1 to 15 cluster centers
for (i in 1:15) {
  km.out <- kmeans(x, centers = i, nstart = 20)
  # Save total within sum of squares to wss variable
  wss[i] <- km.out$tot.withinss
}

# Plot total within sum of squares vs. number of clusters
plot(1:15, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")
Figure 1.3 The total within sum of squares vs. number of clusters

# Set k equal to the number of clusters corresponding to the elbow location
k <- 2  # 3 is probably OK, too

Practical matters: working with real data

Dealing with real data is often more challenging than dealing with synthetic data. Synthetic data helps with learning new concepts and techniques, but the next few exercises will deal with data that is closer to the type of real data you might find in your professional or academic pursuits.

The first challenge with the Pokemon data is that there is no pre-determined number of clusters. You will determine the appropriate number of clusters, keeping in mind that in real data the elbow in the scree plot might be less of a sharp elbow than in synthetic data. Use your judgement on making the determination of the number of clusters.

The second part of this exercise includes plotting the outcomes of the clustering on two dimensions, or features, of the data. These features were chosen somewhat arbitrarily for this exercise. Think about how you would use plotting and clustering to communicate interesting groups of Pokemon to other people.

An additional note: this exercise utilizes the iter.max argument to kmeans(). As you’ve seen, kmeans() is an iterative algorithm, repeating over and over until some stopping criterion is reached. The default number of iterations for kmeans() is 10, which is not enough for the algorithm to converge and reach its stopping criterion, so we’ll set the number of iterations to 50 to overcome this issue. To see what happens when kmeans() does not converge, try running the example with a lower number of iterations (e.g. 3). This is another example of what might happen when you encounter real data and use real cases.

pokemon <- pokemon_raw %>% select(6:11)
head(pokemon)
##   HitPoints Attack Defense SpecialAttack SpecialDefense Speed
## 1        45     49      49            65             65    45
## 2        60     62      63            80             80    60
## 3        80     82      83           100            100    80
## 4        80    100     123           122            120    80
## 5        39     52      43            60             50    65
## 6        58     64      58            80             65    80
# Initialize total within sum of squares error: wss
wss <- 0
# Look over 1 to 15 possible clusters
for (i in 1:15) {
  # Fit the model: km.out
  km.out <- kmeans(pokemon, centers = i, nstart = 20, iter.max = 50)
  # Save the within cluster sum of squares
  wss[i] <- km.out$tot.withinss
}

# Produce a scree plot
plot(1:15, wss, type = "b", 
     xlab = "Number of Clusters", 
     ylab = "Within groups sum of squares")
Figure 1.4 Scree plot

# Select number of clusters
k <- 4

# Build model with k clusters: km.out
km.pokemon <- kmeans(pokemon, centers = k, nstart = 20, iter.max = 50)

# View the resulting model
km.pokemon
## K-means clustering with 4 clusters of sizes 283, 114, 115, 288
## 
## Cluster means:
##   HitPoints    Attack   Defense SpecialAttack SpecialDefense     Speed
## 1  50.29682  54.03180  51.62898      47.90459       49.15548  49.74912
## 2  89.20175 121.09649  92.73684     120.45614       97.67544 100.44737
## 3  71.30435  92.91304 121.42609      63.89565       88.23478  52.36522
## 4  79.18056  81.31944  69.19097      82.01042       77.53125  80.10417
## 
## Clustering vector:
##   [1] 1 4 4 2 1 4 4 2 2 1 4 4 2 1 1 4 1 1 4 4 1 1 4 2 1 4 1 4 1 4 1 4 1 3 1 1 4
##  [38] 1 1 4 1 4 1 4 1 4 1 4 1 4 4 1 3 1 4 1 4 1 4 1 4 1 4 1 2 1 1 4 1 4 4 2 1 4
##  [75] 3 1 4 4 1 4 1 3 3 4 4 1 3 3 1 4 1 1 4 1 4 1 4 1 3 1 4 4 2 3 1 4 1 3 1 4 1
## [112] 4 1 3 4 3 1 1 3 1 3 4 4 4 2 1 4 1 4 1 4 4 4 4 4 4 3 2 4 1 4 2 4 1 1 4 4 2
## [149] 4 1 3 1 3 4 2 4 2 2 2 1 4 2 2 2 2 2 1 4 4 1 4 4 1 4 4 1 4 1 4 1 4 1 1 4 1
## [186] 4 1 1 1 1 4 1 4 1 1 4 2 3 1 4 3 4 1 1 4 1 1 4 4 1 3 4 3 4 4 4 1 4 4 1 3 4
## [223] 3 3 3 1 4 4 3 3 3 4 3 4 1 4 1 3 1 4 1 1 4 1 4 3 1 4 2 4 1 3 4 4 1 1 3 1 1
## [260] 1 4 4 2 2 3 1 4 2 2 2 2 2 1 4 4 2 1 4 2 2 1 4 4 2 1 4 1 4 1 1 4 1 1 1 1 4
## [297] 1 1 4 1 4 1 4 1 1 4 2 1 4 1 4 1 4 2 1 4 1 1 1 4 1 4 1 3 1 1 1 3 1 3 1 3 3
## [334] 3 1 4 4 1 4 2 4 4 4 4 4 1 4 1 4 2 4 4 1 4 2 3 1 4 1 1 1 4 1 4 1 4 2 4 4 4
## [371] 4 1 4 1 4 1 3 1 3 1 3 1 4 4 3 1 4 2 1 3 4 4 4 2 1 1 4 2 1 4 4 1 3 3 3 1 1
## [408] 3 2 2 1 3 3 2 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 1 3 3 1 4 4 1 4 4 1 1 4
## [445] 1 4 1 1 1 1 4 1 4 1 4 3 3 1 3 3 3 4 1 3 4 1 4 1 4 1 4 4 1 4 1 4 2 4 4 1 4
## [482] 1 1 4 1 3 1 1 1 4 3 1 4 2 2 4 1 2 2 1 3 1 3 1 4 4 1 4 1 1 4 2 4 3 3 3 3 2
## [519] 2 4 4 3 4 3 4 4 4 2 3 3 4 4 4 4 4 4 4 3 2 2 2 2 2 2 2 2 3 4 2 2 2 2 2 2 1
## [556] 4 4 1 4 4 1 4 4 1 4 1 1 3 1 4 1 4 1 4 1 4 1 4 1 1 4 1 4 1 3 3 1 4 1 4 4 3
## [593] 1 3 3 1 4 4 3 4 1 3 4 1 1 4 1 4 1 4 4 1 1 4 1 4 4 4 1 3 1 3 4 1 3 3 3 4 2
## [630] 1 4 1 4 1 4 1 4 4 1 1 4 1 4 1 4 4 1 4 4 1 3 1 4 1 4 4 1 4 1 3 1 3 3 1 4 4
## [667] 1 4 1 1 4 1 4 2 1 4 4 1 4 4 1 4 3 1 3 1 3 3 1 4 1 3 4 3 1 4 2 1 4 2 2 2 2
## [704] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 3 3 1 4 4 1 4 4 1 4 1 1 4 1 1 4 1 4 1 1 4
## [741] 1 4 1 4 4 1 4 4 1 3 2 3 1 4 1 4 1 4 1 3 1 3 1 4 1 4 1 3 1 4 4 4 4 3 1 4 2
## [778] 4 1 4 1 1 1 1 3 3 3 3 1 3 1 4 2 2 2 3 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 513476.6 408271.9 455965.7 871531.3
##  (between_SS / total_SS =  47.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# Plot of Defense vs. Speed by cluster membership
plot(pokemon[, c("Defense", "Speed")],
     col = km.pokemon$cluster,
     main = paste("k-means clustering of Pokemon with", k, "clusters"),
     xlab = "Defense", ylab = "Speed")
Figure 1.5 Defense vs. Speed by cluster membership

Chapter 2 => Hierarchical clustering

  1. Typically used when the number of clusters is not known ahead of time.
  2. Two approaches: bottom-up and top-down; we will focus on the bottom-up process
    • assign each point to its own cluster
    • join the two closest clusters/points into a new cluster
    • keep going until there is one cluster
    • the way you calculate the distance between clusters is a parameter and will be covered later
  3. We have to first calculate the Euclidean distance between all points (this makes a big matrix) using the dist() function
    • this is passed into the hclust() function
# Create hierarchical clustering model: hclust.out
hclust.out <- hclust(dist(x))

# Inspect the result
summary(hclust.out)
##             Length Class  Mode     
## merge       98     -none- numeric  
## height      49     -none- numeric  
## order       50     -none- numeric  
## labels       0     -none- NULL     
## method       1     -none- character
## call         2     -none- call     
## dist.method  1     -none- character

Selecting number of clusters

  1. You can build a dendrogram of the distances between points
  2. Then you either pick the number of clusters or the height (i.e. distance) at which to cut the tree; think of this as drawing a horizontal line across the dendrogram
  3. The cutree() function in R lets you split the hierarchical clusters into set clusters by number or by distance (height)
plot(hclust.out)
abline(h = 7, col = "red")

# Cut by height
cutree(hclust.out, h = 7)
##  [1] 1 2 3 1 2 4 2 3 2 4 5 4 3 2 3 3 2 1 6 3 2 3 3 3 2 3 1 5 6 2 3 2 3 2 6 6 4 5
## [39] 1 3 5 2 5 2 3 3 3 6 2 1
# Cut by number of clusters
cutree(hclust.out, k = 3)
##  [1] 1 2 1 1 2 3 2 1 2 3 1 3 1 2 1 1 2 1 2 1 2 1 1 1 2 1 1 1 2 2 1 2 1 2 2 2 3 1
## [39] 1 1 1 2 1 2 1 1 1 2 2 1

Clustering linkage and practical matters

  1. Four methods to measure distance between clusters.
    • complete: pairwise similarity between all observations in cluster 1 and 2, uses largest of similarities.
    • single: same as above but uses the smallest of similarities.
    • average: same as above but uses average of similarities.
    • centroid: finds the centroid of cluster 1 and 2, uses the similarity between the two centroids.

  2. rule of thumb:
    • complete and average produce more balanced trees and are more commonly used
    • single fuses observations in one at a time and produces more unbalanced trees
    • centroid can create inversions, where clusters are fused at a height below that of earlier merges; it is not used often

  3. practical matters:
    • data needs to be scaled so that features have the same mean and standard deviation
    • normalized features have a mean of zero and a sd of one (quick check below)
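
A quick check of that last point on the synthetic matrix x, using the same scale() function the later exercises rely on:

# After scale(), each column has mean ~0 and standard deviation 1
x.scaled <- scale(x)
round(colMeans(x.scaled), 10)
apply(x.scaled, 2, sd)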

# Cluster using complete linkage: hclust.complete
hclust.complete <- hclust(dist(x), method = "complete")

# Cluster using average linkage: hclust.average
hclust.average <- hclust(dist(x), method = "average")

# Cluster using single linkage: hclust.single
hclust.single <- hclust(dist(x), method = "single")

# Plot dendrogram of hclust.complete
plot(hclust.complete, main = "Complete")

# Plot dendrogram of hclust.average
plot(hclust.average, main = "Average")

# Plot dendrogram of hclust.single
plot(hclust.single, main = "Single")

Practical matters: scaling

Recall from the video that clustering real data may require scaling the features if they have different distributions. So far in this chapter, you have been working with synthetic data that did not need scaling.

# View column standard deviations
apply(pokemon, 2, sd)
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       25.53467       32.45737       31.18350       32.72229       27.82892 
##          Speed 
##       29.06047
# Scale the data
pokemon.scaled <- scale(pokemon)

# Create hierarchical clustering model: hclust.pokemon
hclust.pokemon <- hclust(dist(pokemon.scaled), method = "complete")

Comparing kmeans() and hclust()

  1. The scaled hierarchical cluster pretty much puts all observations in cluster 1
    • not sure how I feel about that
  2. The instructor says we can’t really say one of these is better.
  3. This was a confusing example, though. I would naturally expect more evenly spread-out clusters, as with the k-means result
# Apply cutree() to hclust.pokemon: cut.pokemon
cut.pokemon <- cutree(hclust.pokemon, k = 3)

# Compare methods
table(km.pokemon$cluster, cut.pokemon)
##    cut.pokemon
##       1   2   3
##   1 283   0   0
##   2 114   0   0
##   3 114   0   1
##   4 277  11   0

Chapter 3 => Dimensionality reduction with PCA

Introduction to PCA

  1. Two main goals of dimensional reduction
    • find structure in features
    • aid in visualization
  2. PCA has 3 goals
    • find a linear combination of variables to create principal components
    • maintain as much variance in the data as possible
    • principal components are uncorrelated (i.e. orthogonal to each other)
  3. intuition
    • with an x-y scatter plot of correlated data, the best single dimension to explain the variance is a line through the point cloud (similar in spirit to a regression line)
    • this is the first principal component
    • each point's projection onto that line gives its component score (I don't fully understand this part yet, but I get how the line is a simple way to summarize the two-dimensional data and explains most of the variation in it)

Example: principal components with the iris dataset
  • center and scale: for each point, subtract the column mean and divide by the column sd
  • the summary of the model shows the proportion of variance explained by each principal component
  • the rotation matrix gives the loadings, i.e. how much each original variable contributes to each principal component (the component scores themselves are in the $x element)
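
Since the video's example used the iris data while the exercise code below uses the Pokemon data, here is a minimal sketch of the iris version (pr.iris is just an illustrative name):

# Scaled PCA on the four numeric iris measurements (the video's example)
pr.iris <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pr.iris)    # proportion of variance explained by each component
pr.iris$rotation    # loadings of each original variable on each component
head(pr.iris$x)     # component scores for the first few observations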

# Perform scaled PCA: pr.out
pr.out <- prcomp(x = pokemon, scale = TRUE, center = TRUE)
summary(pr.out)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6
## Standard deviation     1.6466 1.0457 0.8825 0.8489 0.65463 0.51681
## Proportion of Variance 0.4519 0.1822 0.1298 0.1201 0.07142 0.04451
## Cumulative Proportion  0.4519 0.6342 0.7640 0.8841 0.95549 1.00000
# Inspect model output
pr.out
## Standard deviations (1, .., p=6):
## [1] 1.6466450 1.0457158 0.8824654 0.8489201 0.6546298 0.5168055
## 
## Rotation (n x k) = (6 x 6):
##                      PC1         PC2         PC3        PC4         PC5
## HitPoints      0.3898858  0.08483455 -0.47192614  0.7176913 -0.21999056
## Attack         0.4392537 -0.01182493 -0.59415339 -0.4058359  0.19025457
## Defense        0.3637473  0.62878867  0.06933913 -0.4192373 -0.05903197
## SpecialAttack  0.4571623 -0.30541446  0.30561186  0.1475166  0.73534497
## SpecialDefense 0.4485704  0.23909670  0.56559403  0.1854448 -0.30019970
## Speed          0.3354405 -0.66846305  0.07851327 -0.2971625 -0.53016082
##                       PC6
## HitPoints       0.2336690
## Attack         -0.5029896
## Defense         0.5368986
## SpecialAttack   0.2045304
## SpecialDefense -0.5451707
## Speed           0.2551400
pr.out$rotation
##                      PC1         PC2         PC3        PC4         PC5
## HitPoints      0.3898858  0.08483455 -0.47192614  0.7176913 -0.21999056
## Attack         0.4392537 -0.01182493 -0.59415339 -0.4058359  0.19025457
## Defense        0.3637473  0.62878867  0.06933913 -0.4192373 -0.05903197
## SpecialAttack  0.4571623 -0.30541446  0.30561186  0.1475166  0.73534497
## SpecialDefense 0.4485704  0.23909670  0.56559403  0.1854448 -0.30019970
## Speed          0.3354405 -0.66846305  0.07851327 -0.2971625 -0.53016082
##                       PC6
## HitPoints       0.2336690
## Attack         -0.5029896
## Defense         0.5368986
## SpecialAttack   0.2045304
## SpecialDefense -0.5451707
## Speed           0.2551400
pr.out$scale
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       25.53467       32.45737       31.18350       32.72229       27.82892 
##          Speed 
##       29.06047
pr.out$center
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       69.25875       79.00125       73.84250       72.82000       71.90250 
##          Speed 
##       68.27750
head(pr.out$x,10)
##              PC1         PC2        PC3         PC4         PC5         PC6
##  [1,] -1.5554017 -0.02146869  0.6660872  0.18406113 0.403554706 -0.30281472
##  [2,] -0.3626397 -0.05023711  0.6674959  0.26908616 0.225647228 -0.19436515
##  [3,]  1.2793512 -0.06268101  0.6235239  0.33118417 0.001544326 -0.06813435
##  [4,]  2.6192779  0.70382271  0.9949151 -0.19919599 0.309976280  0.08732570
##  [5,] -1.7571845 -0.70573748  0.4111965 -0.26843393 0.168771440 -0.06932488
##  [6,] -0.4353607 -0.74735577  0.4059051 -0.04938257 0.061009227  0.13969565
##  [7,]  1.3323690 -0.84380142  0.4459891  0.05328867 0.039156957  0.32218086
##  [8,]  2.6332254 -0.39114750 -0.1265620 -0.87086758 0.718241514  0.30875625
##  [9,]  2.7851485 -1.06001436  1.1565731  0.22853461 0.956385206 -0.26293575
## [10,] -1.6463371  0.47561588  0.5726312 -0.10048350 0.086209248 -0.11271708

Visualizing and interpreting PCA results

  1. biplot
    • shows all of the original observations as points plotted by the first 2 principal components
    • it also shows the original features as vectors mapped onto the first 2 principal components; with the iris data, notice how the petal width and length vectors point in the same direction
    • this means the variables are correlated: one explains about all the variance of the other
    • I found a nicer biplot function, PCbiplot(), on Stack Overflow, made with ggplot2, and modified it to work a little better; I use it in most cases because it's much clearer (a minimal ggplot2 sketch appears after the scree plot code below)
  2. scree plot
    • either shows the proportion of variance explained by each principal component or the cumulative variance explained by successive components
    • all of the variance is explained when the number of components matches the number of original features/variables
    • a couple steps are required to get the variance from the sd for each component for the plot
# Variability of each principal component: pr.var
pr.var <- pr.out$sdev^2

# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)
pve
## [1] 0.45190665 0.18225358 0.12979086 0.12011089 0.07142337 0.04451466
# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
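
Regarding the biplot note above: this is not the PCbiplot() function from Stack Overflow, just a minimal ggplot2 sketch of the same idea for the Pokemon fit pr.out, with an arbitrary scaling factor so the loading arrows are visible against the scores.

# Minimal ggplot2 biplot: PC scores as points, loadings as scaled arrows
scores <- as.data.frame(pr.out$x[, 1:2])
loadings <- as.data.frame(pr.out$rotation[, 1:2])
loadings$feature <- rownames(loadings)
arrow_scale <- 3  # arbitrary, purely for visibility

ggplot(scores, aes(PC1, PC2)) +
  geom_point(alpha = 0.3) +
  geom_segment(data = loadings,
               aes(x = 0, y = 0, xend = PC1 * arrow_scale, yend = PC2 * arrow_scale),
               colour = "red",
               arrow = grid::arrow(length = grid::unit(0.2, "cm"))) +
  geom_text(data = loadings,
            aes(x = PC1 * arrow_scale, y = PC2 * arrow_scale, label = feature),
            colour = "red", vjust = -0.5) +
  labs(title = "PCA biplot of the Pokemon stats (ggplot2 sketch)")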

Practical issues with PCA

  1. Three things need to be considered for a successful PCA:
    1. scaling the data
    2. missing data
      • either drop it or impute it
    3. features that are categories
      • drop them or encode them as numbers
      • this is not covered in this class (much more pre-processing is needed in real-world applications than we are doing here, I believe); a tiny sketch of imputation and encoding follows this list
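
A tiny sketch of points 2 and 3 above on a made-up data frame (df, num, and cat are placeholder names, not course objects): mean imputation for a missing numeric value, and dummy encoding for a categorical column.

# Made-up example: one numeric column with a missing value, one categorical column
df <- data.frame(num = c(1.2, NA, 3.5, 2.8),
                 cat = c("fire", "water", "fire", "grass"))

# Impute the missing numeric value with the column mean
df$num[is.na(df$num)] <- mean(df$num, na.rm = TRUE)

# Encode the categorical column as 0/1 dummy variables
model.matrix(~ cat - 1, data = df)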

Importance of scaling:

  • the mtcars dataset has very different means and sds across its variables
  • look at the results of the biplots with and without scaling
  • without scaling, disp and hp overwhelm the other features simply because they have much higher variance in their units of measure

Sometimes scaling is appropriate when the variances of the variables are substantially different. This is commonly the case when variables have different units of measurement, for example, degrees Fahrenheit (temperature) and miles (distance). Making the decision to use scaling is an important step in performing a principal component analysis.

# Mean of each variable
colMeans(pokemon)
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       69.25875       79.00125       73.84250       72.82000       71.90250 
##          Speed 
##       68.27750
# Standard deviation of each variable
apply(pokemon, 2, sd)
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       25.53467       32.45737       31.18350       32.72229       27.82892 
##          Speed 
##       29.06047
# PCA model with scaling: pr.with.scaling
pr.with.scaling <- prcomp(pokemon, scale = TRUE)

# PCA model without scaling: pr.without.scaling
pr.without.scaling <- prcomp(pokemon, scale = FALSE)

# Create biplots of both for comparison
biplot(pr.with.scaling)

biplot(pr.without.scaling)

Chapter 4 => Putting it all together with a Case Study

Introduction to the case study
  1. Data from paper by Bennett and Mangasarian.
    • “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”
  2. Human breast mass that was or was not malignant
    • 10 features measured for each cell nucleus (the data has 30 columns: the mean, standard error, and worst value of each feature)
    • each feature is a summary statistic of the cells in that mass
    • includes diagnosis (target) - can be used for supervised learning but will not be used during the unsupervised analysis
  3. overall steps
    • download and prepare data
    • EDA
    • perform PCA and interpret results
    • complete two types of clustering
    • understand and compare the two types
    • combine PCA and clustering

Preparing the data

url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1903/datasets/WisconsinCancer.csv"
# Download the data: wisc.df
wisc.df <- read.csv(url)

# Convert the features of the data: wisc.data
wisc.data <- as.matrix(wisc.df[, 3:32])

# Set the row names of wisc.data
row.names(wisc.data) <- wisc.df$id

# Create diagnosis vector
diagnosis <- as.numeric(wisc.df$diagnosis == "M")
head(wisc.data)
##          radius_mean texture_mean perimeter_mean area_mean smoothness_mean
## 842302         17.99        10.38         122.80    1001.0         0.11840
## 842517         20.57        17.77         132.90    1326.0         0.08474
## 84300903       19.69        21.25         130.00    1203.0         0.10960
## 84348301       11.42        20.38          77.58     386.1         0.14250
## 84358402       20.29        14.34         135.10    1297.0         0.10030
## 843786         12.45        15.70          82.57     477.1         0.12780
##          compactness_mean concavity_mean concave.points_mean symmetry_mean
## 842302            0.27760         0.3001             0.14710        0.2419
## 842517            0.07864         0.0869             0.07017        0.1812
## 84300903          0.15990         0.1974             0.12790        0.2069
## 84348301          0.28390         0.2414             0.10520        0.2597
## 84358402          0.13280         0.1980             0.10430        0.1809
## 843786            0.17000         0.1578             0.08089        0.2087
##          fractal_dimension_mean radius_se texture_se perimeter_se area_se
## 842302                  0.07871    1.0950     0.9053        8.589  153.40
## 842517                  0.05667    0.5435     0.7339        3.398   74.08
## 84300903                0.05999    0.7456     0.7869        4.585   94.03
## 84348301                0.09744    0.4956     1.1560        3.445   27.23
## 84358402                0.05883    0.7572     0.7813        5.438   94.44
## 843786                  0.07613    0.3345     0.8902        2.217   27.19
##          smoothness_se compactness_se concavity_se concave.points_se
## 842302        0.006399        0.04904      0.05373           0.01587
## 842517        0.005225        0.01308      0.01860           0.01340
## 84300903      0.006150        0.04006      0.03832           0.02058
## 84348301      0.009110        0.07458      0.05661           0.01867
## 84358402      0.011490        0.02461      0.05688           0.01885
## 843786        0.007510        0.03345      0.03672           0.01137
##          symmetry_se fractal_dimension_se radius_worst texture_worst
## 842302       0.03003             0.006193        25.38         17.33
## 842517       0.01389             0.003532        24.99         23.41
## 84300903     0.02250             0.004571        23.57         25.53
## 84348301     0.05963             0.009208        14.91         26.50
## 84358402     0.01756             0.005115        22.54         16.67
## 843786       0.02165             0.005082        15.47         23.75
##          perimeter_worst area_worst smoothness_worst compactness_worst
## 842302            184.60     2019.0           0.1622            0.6656
## 842517            158.80     1956.0           0.1238            0.1866
## 84300903          152.50     1709.0           0.1444            0.4245
## 84348301           98.87      567.7           0.2098            0.8663
## 84358402          152.20     1575.0           0.1374            0.2050
## 843786            103.40      741.6           0.1791            0.5249
##          concavity_worst concave.points_worst symmetry_worst
## 842302            0.7119               0.2654         0.4601
## 842517            0.2416               0.1860         0.2750
## 84300903          0.4504               0.2430         0.3613
## 84348301          0.6869               0.2575         0.6638
## 84358402          0.4000               0.1625         0.2364
## 843786            0.5355               0.1741         0.3985
##          fractal_dimension_worst
## 842302                   0.11890
## 842517                   0.08902
## 84300903                 0.08758
## 84348301                 0.17300
## 84358402                 0.07678
## 843786                   0.12440

Performing PCA

The next step in your analysis is to perform PCA on wisc.data.

# Check column means and standard deviations
colMeans(wisc.data)
##             radius_mean            texture_mean          perimeter_mean 
##            1.412729e+01            1.928965e+01            9.196903e+01 
##               area_mean         smoothness_mean        compactness_mean 
##            6.548891e+02            9.636028e-02            1.043410e-01 
##          concavity_mean     concave.points_mean           symmetry_mean 
##            8.879932e-02            4.891915e-02            1.811619e-01 
##  fractal_dimension_mean               radius_se              texture_se 
##            6.279761e-02            4.051721e-01            1.216853e+00 
##            perimeter_se                 area_se           smoothness_se 
##            2.866059e+00            4.033708e+01            7.040979e-03 
##          compactness_se            concavity_se       concave.points_se 
##            2.547814e-02            3.189372e-02            1.179614e-02 
##             symmetry_se    fractal_dimension_se            radius_worst 
##            2.054230e-02            3.794904e-03            1.626919e+01 
##           texture_worst         perimeter_worst              area_worst 
##            2.567722e+01            1.072612e+02            8.805831e+02 
##        smoothness_worst       compactness_worst         concavity_worst 
##            1.323686e-01            2.542650e-01            2.721885e-01 
##    concave.points_worst          symmetry_worst fractal_dimension_worst 
##            1.146062e-01            2.900756e-01            8.394582e-02
apply(wisc.data, 2, sd)
##             radius_mean            texture_mean          perimeter_mean 
##            3.524049e+00            4.301036e+00            2.429898e+01 
##               area_mean         smoothness_mean        compactness_mean 
##            3.519141e+02            1.406413e-02            5.281276e-02 
##          concavity_mean     concave.points_mean           symmetry_mean 
##            7.971981e-02            3.880284e-02            2.741428e-02 
##  fractal_dimension_mean               radius_se              texture_se 
##            7.060363e-03            2.773127e-01            5.516484e-01 
##            perimeter_se                 area_se           smoothness_se 
##            2.021855e+00            4.549101e+01            3.002518e-03 
##          compactness_se            concavity_se       concave.points_se 
##            1.790818e-02            3.018606e-02            6.170285e-03 
##             symmetry_se    fractal_dimension_se            radius_worst 
##            8.266372e-03            2.646071e-03            4.833242e+00 
##           texture_worst         perimeter_worst              area_worst 
##            6.146258e+00            3.360254e+01            5.693570e+02 
##        smoothness_worst       compactness_worst         concavity_worst 
##            2.283243e-02            1.573365e-01            2.086243e-01 
##    concave.points_worst          symmetry_worst fractal_dimension_worst 
##            6.573234e-02            6.186747e-02            1.806127e-02
# Execute PCA, scaling if appropriate: wisc.pr
wisc.pr <- prcomp(wisc.data, scale = TRUE, center = TRUE)

# Look at summary of results
summary(wisc.pr)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     3.6444 2.3857 1.67867 1.40735 1.28403 1.09880 0.82172
## Proportion of Variance 0.4427 0.1897 0.09393 0.06602 0.05496 0.04025 0.02251
## Cumulative Proportion  0.4427 0.6324 0.72636 0.79239 0.84734 0.88759 0.91010
##                            PC8    PC9    PC10   PC11    PC12    PC13    PC14
## Standard deviation     0.69037 0.6457 0.59219 0.5421 0.51104 0.49128 0.39624
## Proportion of Variance 0.01589 0.0139 0.01169 0.0098 0.00871 0.00805 0.00523
## Cumulative Proportion  0.92598 0.9399 0.95157 0.9614 0.97007 0.97812 0.98335
##                           PC15    PC16    PC17    PC18    PC19    PC20   PC21
## Standard deviation     0.30681 0.28260 0.24372 0.22939 0.22244 0.17652 0.1731
## Proportion of Variance 0.00314 0.00266 0.00198 0.00175 0.00165 0.00104 0.0010
## Cumulative Proportion  0.98649 0.98915 0.99113 0.99288 0.99453 0.99557 0.9966
##                           PC22    PC23   PC24    PC25    PC26    PC27    PC28
## Standard deviation     0.16565 0.15602 0.1344 0.12442 0.09043 0.08307 0.03987
## Proportion of Variance 0.00091 0.00081 0.0006 0.00052 0.00027 0.00023 0.00005
## Cumulative Proportion  0.99749 0.99830 0.9989 0.99942 0.99969 0.99992 0.99997
##                           PC29    PC30
## Standard deviation     0.02736 0.01153
## Proportion of Variance 0.00002 0.00000
## Cumulative Proportion  1.00000 1.00000

Interpreting PCA results

We can see from the charts that the two diagnosis groups overlap less in the PC1 vs. PC2 plot than in the PC1 vs. PC3 plot. This is expected, since the components are orthogonal and PC2 explains more variance than PC3. The groups overlap more in the PC2 vs. PC3 plot than in either plot involving PC1.

par(mfrow=c(2,2))

# Create a biplot of wisc.pr
biplot(wisc.pr)

# Scatter plot observations by components 1 and 2
plot(wisc.pr$x[, c(1, 2)], col = (diagnosis + 1), 
     xlab = "PC1", ylab = "PC2")
# Repeat for components 1 and 3
plot(wisc.pr$x[, c(1, 3)], col = (diagnosis + 1), 
     xlab = "PC1", ylab = "PC3")

# Do additional data exploration of your choosing below (optional)
plot(wisc.pr$x[, c(2, 3)], col = (diagnosis + 1), 
     xlab = "PC2", ylab = "PC3")

Variance explained

# Set up 1 x 2 plotting grid
par(mfrow = c(1, 2))

# Calculate variability of each component
pr.var <- wisc.pr$sdev^2

# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)

# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component", 
     ylab = "Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component", 
     ylab = "Cumulative Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

PCA Review and next steps

As part of the preparation for hierarchical clustering, distance between all pairs of observations are computed. Furthermore, there are different ways to link clusters together, with single, complete, and average being the most common linkage methods.

wisc.pr$rotation[1:10,1:2]
##                                PC1         PC2
## radius_mean            -0.21890244  0.23385713
## texture_mean           -0.10372458  0.05970609
## perimeter_mean         -0.22753729  0.21518136
## area_mean              -0.22099499  0.23107671
## smoothness_mean        -0.14258969 -0.18611302
## compactness_mean       -0.23928535 -0.15189161
## concavity_mean         -0.25840048 -0.06016536
## concave.points_mean    -0.26085376  0.03476750
## symmetry_mean          -0.13816696 -0.19034877
## fractal_dimension_mean -0.06436335 -0.36657547
# Scale the wisc.data data: data.scaled
data.scaled <- scale(wisc.data)
head(data.scaled)
##          radius_mean texture_mean perimeter_mean  area_mean smoothness_mean
## 842302     1.0960995   -2.0715123      1.2688173  0.9835095       1.5670875
## 842517     1.8282120   -0.3533215      1.6844726  1.9070303      -0.8262354
## 84300903   1.5784992    0.4557859      1.5651260  1.5575132       0.9413821
## 84348301  -0.7682333    0.2535091     -0.5921661 -0.7637917       3.2806668
## 84358402   1.7487579   -1.1508038      1.7750113  1.8246238       0.2801253
## 843786    -0.4759559   -0.8346009     -0.3868077 -0.5052059       2.2354545
##          compactness_mean concavity_mean concave.points_mean symmetry_mean
## 842302          3.2806281     2.65054179           2.5302489   2.215565542
## 842517         -0.4866435    -0.02382489           0.5476623   0.001391139
## 84300903        1.0519999     1.36227979           2.0354398   0.938858720
## 84348301        3.3999174     1.91421287           1.4504311   2.864862154
## 84358402        0.5388663     1.36980615           1.4272370  -0.009552062
## 843786          1.2432416     0.86554001           0.8239307   1.004517928
##          fractal_dimension_mean  radius_se texture_se perimeter_se    area_se
## 842302                2.2537638  2.4875451 -0.5647681    2.8305403  2.4853907
## 842517               -0.8678888  0.4988157 -0.8754733    0.2630955  0.7417493
## 84300903             -0.3976580  1.2275958 -0.7793976    0.8501802  1.1802975
## 84348301              4.9066020  0.3260865 -0.1103120    0.2863415 -0.2881246
## 84358402             -0.5619555  1.2694258 -0.7895490    1.2720701  1.1893103
## 843786                1.8883435 -0.2548461 -0.5921406   -0.3210217 -0.2890039
##          smoothness_se compactness_se concavity_se concave.points_se
## 842302      -0.2138135     1.31570389    0.7233897        0.66023900
## 842517      -0.6048187    -0.69231710   -0.4403926        0.25993335
## 84300903    -0.2967439     0.81425704    0.2128891        1.42357487
## 84348301     0.6890953     2.74186785    0.8187979        1.11402678
## 84358402     1.4817634    -0.04847723    0.8277425        1.14319885
## 843786       0.1562093     0.44515196    0.1598845       -0.06906279
##          symmetry_se fractal_dimension_se radius_worst texture_worst
## 842302     1.1477468           0.90628565    1.8850310   -1.35809849
## 842517    -0.8047423          -0.09935632    1.8043398   -0.36887865
## 84300903   0.2368272           0.29330133    1.5105411   -0.02395331
## 84348301   4.7285198           2.04571087   -0.2812170    0.13386631
## 84358402  -0.3607748           0.49888916    1.2974336   -1.46548091
## 843786     0.1340009           0.48641784   -0.1653528   -0.31356043
##          perimeter_worst area_worst smoothness_worst compactness_worst
## 842302         2.3015755  1.9994782        1.3065367         2.6143647
## 842517         1.5337764  1.8888270       -0.3752817        -0.4300658
## 84300903       1.3462906  1.4550043        0.5269438         1.0819801
## 84348301      -0.2497196 -0.5495377        3.3912907         3.8899747
## 84358402       1.3373627  1.2196511        0.2203623        -0.3131190
## 843786        -0.1149083 -0.2441054        2.0467119         1.7201029
##          concavity_worst concave.points_worst symmetry_worst
## 842302         2.1076718            2.2940576      2.7482041
## 842517        -0.1466200            1.0861286     -0.2436753
## 84300903       0.8542223            1.9532817      1.1512420
## 84348301       1.9878392            2.1738732      6.0407261
## 84358402       0.6126397            0.7286181     -0.8675896
## 843786         1.2621327            0.9050914      1.7525273
##          fractal_dimension_worst
## 842302                 1.9353117
## 842517                 0.2809428
## 84300903               0.2012142
## 84348301               4.9306719
## 84358402              -0.3967505
## 843786                 2.2398308
# Calculate the (Euclidean) distances: data.dist
data.dist <- dist(data.scaled)

# Create a hierarchical clustering model: wisc.hclust
wisc.hclust <- hclust(data.dist, method = "complete")
plot(wisc.hclust, col = "blue")

Selecting Number of Clusters

# Cut tree so that it has 4 clusters: wisc.hclust.clusters
wisc.hclust.clusters <- cutree(wisc.hclust, k = 4)

# Compare cluster membership to actual diagnoses
table(wisc.hclust.clusters, diagnosis)
##                     diagnosis
## wisc.hclust.clusters   0   1
##                    1  12 165
##                    2   2   5
##                    3 343  40
##                    4   0   2
# count out of place observations based on cluster 
# basically just summing the row mins here
sum(apply(table(wisc.hclust.clusters, diagnosis), 1, min)) 
## [1] 54
# Looks like 54 observations whose diagnosis disagrees with the majority diagnosis of their cluster

k-means Clustering and Comparing Results

As you now know, there are two main types of clustering: hierarchical and k-means.

# Create a k-means model on wisc.data: wisc.km
wisc.km <- kmeans(scale(wisc.data), centers = 2, nstart = 20)

# Compare k-means to actual diagnoses
table(wisc.km$cluster, diagnosis)
##    diagnosis
##       0   1
##   1  14 175
##   2 343  37
sum(apply(table(wisc.km$cluster, diagnosis), 1, min))
## [1] 51
# Compare k-means to hierarchical clustering
table(wisc.hclust.clusters, wisc.km$cluster)
##                     
## wisc.hclust.clusters   1   2
##                    1 160  17
##                    2   7   0
##                    3  20 363
##                    4   2   0
sum(apply(table(wisc.hclust.clusters, wisc.km$cluster), 1, min))
## [1] 37

Clustering on PCA Results

Recall from earlier exercises that the PCA model required significantly fewer features to describe 80% and 95% of the variability of the data. In addition to normalizing the data and potentially avoiding overfitting, PCA also decorrelates the variables, sometimes improving the performance of other modeling techniques.
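
A quick check of the decorrelation claim on the fitted PCA object:

# Correlations between the first few principal component scores are essentially zero
round(cor(wisc.pr$x[, 1:5]), 3)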

# Create a hierarchical clustering model: wisc.pr.hclust
wisc.pr.hclust <- hclust(dist(wisc.pr$x[, 1:7]), method = "complete")

# Cut model into 4 clusters: wisc.pr.hclust.clusters
wisc.pr.hclust.clusters <- cutree(wisc.pr.hclust, k = 4)

# Compare to actual diagnoses
t <- table(diagnosis, wisc.pr.hclust.clusters)
t
##          wisc.pr.hclust.clusters
## diagnosis   1   2   3   4
##         0   5 350   2   0
##         1 113  97   0   2
sum(apply(t, 1, min))
## [1] 0
# Compare to k-means and hierarchical
t <- table(diagnosis, wisc.hclust.clusters)
t
##          wisc.hclust.clusters
## diagnosis   1   2   3   4
##         0  12   2 343   0
##         1 165   5  40   2
sum(apply(t, 1, min))
## [1] 2
t <- table(diagnosis, wisc.km$cluster)
t
##          
## diagnosis   1   2
##         0  14 343
##         1 175  37
sum(apply(t, 1, min))
## [1] 51
  1. It looks like the 2 cluster k-means does the best job
  2. The whole purpose of this is to see if the results of clustering could be useful in a supervised learning process.

I think it might be worth adding the k-means clusters as a feature in a model. Maybe. I guess I could just try it with and without and see which is best at predicting, now that I know how to do that (a rough sketch of the idea follows). : )
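
A rough sketch of that idea, reusing objects built earlier in this chapter (diagnosis, wisc.pr, wisc.km). The data frame and model names here (wisc.model.data, fit.pcs.only, fit.pcs.cluster) are my own, and this is only an in-sample comparison; a real test would use a holdout set or cross-validation.

# Does adding the k-means cluster as a feature help a simple logistic regression
# on the first 7 principal components? (glm may warn about fitted probabilities
# near 0 or 1 because the classes are nearly separable in this space.)
wisc.model.data <- data.frame(diagnosis = diagnosis,
                              wisc.pr$x[, 1:7],
                              cluster = factor(wisc.km$cluster))

fit.pcs.only    <- glm(diagnosis ~ . - cluster, data = wisc.model.data, family = binomial)
fit.pcs.cluster <- glm(diagnosis ~ .,           data = wisc.model.data, family = binomial)

# Compare in-sample AIC (lower is better)
AIC(fit.pcs.only, fit.pcs.cluster)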

Conclusion

  1. This was a VERY high level overview on the topics of hierarchical and k-means clustering and PCA
  2. I think it covers some good concepts like model selection, interpretation, and scaling data
  3. I did get a little intuition on how the algorithms work
  4. I don’t feel like I know these topics well or that I am ready to base business decisions on my knowledge of these modeling techniques.
  5. It was a good way to get started though and I am now definitely ready to dig into these techniques more.