---
title: "Unsupervised Learning in R: Wisconsin Cancer"
output:
  html_notebook:
    toc: true
    toc_float: true
    toc_collapsed: false
    number_sections: true
    toc_depth: 3
---
# Introduction

## Preparing the data

We can read the data into a data frame with the read.csv() function and assign the result to wisc.df.
```{r}
url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_1903/datasets/WisconsinCancer.csv"

# Download the data: wisc.df
wisc.df <- read.csv(url)
```
Let's check the first rows of the data frame:
```{r}
head(wisc.df)
```
We can use as.matrix() to convert the features of the data (in columns 3 through 32) to a matrix.
```{r}
# Convert the features of the data: wisc.data
wisc.data <- as.matrix(wisc.df[3:32])
```
We can set the row names of wisc.data to the values currently stored in the id column of wisc.df. This will help us keep track of the observations throughout the modeling process.
```{r}
# Set the row names of wisc.data
row.names(wisc.data) <- wisc.df$id
```

We can also create a vector, diagnosis, that is 1 if a diagnosis is malignant ("M") and 0 otherwise. Note that R coerces TRUE to 1 and FALSE to 0.
```{r}
# Create diagnosis vector
diagnosis <- as.numeric(wisc.df$diagnosis == "M")
```
How many observations are in this dataset?
```{r}
nrow(wisc.data)
```
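There are 569 observations.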

How many variables/features in the data are suffixed with _mean?
```{r}
length(grep(pattern = "_mean", x = colnames(wisc.data)))
```
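Ten variables are suffixed with _mean.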

How many of the observations have a malignant diagnosis?

```{r}
length(which(diagnosis==1))
```
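212 of the observations have a malignant diagnosis.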
## Performing PCA

The next step is to perform PCA on wisc.data.

It's important to check if the data need to be scaled before performing PCA. Two common reasons for scaling data:

1. The input variables use different units of measurement.
2. The input variables have significantly different variances.
```{r}
# Check column means and standard deviations
colMeans(wisc.data)
apply(wisc.data, 2, sd)
```
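The means and standard deviations span several orders of magnitude (for example, area_mean has a mean of about 655 while smoothness_mean has a mean of about 0.096), so the data should be scaled before performing PCA.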
```{r}
# Execute PCA, scaling if appropriate: wisc.pr
wisc.pr <- prcomp(x = wisc.data, scale. = TRUE)

# Look at summary of results
summary(wisc.pr)
```
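From the cumulative proportions, the first five components explain about 84.7% of the total variance (the first four reach only 79.2%), and the first seven explain about 91.0%.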
## Interpreting PCA results

Now we'll use some visualizations to better understand our PCA model.

We'll run into some common challenges with using biplots on real-world data containing a non-trivial number of observations and variables, then we'll look at some alternative visualizations.
```{r}
# Create a biplot of wisc.pr
biplot(wisc.pr)
```
```{r}
# Scatter plot observations by components 1 and 2
plot(wisc.pr$x[, c(1, 2)], col = (diagnosis + 1), 
     xlab = "PC1", ylab = "PC2")

# Repeat for components 1 and 3
plot(wisc.pr$x[, c(1, 3)], col = (diagnosis + 1), 
     xlab = "PC1", ylab = "PC3")
```
Because principal component 2 explains more variance in the original data than principal component 3, we can see that the first plot has a cleaner cut separating the two subgroups.
```{r}
# Repeat for components 1 and 4
plot(wisc.pr$x[, c(1, 4)], col = (diagnosis + 1), 
     xlab = "PC1", ylab = "PC4")
```
## Variance explained

We will produce scree plots showing the proportion of variance explained as the number of principal components increases. The data from PCA must be prepared for these plots, as there is not a built-in function in R to create them directly from the PCA model.

As we look at these plots, we are asking if there's an elbow in the amount of variance explained that might lead you to pick a natural number of principal components. If an obvious elbow does not exist, as is typical in real-world datasets, consider how else you might determine the number of principal components to retain based on the scree plot.
```{r}
# Set up 1 x 2 plotting grid
par(mfrow = c(1, 2))

# Calculate variability of each component
pr.var <- wisc.pr$sdev^2

# Variance explained by each principal component: pve
pve <- pr.var / sum(pr.var)

# Plot variance explained for each principal component
plot(pve, xlab = "Principal Component", 
     ylab = "Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")

# Plot cumulative proportion of variance explained
plot(cumsum(pve), xlab = "Principal Component", 
     ylab = "Cumulative Proportion of Variance Explained", 
     ylim = c(0, 1), type = "b")
```
What is the minimum number of principal components needed to explain 80% of the variance in the data? 5 (the cumulative proportion first exceeds 0.80 at PC5, with 0.847; PC4 reaches only 0.792).

# PCA review and next steps

## Communicating PCA results

The loadings, represented as vectors, explain the mapping from the original features to the principal components. The principal components are naturally ordered from the most variance explained to the least variance explained.

For the first principal component, what is the component of the loading vector for the feature concave.points_mean? 

concave.points_mean     -0.26085376
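We can read this value directly from the rotation matrix returned by prcomp(). Note that the sign of a loading is arbitrary: flipping the sign of an entire component gives an equivalent solution.

```{r}
# Loading of concave.points_mean on the first principal component
wisc.pr$rotation["concave.points_mean", 1]
```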

What is the minimum number of principal components required to explain 80% of the variance of the data? 5

## Hierarchical clustering of case data

The goal of this exercise is to do hierarchical clustering of the observations. This type of clustering does not assume in advance the number of natural groups that exist in the data.

As part of the preparation for hierarchical clustering, the distances between all pairs of observations are computed. Furthermore, there are different ways to link clusters together, with single, complete, and average being the most common linkage methods.

```{r}
# Scale the wisc.data data: data.scaled
data.scaled <- scale(wisc.data)


# Calculate the (Euclidean) distances: data.dist
data.dist <- dist(data.scaled)

# Create a hierarchical clustering model: wisc.hclust
wisc.hclust <- hclust(data.dist, method = "complete")
```
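For comparison, the same distance matrix can be linked with the other common methods. This is an optional sketch; the rest of the analysis keeps the complete-linkage model above.

```{r}
# Alternative linkage methods on the same distances (for exploration only)
wisc.hclust.single  <- hclust(data.dist, method = "single")
wisc.hclust.average <- hclust(data.dist, method = "average")
```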
Let's use the hierarchical clustering model to determine a height (or distance between clusters) where a certain number of clusters exists.

Using the plot() function, what is the height at which the clustering model has 4 clusters?
```{r}
plot(wisc.hclust)
```
A height of about 20 yields four clusters.
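We can draw a horizontal line at that height to confirm it crosses the dendrogram where the tree splits into four clusters:

```{r}
# Re-plot the dendrogram with a dashed cut line at height 20
plot(wisc.hclust)
abline(h = 20, col = "red", lty = 2)
```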

## Selecting number of clusters

We will compare the outputs from our hierarchical clustering model to the actual diagnoses. Normally, when performing unsupervised learning like this, a target variable isn't available. We do have it with this dataset, however, so it can be used to check the performance of the clustering model.

When performing supervised learning (that is, when we're trying to predict a target variable of interest that is available in the original data), using clustering to create new features may or may not improve the performance of the final model. This exercise will help us determine whether, in this case, hierarchical clustering provides a promising new feature.

```{r}
# Cut tree so that it has 4 clusters: wisc.hclust.clusters
wisc.hclust.clusters <- cutree(wisc.hclust, k = 4)

# Compare cluster membership to actual diagnoses
table(wisc.hclust.clusters, diagnosis)
```
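Cluster 1 is predominantly malignant (165 of its 177 observations), while cluster 3 is predominantly benign (343 of 383).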
Four clusters were picked after some exploration. We may want to explore how different numbers of clusters affect the ability of the hierarchical clustering to separate the different diagnoses.

## k-means clustering and comparing results

There are two main types of clustering: hierarchical and k-means.

We will create a k-means clustering model on the Wisconsin breast cancer data and compare the results to the actual diagnoses and to our hierarchical clustering model. Take some time to see how each clustering model performs in terms of separating the two diagnoses and how the clustering models compare to each other.
```{r}
# Create a k-means model on wisc.data: wisc.km
wisc.km <- kmeans(scale(wisc.data), centers = 2, nstart = 20)

# Compare k-means to actual diagnoses
table(wisc.km$cluster, diagnosis)

# Compare k-means to hierarchical clustering
table(wisc.km$cluster, wisc.hclust.clusters)
```
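The first table shows that k-means separates the diagnoses about as well as hierarchical clustering: cluster 1 is mostly malignant (175 of 189 observations) and cluster 2 mostly benign (343 of 380).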
Looking at the second table we generated, it looks like clusters 1, 2, and 4 from the hierarchical clustering model can be interpreted as the cluster 1 equivalent from the k-means algorithm, and cluster 3 can be interpreted as the cluster 2 equivalent.

## Clustering on PCA results

We will put together several steps we used earlier and, in doing so, experience some of the creativity that is typical in unsupervised learning.

The PCA model required significantly fewer features to describe 80% and 95% of the variability of the data. In addition to normalizing the data and potentially avoiding overfitting, PCA also decorrelates the variables, sometimes improving the performance of other modeling techniques.
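We can check the decorrelation directly: the correlations between the principal-component score vectors are zero up to floating-point error.

```{r}
# Correlations between the first three principal components (~ identity matrix)
round(cor(wisc.pr$x[, 1:3]), 3)
```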

Let's see if PCA improves or degrades the performance of hierarchical clustering.

Using the minimum number of principal components required to describe at least 90% of the variability in the data, we can create a hierarchical clustering model with complete linkage, assigning the result to wisc.pr.hclust.
```{r}
# Create a hierarchical clustering model: wisc.pr.hclust
wisc.pr.hclust <- hclust(dist(wisc.pr$x[, 1:7]), method = "complete")
```
The minimum number of principal components required to describe at least 90% of the variability of the data can be found by calling summary() on the PCA model wisc.pr.
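Equivalently, we can find it programmatically using the pve vector computed earlier:

```{r}
# First component at which the cumulative variance explained reaches 90%
which(cumsum(pve) >= 0.9)[1]
```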
```{r}
# Cut model into 4 clusters: wisc.pr.hclust.clusters
wisc.pr.hclust.clusters <- cutree(wisc.pr.hclust, k = 4)
# Compare to actual diagnoses
table(diagnosis, wisc.pr.hclust.clusters)

# Compare to k-means and hierarchical
table(diagnosis, wisc.hclust.clusters)
table(diagnosis, wisc.km$cluster)
```
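In this four-cluster cut of the PCA-based model, cluster 1 is almost entirely malignant (113 of its 118 observations), but cluster 2 mixes 97 malignant cases with 350 benign ones, so clustering on the first seven principal components does not separate the diagnoses more cleanly than clustering on the full scaled data.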

