These notes are primarily taken from the Unsupervised Learning in R DataCamp course.
Unsupervised machine learning searches for structure in unlabeled data (data without a response variable). The goals of unsupervised learning are to find homogeneous subgroups (clusters) and to find patterns, usually through dimensionality reduction. Examples of unsupervised machine learning include k-means clustering, hierarchical cluster analysis (HCA), and principal component analysis (PCA). Unsupervised machine learning often has no single goal beyond gaining insight into the data.
Supervised machine learning makes predictions from labeled data: regression for quantitative outcomes and classification for qualitative outcomes. Examples of supervised machine learning include decision trees, random forests, and lasso regression. Reinforcement machine learning is a third type of learning, in which the machine learns by operating in an environment.
Dimensionality reduction is a family of methods for finding structure in data and aiding visualization. One dimensionality reduction method is principal component analysis (PCA). PCA finds linear combinations of the original features, called principal components, that retain as much of the variance in the original data as possible. The principal components are uncorrelated with one another (orthogonal).
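As a quick illustration of the orthogonality property, here is a minimal sketch using R's built-in `USArrests` data (not part of the course material): the loading vectors returned by `prcomp()` have unit length and are mutually orthogonal, so their cross-product is the identity matrix.

pca <- prcomp(USArrests, scale = TRUE)
# t(rotation) %*% rotation is (numerically) the identity matrix
round(crossprod(pca$rotation), 10)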
The `prcomp(df, scale = FALSE, center = TRUE)` function performs a PCA on the data frame. As with other unsupervised machine learning algorithms, you must first remove observations with missing values or impute them. If the features are on different scales, standardize them to mean zero and unit variance with `scale = TRUE`. The `center` parameter shifts the variables to be zero-centered and is typically left at `TRUE`.
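For example, here is a minimal sketch of the two common ways to handle missing values before calling `prcomp()` (a toy data frame; the pokemon data used below happen to be complete):

# Toy data frame with a missing value; prcomp() fails on NAs.
df <- data.frame(a = c(1, 2, NA, 4), b = c(10, 20, 30, 40))

# Option 1: drop incomplete observations.
df_complete <- na.omit(df)

# Option 2: impute each NA with its column mean.
df_imputed <- as.data.frame(lapply(df, function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}))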
The `pokemon` dataset contains observations of 800 Pokemon on six dimensions (more information on the dataset is available at https://www.kaggle.com/abcsds/pokemon). The data are unlabeled, meaning there is no response variable, just features. The features here are six Pokemon ability measures.
library(readr)

# Load the Pokemon data from the DataCamp course assets.
pokemon <- read_csv(url("https://assets.datacamp.com/production/course_1815/datasets/Pokemon.csv"))

# Keep Name plus the six ability measures; drop everything else.
#pokemon$Name <- NULL
pokemon$Number <- NULL
pokemon$Type1 <- NULL
pokemon$Type2 <- NULL
pokemon$Total <- NULL
pokemon$Generation <- NULL
pokemon$Legendary <- NULL

head(pokemon)
## # A tibble: 6 x 7
##   Name         HitPoints Attack Defense SpecialAttack SpecialDefense Speed
##   <chr>            <int>  <int>   <int>         <int>          <int> <int>
## 1 Bulbasaur           45     49      49            65             65    45
## 2 Ivysaur             60     62      63            80             80    60
## 3 Venusaur            80     82      83           100            100    80
## 4 VenusaurMeg~        80    100     123           122            120    80
## 5 Charmander          39     52      43            60             50    65
## 6 Charmeleon          58     64      58            80             65    80
Before conducting PCA, check whether any preprocessing is required: Are there any NAs? If so, drop those observations or impute values. Are all of the features on comparable scales? If not, standardize the variables with `scale = TRUE`. Are any features multinomial? If so, recode them as binary indicator variables. In this case, the means and standard deviations are similar, but I am scaling anyway for the exercise.
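A quick sketch of those checks on the pokemon data (using the column names as loaded above):

# Any missing values?
colSums(is.na(pokemon))

# Are the features on comparable scales? Compare means and standard deviations.
sapply(pokemon[-1], mean)
sapply(pokemon[-1], sd)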
# PCA on the six ability measures; pokemon[-1] drops the Name column.
pr.out <- prcomp(x = pokemon[-1], scale = TRUE, center = TRUE)
summary(pr.out)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6
## Standard deviation     1.6466 1.0457 0.8825 0.8489 0.65463 0.51681
## Proportion of Variance 0.4519 0.1822 0.1298 0.1201 0.07142 0.04451
## Cumulative Proportion  0.4519 0.6342 0.7640 0.8841 0.95549 1.00000
PCA models in R produce additional diagnostic and output components:

- `center`: the column means used to center the data, or `FALSE` if the data weren't centered
- `scale`: the column standard deviations used to scale the data, or `FALSE` if the data weren't scaled
- `rotation`: the directions of the principal component vectors in terms of the original features/variables. This information allows you to express new data in terms of the original principal components
- `x`: the value of each observation in the original dataset projected onto the principal components

A short sketch after the output below shows how these components fit together.
pr.out$center
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense
##       69.25875       79.00125       73.84250       72.82000       71.90250
##          Speed
##       68.27750
pr.out$scale
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense
##       25.53467       32.45737       31.18350       32.72229       27.82892
##          Speed
##       29.06047
pr.out$rotation
##                      PC1         PC2         PC3        PC4         PC5
## HitPoints      0.3898858  0.08483455 -0.47192614  0.7176913 -0.21999056
## Attack         0.4392537 -0.01182493 -0.59415339 -0.4058359  0.19025457
## Defense        0.3637473  0.62878867  0.06933913 -0.4192373 -0.05903197
## SpecialAttack  0.4571623 -0.30541446  0.30561186  0.1475166  0.73534497
## SpecialDefense 0.4485704  0.23909670  0.56559403  0.1854448 -0.30019970
## Speed          0.3354405 -0.66846305  0.07851327 -0.2971625 -0.53016082
##                       PC6
## HitPoints       0.2336690
## Attack         -0.5029896
## Defense         0.5368986
## SpecialAttack   0.2045304
## SpecialDefense -0.5451707
## Speed           0.2551400
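Tying these components together (a sketch, not course code): `x` equals the centered and scaled data multiplied by `rotation`, and `predict()` uses `center`, `scale`, and `rotation` to project new observations onto the same components.

# Reproduce pr.out$x by hand for the first observation.
manual <- scale(pokemon[-1], center = pr.out$center,
                scale = pr.out$scale) %*% pr.out$rotation
all.equal(manual[1, ], pr.out$x[1, ])

# Project new observations (here, the first three rows) with predict().
predict(pr.out, newdata = pokemon[1:3, -1])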
The `biplot()` function plots both the principal component loadings and the mapping of the observations onto their first two principal component values. In the plot below, `Attack` and `HitPoints` have approximately the same loadings on the first two principal components.
biplot(pr.out)
The second common plot type for understanding PCA models is a scree plot. A scree plot shows the proportion of variance explained by each successive principal component; sometimes the cumulative proportion of variance explained is plotted as well.
# Get proportion of variance for scree plot
pr.var <- pr.out$sdev^2
pve <- pr.var / sum(pr.var)

# Scree plot: variance explained by each component
plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

# Cumulative variance explained
plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
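A common way to read the cumulative plot is to pick the smallest number of components that passes a variance threshold. A quick sketch with an arbitrary 90% cutoff (here that is 5 components, per the cumulative proportions in the summary above):

# Smallest number of components explaining at least 90% of the variance
which(cumsum(pve) >= 0.90)[1]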