Unsupervised Learning with PCA

Principal Component Analysis Using R.

Michael Foley

2019-04-23

These notes are primarily taken from the Unsupervised Learning in R DataCamp course.

Background: Unsupervised vs Supervised Machine Learning

Unsupervised machine learning searches for structure in unlabeled data (data without a response variable). The goal of unsupervised learning is to find homogeneous subgroups (clusters) and to find patterns (usually through dimensionality reduction). Examples of unsupervised machine learning include k-means clustering, hierarchical cluster analysis (HCA), and principal component analysis (PCA). Unsupervised machine learning often has no single goal beyond gaining insight.

Supervised machine learning makes predictions with labeled data. It uses regression for quantitative outcomes and classification for qualitative outcomes; examples include decision trees, random forests, and lasso regression. Reinforcement machine learning is a third type of learning in which the machine learns by operating in an environment.

PCA

Dimensionality reduction is a group of methods for finding structure in data and aiding visualization. One dimensionality reduction method is principal component analysis (PCA). PCA finds linear combinations of the features, called principal components, that retain as much of the variance in the original data as possible. The principal components are uncorrelated with each other (orthogonal).

The prcomp(df, scale = FALSE, center = TRUE) function performs a PCA on the data frame (the formal argument name is scale., but the abbreviation scale partial-matches). As with other unsupervised machine learning algorithms, you must first remove observations with missing values or impute them. If the features are on different scales, standardize them to mean zero and unit variance with scale = TRUE. The center parameter shifts the variables to be zero-centered and is typically left at TRUE.
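For instance, a minimal sketch of this pattern using R's built-in mtcars data (chosen here only for illustration; the Pokemon example below follows the same steps):

# Drop any observations with missing values, then run a scaled, centered PCA
df <- na.omit(mtcars)
pr <- prcomp(df, scale = TRUE, center = TRUE)
summary(pr)  # proportion of variance explained by each component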

Example

The pokemon dataset contains observations of 800 Pokemon on six dimensions (more information on the dataset is at https://www.kaggle.com/abcsds/pokemon). The data is unlabeled, meaning there is no response variable, just features. The features here are six Pokemon ability measures.

library(readr)

pokemon <- read_csv(url("https://assets.datacamp.com/production/course_1815/datasets/Pokemon.csv"))
# Drop columns that are not ability measures; Name is kept to label observations
#pokemon$Name <- NULL
pokemon$Number <- NULL
pokemon$Type1 <- NULL
pokemon$Type2 <- NULL
pokemon$Total <- NULL
pokemon$Generation <- NULL
pokemon$Legendary <- NULL

head(pokemon)
## # A tibble: 6 x 7
##   Name         HitPoints Attack Defense SpecialAttack SpecialDefense Speed
##   <chr>            <int>  <int>   <int>         <int>          <int> <int>
## 1 Bulbasaur           45     49      49            65             65    45
## 2 Ivysaur             60     62      63            80             80    60
## 3 Venusaur            80     82      83           100            100    80
## 4 VenusaurMeg~        80    100     123           122            120    80
## 5 Charmander          39     52      43            60             50    65
## 6 Charmeleon          58     64      58            80             65    80

Before conducting PCA, check whether any preprocessing is required (the checks are sketched in code below):

- Are there any NAs? If so, drop those observations or impute values.
- Are all of the features on comparable scales? If not, standardize the variables in the model with scale = TRUE.
- Are any features categorical (multinomial)? If so, recode them as binary indicator variables.

In this case, the means and standard deviations are similar, but I am scaling anyway for the exercise.
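These checks might look like the following (a sketch; pokemon[-1] drops the Name column, as in the model fit below):

colSums(is.na(pokemon))    # any missing values per column?
sapply(pokemon[-1], mean)  # are the feature means comparable?
sapply(pokemon[-1], sd)    # are the feature standard deviations comparable?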

pr.out <- prcomp(x = pokemon[-1], scale = TRUE, center = TRUE)
summary(pr.out)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6
## Standard deviation     1.6466 1.0457 0.8825 0.8489 0.65463 0.51681
## Proportion of Variance 0.4519 0.1822 0.1298 0.1201 0.07142 0.04451
## Cumulative Proportion  0.4519 0.6342 0.7640 0.8841 0.95549 1.00000

PCA models in R produce additional diagnostic and output components (the projection sketch after the output below shows how they work together):

- center: the column means used to center the data, or FALSE if the data weren't centered
- scale: the column standard deviations used to scale the data, or FALSE if the data weren't scaled
- rotation: the directions of the principal component vectors in terms of the original features/variables; this lets you express new data in terms of the principal components
- x: the values of each observation in the original dataset projected onto the principal components

pr.out$center
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       69.25875       79.00125       73.84250       72.82000       71.90250 
##          Speed 
##       68.27750
pr.out$scale
##      HitPoints         Attack        Defense  SpecialAttack SpecialDefense 
##       25.53467       32.45737       31.18350       32.72229       27.82892 
##          Speed 
##       29.06047
pr.out$rotation
##                      PC1         PC2         PC3        PC4         PC5
## HitPoints      0.3898858  0.08483455 -0.47192614  0.7176913 -0.21999056
## Attack         0.4392537 -0.01182493 -0.59415339 -0.4058359  0.19025457
## Defense        0.3637473  0.62878867  0.06933913 -0.4192373 -0.05903197
## SpecialAttack  0.4571623 -0.30541446  0.30561186  0.1475166  0.73534497
## SpecialDefense 0.4485704  0.23909670  0.56559403  0.1854448 -0.30019970
## Speed          0.3354405 -0.66846305  0.07851327 -0.2971625 -0.53016082
##                       PC6
## HitPoints       0.2336690
## Attack         -0.5029896
## Defense         0.5368986
## SpecialAttack   0.2045304
## SpecialDefense -0.5451707
## Speed           0.2551400
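
Because center, scale, and rotation are stored on the model object, you can project new observations onto the same components. A minimal sketch (new_pokemon is a hypothetical observation, not part of the dataset):

# Hypothetical new observation with the same six ability measures
new_pokemon <- data.frame(HitPoints = 50, Attack = 60, Defense = 55,
                          SpecialAttack = 70, SpecialDefense = 65, Speed = 90)

# predict() centers, scales, and rotates using the stored model components
predict(pr.out, newdata = new_pokemon)

# Equivalent manual projection: standardize, then multiply by the rotation matrix
scale(new_pokemon, center = pr.out$center, scale = pr.out$scale) %*% pr.out$rotation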

The biplot() function plots both the principal component loadings and the mapping of the observations onto their first two principal components. In the plot below, Attack and HitPoints have approximately the same loadings on the first two principal components.

biplot(pr.out)

The second common plot type for understanding PCA models is a scree plot. A scree plot shows the variance explained as the number of principal components increases. Sometimes the cumulative variance explained is plotted as well.

# Get proportion of variance for scree plot
pr.var <- pr.out$sdev^2
pve <- pr.var / sum(pr.var)
plot(pve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

plot(cumsum(pve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
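
The cumulative curve also suggests a simple rule for choosing how many components to keep. For example, assuming a 90% variance threshold (the threshold itself is a judgment call):

# Smallest number of components that explain at least 90% of the variance
which(cumsum(pve) >= 0.90)[1]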