Complete all Exercises, and submit answers to VtopBeta

Introduction

Principal component analysis (PCA) is a method of extracting the important variables, in the form of components, from a large set of variables available in a data set. It derives a low-dimensional set of features from a high-dimensional data set while capturing as much of the information (variance) as possible.

Note: This exercise uses the prcomp function from the stats package. I will also show how to visualize the PCA results in R using base R graphics.

Datasets

Iris dataset (first five rows)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         5.1         3.5          1.4         0.2  setosa
         4.9         3.0          1.4         0.2  setosa
         4.7         3.2          1.3         0.2  setosa
         4.6         3.1          1.5         0.2  setosa
         5.0         3.6          1.4         0.2  setosa

We will apply PCA to the four continuous variables and use the categorical variable to visualize the PCs later.

Notice that the following code applies a log transformation to the continuous variables:

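# load ggplot2 (used later to style the biplot)
library(ggplot2)
# log transform the four continuous variables
log.ir <- log(iris[, 1:4])
# keep the species labels for the plots
ir.species <- iris[, 5]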

Computing the Principal Components (PCs)

Set both the center and scale. arguments to TRUE in the call to prcomp to standardize the variables before applying PCA:

# apply PCA - scale. = TRUE is highly 
# advisable, but default is FALSE. 
ir.pca <- prcomp(log.ir,
                 center = TRUE,
                 scale. = TRUE) 
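
As a rough cross-check of what the call above computes, and assuming the standardized setup shown there, the PC variances should match the eigenvalues of the correlation matrix of the log-transformed variables. The name eig is introduced only for this illustration.

```r
# Rebuild the objects from the text so this sketch runs on its own
log.ir <- log(iris[, 1:4])
ir.pca <- prcomp(log.ir, center = TRUE, scale. = TRUE)

# Eigen decomposition of the correlation matrix
# (eig is a name chosen for this sketch)
eig <- eigen(cor(log.ir))

# The PC variances (sdev^2) should equal the eigenvalues
all.equal(ir.pca$sdev^2, eig$values)

# The loadings should agree with the eigenvectors
# up to the sign of each column
all.equal(abs(unname(ir.pca$rotation)), abs(eig$vectors))
```

The sign of a component is arbitrary, which is why the comparison of loadings uses absolute values.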

Analyzing the results

The print method returns the standard deviation of each of the four PCs, and their rotation (or loadings), which are the coefficients of the linear combinations of the continuous variables.

# print method
print(ir.pca)
## Standard deviations (1, .., p=4):
## [1] 1.7124583 0.9523797 0.3647029 0.1656840
## 
## Rotation (n x k) = (4 x 4):
##                     PC1         PC2        PC3         PC4
## Sepal.Length  0.5038236 -0.45499872  0.7088547  0.19147575
## Sepal.Width  -0.3023682 -0.88914419 -0.3311628 -0.09125405
## Petal.Length  0.5767881 -0.03378802 -0.2192793 -0.78618732
## Petal.Width   0.5674952 -0.03545628 -0.5829003  0.58044745
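
To make the phrase "coefficients of the linear combinations" concrete, the following sketch rebuilds the PC scores by hand from the standardized data and the rotation matrix. The names z and scores are chosen for this illustration only.

```r
# Rebuild the objects from the text so this sketch runs on its own
log.ir <- log(iris[, 1:4])
ir.pca <- prcomp(log.ir, center = TRUE, scale. = TRUE)

# Standardize each variable (mean 0, standard deviation 1),
# matching center = TRUE and scale. = TRUE
z <- scale(log.ir)

# Each PC score is a linear combination of the standardized variables,
# weighted by the corresponding column of the rotation matrix
scores <- z %*% ir.pca$rotation

# Compare with the scores stored by prcomp in ir.pca$x
all.equal(scores, ir.pca$x, check.attributes = FALSE)
```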

The plot method draws the variances (y-axis) associated with the PCs (x-axis). The resulting figure is useful for deciding how many PCs to retain for further analysis. In this simple case with only 4 PCs this is not a hard task, and we can see that the first two PCs explain most of the variability in the data.

# plot method
plot(ir.pca, type = "l")

# summary method: standard deviation, proportion of variance
# and cumulative proportion of variance for each PC
summary(ir.pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7125 0.9524 0.36470 0.16568
## Proportion of Variance 0.7331 0.2268 0.03325 0.00686
## Cumulative Proportion  0.7331 0.9599 0.99314 1.00000
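
The proportions in the summary can be recomputed from the standard deviations, which may help when reading the table. This is a minimal sketch; the names pc.var, prop.var and cum.var are chosen here for illustration.

```r
# Rebuild the objects from the text so this sketch runs on its own
log.ir <- log(iris[, 1:4])
ir.pca <- prcomp(log.ir, center = TRUE, scale. = TRUE)

# Variance of each PC is its squared standard deviation
pc.var <- ir.pca$sdev^2

# Share of the total variance explained by each PC
prop.var <- pc.var / sum(pc.var)

# Running total of the explained variance
cum.var <- cumsum(prop.var)

round(rbind(prop.var, cum.var), 4)
```

The second entry of cum.var corresponds to the 0.9599 shown in the summary output, i.e. the share of variance retained when only the first two PCs are kept.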

Predicting the PCs

We can use the predict function when we observe new data and want to compute their PC values. Just for illustration, pretend that the last two rows of the iris data have just arrived and we want to see what their PC values are.

# Predict PCs
predict(ir.pca, 
        newdata=tail(log.ir, 2))
##           PC1         PC2        PC3         PC4
## 149 1.0809930 -1.01155751 -0.7082289 -0.06811063
## 150 0.9712116 -0.06158655 -0.5008674 -0.12411524
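
As a sketch of what predict does here, the new rows can be standardized with the center and scale stored in the fitted object and then multiplied by the rotation matrix. The names new.obs, z.new and manual are chosen for this illustration.

```r
# Rebuild the objects from the text so this sketch runs on its own
log.ir <- log(iris[, 1:4])
ir.pca <- prcomp(log.ir, center = TRUE, scale. = TRUE)
new.obs <- tail(log.ir, 2)

# Standardize with the center and scale stored from the original fit,
# not with the mean and standard deviation of the new rows themselves
z.new <- scale(new.obs, center = ir.pca$center, scale = ir.pca$scale)

# Project the standardized rows onto the PCs
manual <- z.new %*% ir.pca$rotation

# Compare with the result of predict
all.equal(manual, predict(ir.pca, newdata = new.obs),
          check.attributes = FALSE)
```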

Inference: The output above gives the four PC values for observations 149 and 150, computed from the four log-transformed continuous variables of the iris data set.

Visualizing the PCs

The figure below is a biplot generated by the function ggbiplot of the ggbiplot package, available on GitHub.

library(ggbiplot)
g <- ggbiplot(ir.pca, obs.scale = 1, var.scale = 1, 
              groups = ir.species, ellipse = TRUE, 
              circle = TRUE)
g <- g + scale_color_discrete(name = '')
g <- g + theme(legend.direction = 'horizontal', 
               legend.position = 'top')
print(g)
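
If the ggbiplot package is not available, a similar picture can be sketched with base R graphics only. This is an alternative illustration rather than a reproduction of the ggbiplot figure; the colors and plotting symbols are arbitrary choices.

```r
# Rebuild the objects from the text so this sketch runs on its own
log.ir <- log(iris[, 1:4])
ir.species <- iris[, 5]
ir.pca <- prcomp(log.ir, center = TRUE, scale. = TRUE)

# Scatter plot of the first two PC scores, one color per species
plot(ir.pca$x[, 1:2],
     col = as.integer(ir.species), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(ir.species),
       col = seq_along(levels(ir.species)), pch = 19)

# Base R biplot: observations plus arrows for the variable loadings
biplot(ir.pca)
```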