This document explains how to plot PCA and clustering results using {ggplot2} and {ggfortify}.

Installation

First, install ggfortify from CRAN.

install.packages('ggfortify')

Plotting PCA (Principal Component Analysis)

{ggfortify} lets {ggplot2} know how to interpret PCA objects. After loading {ggfortify}, you can use the ggplot2::autoplot function for stats::prcomp and stats::princomp objects.

library(ggfortify)
df <- iris[c(1, 2, 3, 4)]
autoplot(prcomp(df))
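The text above names stats::princomp as well, but only prcomp is shown. A minimal sketch of the same call with a princomp object follows; the variable name p is only for illustration.

```r
library(ggfortify)

df <- iris[c(1, 2, 3, 4)]

# Same autoplot call, this time on a princomp fit instead of prcomp
p <- autoplot(princomp(df))
p
```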

The PCA result contains only numeric values. To colour points by a non-numeric column of the original data, pass the original data via the data keyword and name the column via the colour keyword. Use help(autoplot.prcomp) (or help(autoplot.*) for any other object) to check the available options.

autoplot(prcomp(df), data = iris, colour = 'Species')
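To see why the original data has to be passed, it can help to inspect the prcomp result directly. The sketch below uses only base R; the name pca is chosen for illustration.

```r
df <- iris[c(1, 2, 3, 4)]
pca <- prcomp(df)

# The scores matrix holds one column per principal component
colnames(pca$x)

# The grouping column is not part of the PCA result,
# so autoplot needs data = iris to find it
"Species" %in% colnames(pca$x)
```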

Passing label = TRUE draws a label for each data point using its row name.

autoplot(prcomp(df), data = iris, colour = 'Species', label = TRUE, label.size = 3)

Passing shape = FALSE draws the plot without points. In this case, labels are turned on unless otherwise specified.

autoplot(prcomp(df), data = iris, colour = 'Species', shape = FALSE, label.size = 3)

Passing loadings = TRUE draws eigenvectors.

autoplot(prcomp(df), data = iris, colour = 'Species', loadings = TRUE)

You can also attach eigenvector labels and change their colour and size.

autoplot(prcomp(df), data = iris, colour = 'Species',
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)
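The eigenvectors mentioned above are available in base R as the rotation component of a prcomp object. The following sketch, which assumes nothing beyond base R, checks that they agree with the eigenvectors of the covariance matrix up to sign. How autoplot scales the arrows when drawing them is not covered here.

```r
df <- iris[c(1, 2, 3, 4)]
pca <- prcomp(df)

# Each column is one loading vector; rows are the original variables
pca$rotation

# prcomp centres the data, so the columns should match the eigenvectors
# of the covariance matrix, up to a sign flip per column
ev <- eigen(cov(df))
same_vectors <- all.equal(abs(unname(pca$rotation)), abs(ev$vectors))
same_values <- all.equal(pca$sdev^2, ev$values)
same_vectors
same_values
```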

Plotting Factor Analysis

{ggfortify} supports stats::factanal objects in the same manner as PCA objects. The available options are the same as for PCA.

Important: You must specify the scores option when calling factanal so that the scores are calculated (the default is scores = 'none'). Otherwise, plotting will fail.

d.factanal <- factanal(state.x77, factors = 3, scores = 'regression')
autoplot(d.factanal, data = state.x77, colour = 'Income')

autoplot(d.factanal, label = TRUE, label.size = 3,
         loadings = TRUE, loadings.label = TRUE, loadings.label.size = 3)
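The scores requirement noted above can be checked in base R before plotting. This is a small sketch; the names fa_none and fa_reg are illustrative.

```r
# Without a scores argument, factanal does not compute scores
fa_none <- factanal(state.x77, factors = 3)
is.null(fa_none$scores)

# With scores = 'regression', one row of scores per observation is stored
fa_reg <- factanal(state.x77, factors = 3, scores = 'regression')
dim(fa_reg$scores)
```

Checking is.null(fit$scores) on your own fit is a quick way to tell whether it can be passed to autoplot.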

Plotting K-means

{ggfortify} supports the stats::kmeans class. You must explicitly pass the original data to autoplot via the data keyword, because a kmeans object doesn’t store the original data. The result is automatically coloured by cluster.

set.seed(1)
autoplot(kmeans(USArrests, 3), data = USArrests)

autoplot(kmeans(USArrests, 3), data = USArrests, label = TRUE, label.size = 3)
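Because kmeans starts from random centres, two separate calls may assign different cluster numbers, or even different clusters, when the seed is set only once. A sketch that stores the fit and reuses it for both plots is shown below; the name km is illustrative.

```r
library(ggfortify)

set.seed(1)
km <- kmeans(USArrests, 3)

# The fitted object holds cluster assignments and centres, not the data
names(km)

# Reusing the same fit keeps the clusters identical across plots
autoplot(km, data = USArrests)
autoplot(km, data = USArrests, label = TRUE, label.size = 3)
```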

Plotting cluster package

{ggfortify} supports the cluster::clara, cluster::fanny and cluster::pam classes. Because these objects store the original data, there is no need to pass it explicitly.

library(cluster)
autoplot(clara(iris[-5], 3))
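The stored data can be inspected on the fitted objects themselves. The sketch below assumes the default keep.data = TRUE setting of pam and clara; the names fit_pam and fit_clara are illustrative.

```r
library(cluster)

fit_pam <- pam(iris[-5], 3)
fit_clara <- clara(iris[-5], 3)

# Both objects carry the input data as a matrix in their data component
dim(fit_pam$data)
dim(fit_clara$data)
```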

Specifying frame = TRUE in autoplot for stats::kmeans and cluster::* objects draws a convex hull around each cluster.

autoplot(fanny(iris[-5], 3), frame = TRUE)

If you want a probability ellipse, {ggplot2} 1.0.0 or later is required. Pass any value supported by the type keyword of ggplot2::stat_ellipse via the frame.type option.

autoplot(pam(iris[-5], 3), frame = TRUE, frame.type = 'norm')
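The frame options above are described for stats::kmeans too, although only cluster objects are shown. The sketch below adds a kmeans example and one more ellipse type; 't' is one of the type values documented for ggplot2::stat_ellipse, and the names p_km and p_t are illustrative.

```r
library(ggfortify)
library(cluster)

# Frame around each k-means cluster (original data still required)
set.seed(1)
p_km <- autoplot(kmeans(USArrests, 3), data = USArrests, frame = TRUE)
p_km

# Ellipse based on a multivariate t-distribution instead of a normal one
p_t <- autoplot(pam(iris[-5], 3), frame = TRUE, frame.type = 't')
p_t
```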

Plotting Local Fisher Discriminant Analysis with {lfda} package

The {lfda} package supports a set of Local Fisher Discriminant Analysis methods. You can use autoplot to plot the analysis result in the same manner as PCA.

Thanks to the kind contribution of Yuan Tang, the author of the {lfda} package.

library(lfda)

# Local Fisher Discriminant Analysis (LFDA)
model <- lfda(iris[-5], iris[, 5], 4, metric="plain")
autoplot(model, data = iris, frame = TRUE, frame.colour = 'Species')

# Kernel Local Fisher Discriminant Analysis (KLFDA)
model <- klfda(kmatrixGauss(iris[-5]), iris[, 5], 4, metric="plain")
autoplot(model, data = iris, frame = TRUE, frame.colour = 'Species')

NOTE For the iris data set, the relationships between the classes are largely linear. Kernel Local Fisher Discriminant Analysis is aimed at capturing non-linear relationships, especially when there are many classes. The visualization of the iris data set is therefore poor, because klfda is too flexible for relationships that are linear. If klfda is used on this kind of data, a model later fitted to the transformed data for classification or clustering would very likely overfit.

# Semi-supervised Local Fisher Discriminant Analysis (SELF)
model <- self(iris[-5], iris[, 5], beta = 0.1, r = 3, metric="plain")
autoplot(model, data = iris, frame = TRUE, frame.colour = 'Species')