Prerequisites

This tutorial relies on the following R packages for plotting:

library(ggplot2)
library(ggrepel)
library(cowplot)

What are dimension reductions?

Dimension reduction is the process of reducing the number of variables (also referred to as features or, of course, dimensions) to a smaller set of new variables called principal variables. The main property of principal variables is that they preserve, to some extent, the structure and information carried by the original variables. Principal variables are ordered by importance, with the first variable preserving the most structure and subsequent variables preserving successively less. In other words, applying dimension reduction to a data set creates a data set with the same number of observations but fewer variables, whilst maintaining the general β€œflavor” πŸ‘… of the data set.

There are many different dimension reduction techniques that all try to find a representation of the data in a lower dimensional space that retains as much information as possible. In general there are two classes of dimension reduction:

Feature Elimination removes some variables completely if they are redundant with another variable or do not provide any new information about the data set. The advantage of feature elimination is that it is simple to implement and makes our data set smaller, keeping only the variables we are interested in. The disadvantage is that we may lose information carried by the dropped variables.

Feature Extraction, on the other hand, forms new variables from the old variables. The advantage of feature extraction is that it preserves most of the structure of your original data set. The disadvantage is that the newly created variables lose their interpretation.

In this tutorial we will look at some of the most commonly applied dimension reduction methods from the feature extraction class, and discuss general pitfalls and effective use of these techniques. Feature elimination will not be discussed in this tutorial, but good discussions/tutorials can be found here and here.

Why use dimension reductions?

Dimension reduction via feature extraction (hereafter referred to as simply dimension reduction) is an extremely powerful statistical tool and thus is frequently applied. Some of the main uses of dimension reduction are:

  • simplification
  • denoising
  • variable selection
  • visualization

However, to really understand the power of even simple dimension reduction techniques, let us consider the real data set cereals. The data set contains 77 cereals described by 15 variables:

  • manufacturer (categorical)
  • type (categorical)
  • number of calories (numeric)
  • grams of protein (numeric)
  • grams of fat (numeric)
  • milligrams of sodium (numeric)
  • grams of dietary fiber (numeric)
  • grams of complex carbohydrates (numeric)
  • grams of sugars (numeric)
  • grams of potassium (numeric)
  • percentage of FDA recommended vitamins (numeric)
  • display shelf (categorical)
  • weight in ounces (numeric)
  • number of cups in one serving (numeric)
  • rating of the cereal (numeric)
library(plspm)
data("cereals")
head(cereals)
##                           mfr type calories protein fat sodium fiber carbo
## 100%_Bran                   N    C       70       4   1    130  10.0   5.0
## 100%_Natural_Bran           Q    C      120       3   5     15   2.0   8.0
## All-Bran                    K    C       70       4   1    260   9.0   7.0
## All-Bran_with_Extra_Fiber   K    C       50       4   0    140  14.0   8.0
## Almond_Delight              R    C      110       2   2    200   1.0  14.0
## Apple_Cinnamon_Cheerios     G    C      110       2   2    180   1.5  10.5
##                           sugars potass vitamins shelf weight cups
## 100%_Bran                      6    280       25     3      1 0.33
## 100%_Natural_Bran              8    135        0     3      1 1.00
## All-Bran                       5    320       25     3      1 0.33
## All-Bran_with_Extra_Fiber      0    330       25     3      1 0.50
## Almond_Delight                 8     -1       25     3      1 0.75
## Apple_Cinnamon_Cheerios       10     70       25     1      1 0.75
##                             rating
## 100%_Bran                 68.40297
## 100%_Natural_Bran         33.98368
## All-Bran                  59.42551
## All-Bran_with_Extra_Fiber 93.70491
## Almond_Delight            34.38484
## Apple_Cinnamon_Cheerios   29.50954

For the moment, let’s set aside the categorical variables. Even then we have 12 variables.

numeric_vars <- c(
  "calories", "protein", "fat", "sodium", "fiber", "carbo",
  "sugars", "potass", "vitamins", "weight", "cups", "rating"
)

cereals_num <- cereals[, numeric_vars]
ncol(cereals_num)
## [1] 12

(If you have problems with the dataset you can also download it here.)

Now let’s imagine you are interested in two particular subsets of cereals: the fun ones 🎑 and the shredded wheat πŸ’ͺ ones.

fun_cereals <- c(
  "Lucky_Charms", "Count_Chocula", "Froot_Loops",
  "Frosted_Flakes", "Cocoa_Puffs", "Cinnamon_Toast_Crunch"
)

shredded_wheat_cereals <- c(
  "Shredded_Wheat", "Shredded_Wheat_'n'Bran",
  "Shredded_Wheat_spoon_size"
)

cereals_num_sub <- match(
  c(fun_cereals, shredded_wheat_cereals),
  rownames(cereals_num)
)
cereals_num$classification <- "normal"
cereals_num$classification[match(fun_cereals, rownames(cereals_num))] <- "fun"
cereals_num$classification[match(
  shredded_wheat_cereals,
  rownames(cereals_num)
)] <- "shredded"

cereals_num$label <- ""
cereals_num$label[cereals_num_sub] <- rownames(cereals_num)[cereals_num_sub]

Imagine you want to answer the following questions regarding those:

  • Are cereals in the fun group 🎑 alike?
  • Are cereals in the shredded group πŸ’ͺ alike?
  • How different are cereals in the fun group 🎒 and the shredded group πŸ’ͺ?
  • Are there other cereals which are like the ones in the fun group 🎑?

Answering these with simple visualization techniques is incredibly hard and often leads to unsatisfactory answers. To demonstrate this, we will consider just fat, calories, sodium and sugars. We can see that in terms of sugar content the shredded cereals πŸ’ͺ are alike and so are the fun cereals 🎒, but the two groups are very different from each other. However, in terms of fat content they are more similar. So are they similar or different? And how would our assessment change if we added the other 8 numerical variables in the cereals data set that we have currently set aside?

library(GGally)
ggpairs(cereals_num,
  columns = c("fat", "calories", "sodium", "sugars"),
  ggplot2::aes(colour = classification)
)

Clearly, we need a way to visualize most of the complexity represented across all dimensions in just two or three dimensions. Fortunately, that is exactly what dimension reduction does. Here, we apply a simple dimension reduction technique called principal component analysis (PCA) to all 12 numerical variables in the data set. This does indeed confirm that the shredded wheat cereals πŸ’ͺ are very similar in the two most variable directions of the space created by the data (i.e.Β the directions where the data is most spread out). They are also very different from the fun cereals 🎑.

However, also note that the newly created variables have lost their interpretation. They represent combinations of the 12 numerical variables. So we cannot know for what reasons (i.e. precisely which variables) the fun cereals 🎑 are similar. We only know that, considering all 12 numerical variables, the fun cereals 🎑 are similar.

prcomp_cereals_num <- prcomp(cereals_num[, 1:12])
pca_cereals_num <- data.frame(
  PC1 = prcomp_cereals_num$x[, 1],
  PC2 = prcomp_cereals_num$x[, 2],
  label = cereals_num$label,
  classification = cereals_num$classification
)

ggplot(pca_cereals_num, aes(x = PC1, y = PC2, label = label, col = classification)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)

Using the principal component representation, we can now easily find cereals that are similar to the fun cereals 🎑: they will have principal component values similar to those of the fun cereals 🎒. According to the first two principal components, other fun cereals could be Honeycomb, Cap’n’Crunch or Honey Graham Ohs. Notice that even the packaging of these cereals is much more similar to the packaging of the fun cereals 🎑 than to that of the shredded cereals πŸ’ͺ.

cereals_num_int <- which(pca_cereals_num$PC1 > 0 &
  pca_cereals_num$PC1 < 75 &
  pca_cereals_num$PC2 > 25)

pca_cereals_num$label2 <- ""
pca_cereals_num$label2[cereals_num_int] <- rownames(cereals_num)[cereals_num_int]
pca_cereals_num$label2[cereals_num_sub] <- ""

ggplot(pca_cereals_num, aes(x = PC1, y = PC2, label = label2, col = classification)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)

A quick word about the output of the function prcomp that calculates the principal component analysis. The output is an object that contains multiple results, but most notably:

  • $sdev The standard deviation of every principal component
  • $rotation The relationship between the initial variables and the principal components. This may be of interest to you for feature selection purposes, etc.
  • $x The values of the principal components for every sample. This will be the most interesting to you, as it can be used for visualization purposes, etc.

For more information on this, run ?prcomp πŸ’». Also note that most other implementations of dimension reduction techniques in R output multiple results. In order to understand the output use a combination of search and the R function str, as demonstrated below.

str(prcomp_cereals_num)
## List of 5
##  $ sdev    : num [1:12] 84.83 71.37 22.38 18.87 8.63 ...
##  $ rotation: num [1:12, 1:12] 0.076192 -0.001398 -0.000117 0.983022 -0.004605 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:12] "calories" "protein" "fat" "sodium" ...
##   .. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:12] 106.88 2.55 1.01 159.68 2.15 ...
##   ..- attr(*, "names")= chr [1:12] "calories" "protein" "fat" "sodium" ...
##  $ scale   : logi FALSE
##  $ x       : num [1:77, 1:12] -54.2 -147.7 70 -53.2 50.6 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:77] "100%_Bran" "100%_Natural_Bran" "All-Bran" "All-Bran_with_Extra_Fiber" ...
##   .. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"
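For a quick overview of how much variance each principal component captures, you can also call summary on the prcomp object; it reports the standard deviation, proportion of variance and cumulative proportion for every component (we will come back to this in Tip 2):

# proportion of variance explained by each principal component
summary(prcomp_cereals_num)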

The most common dimension reduction techniques

We will now go over some of the most common dimension reduction techniques. I will try to point out their common use cases, their strengths and their weaknesses. This is by no means an exhaustive list: dimension reduction is a very active area of research, and its close ties with machine learning have led to an explosion of new methods.

Principal component analysis (PCA)

Principal component analysis (PCA) is a statistical procedure that uses a rotation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. PCA is mostly used as a tool in exploratory data analysis and for making predictive models. It is also often a vital first step for more complex dimension reductions.

We have already applied PCA to the data in the earlier example.

Strengths

πŸ‘ can be efficiently applied to large data sets

πŸ˜„ has many extensions that deal with special situations, e.g. sparse PCA, which yields sparse and more interpretable loadings, and kernel PCA, which can capture nonlinear structure and provide more robust solutions

πŸ˜† preserves global structures

Weaknesses

😨 may suffer from scale problems, i.e. when one variable dominates the variance simply because it is measured on a larger numeric scale (for example, measured in mm instead of km); some of these scale problems can be dealt with simply by centering and scaling (see the sketch after this list)

😨 suffers from crowding in the presence of large numbers of observations, i.e. points do not form clusters but instead occupy the entire plotting space

πŸ‘Ώ is susceptible to big outliers
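To illustrate the scale issue, here is a minimal sketch showing how centering and scaling can be switched on directly in prcomp for the cereal data. In the unscaled analysis sodium, which is measured in milligrams and therefore has by far the largest numbers, carries almost all of the weight of the first component (see the $rotation output above); after scaling, every variable contributes on an equal footing.

# PCA on the raw cereal variables versus centred and scaled variables
prcomp_raw <- prcomp(cereals_num[, 1:12])
prcomp_scaled <- prcomp(cereals_num[, 1:12], center = TRUE, scale. = TRUE)

# compare the loadings of the first principal component:
# sodium dominates the unscaled analysis but not the scaled one
round(cbind(raw = prcomp_raw$rotation[, 1], scaled = prcomp_scaled$rotation[, 1]), 2)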

Multidimensional scaling (MDS)

Multidimensional scaling (MDS) is an algorithm that, given a distance matrix containing the distances between each pair of objects in a set and a chosen number of dimensions N, places each object into N-dimensional space such that the between-object distances are preserved as well as possible. MDS is most often used as a visualization tool. It is best suited to problems that involve real distances, e.g.Β city distances or geometric distances. It yields the same result as PCA when Euclidean distances are used.

dist_cereals_num <- dist(cereals_num[, 1:12], method = "canberra")
mds_cereals_num <- cmdscale(dist_cereals_num, eig = TRUE, k = 2)

mds_cereals_num <- data.frame(
  MDS1 = mds_cereals_num$points[, 1],
  MDS2 = mds_cereals_num$points[, 2],
  label = cereals_num$label,
  classification = cereals_num$classification
)

ggplot(mds_cereals_num, aes(
  x = MDS1, y = MDS2,
  label = label, col = classification
)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)

Let’s again check which other cereals are similar to the fun cereals 🎑. We can see that all three cereals previously identified - Honeycomb, Cap’n’Crunch and Honey Graham Ohs - are again placed close to the fun ones 🎒. This points to the inherent similarity of MDS and PCA in simple cases.

cereals_num_int <- which(mds_cereals_num$MDS1 < -0.5 &
  mds_cereals_num$MDS2 < 0 &
  mds_cereals_num$MDS2 > -2.5)

mds_cereals_num$label2 <- ""
mds_cereals_num$label2[cereals_num_int] <- rownames(cereals_num)[cereals_num_int]
mds_cereals_num$label2[cereals_num_sub] <- ""

ggplot(mds_cereals_num, aes(
  x = MDS1, y = MDS2,
  label = label2, col = classification
)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)
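The equivalence with PCA mentioned above can be checked directly: with Euclidean distances, classical MDS reproduces the PCA scores up to sign flips of the axes. A minimal check on the cereal data:

# classical MDS on Euclidean distances reproduces the PCA scores (up to sign)
mds_euclid <- cmdscale(dist(cereals_num[, 1:12]), k = 2)
pca_scores <- prcomp(cereals_num[, 1:12])$x[, 1:2]

# correlations of matching dimensions should be +1 or -1
round(diag(cor(mds_euclid, pca_scores)), 3)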

Strengths

πŸ˜† flexible as any distance metric can be used

πŸ˜‚ preserves global structures

Weaknesses

😠 computationally demanding and extremely inefficient for large numbers of observations

😨 suffers from crowding in the presence of a large number of observations

Independent component analysis (ICA)

Independent component analysis (ICA) attempts to decompose a multivariate signal into statistically independent non-Gaussian signals. ICA is important for blind signal separation and has many practical applications, the most common of which involve sound signals. A classic example is the β€œcocktail party problem”, where the underlying speech signals are separated from sample data consisting of people talking simultaneously in a room. Usually the problem is simplified by assuming no time delays or echoes. Note that a filtered and delayed signal is a copy of a dependent component, and thus the statistical independence assumption is not violated.
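Before applying ICA to the cereal data, here is a small, purely illustrative sketch of blind source separation with icafast: two non-Gaussian sources (a sine wave and a square wave) are mixed into two observed signals and then recovered, up to sign and ordering. The signals and mixing weights are made up for this example.

library(ica)

tt <- seq(0, 10, length.out = 500)
s1 <- sin(2 * tt)                 # source 1: sine wave
s2 <- sign(sin(5 * tt))           # source 2: square wave
mixed <- cbind(
  0.6 * s1 + 0.4 * s2,            # observed mixture 1
  0.3 * s1 + 0.7 * s2             # observed mixture 2
)

unmixed <- icafast(mixed, nc = 2)$S   # estimated source signals
round(cor(cbind(s1, s2), unmixed), 2) # each source matches one component

Returning to the cereal data, we now apply ICA to the 12 numeric variables and keep two components: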

library(ica)
ica_cereals_num <- icafast(cereals_num[, 1:12], 2,
  center = TRUE, maxit = 100,
  tol = 1e-6
)

ica_cereals_num <- data.frame(
  ICA1 = ica_cereals_num$Y[, 1],
  ICA2 = ica_cereals_num$Y[, 2],
  label = cereals_num$label,
  classification = cereals_num$classification
)

ggplot(ica_cereals_num, aes(
  x = ICA1, y = ICA2,
  label = label, col = classification
)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)

Once again, check which other cereals are similar to the fun cereals 🎒. As before, the selection of cereals similar to the fun ones includes the three previously identified.

It should be noted that applying ICA in this scenario is likely to violate one of its key assumptions: it is quite plausible that at least one of the 12 variables is approximately normally distributed.

cereals_num_int <- which(ica_cereals_num$ICA1 > 0.25 &
  ica_cereals_num$ICA1 < 1 &
  ica_cereals_num$ICA2 < 2 &
  ica_cereals_num$ICA2 > 0)

ica_cereals_num$label2 <- ""
ica_cereals_num$label2[cereals_num_int] <- rownames(cereals_num)[cereals_num_int]
ica_cereals_num$label2[cereals_num_sub] <- ""

ggplot(ica_cereals_num, aes(
  x = ICA1, y = ICA2,
  label = label2, col = classification
)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)

Strengths

πŸ˜‚ separates multiple independent sources of signal

πŸ˜€ can be efficiently applied to large data sets

πŸ‘ preserves global structures

Weaknesses

πŸ’€ has very specific assumptions, especially that none of the independent sources are normally distributed

😠 suffers from crowding in the presence of a large number of observations

t-distributed stochastic neighbor embedding (t-SNE)

t-distributed stochastic neighbor embedding (t-SNE) is a nonlinear dimension reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. The main application of t-SNE is the visualization of large amounts of data, as it can recover well-separated clusters.

library(Rtsne)
tsne_cereals_num <- Rtsne(cereals_num[, 1:12],
  pca = FALSE, perplexity = 10,
  theta = 0.0
)

tsne_cereals_num <- data.frame(
  TSNE1 = tsne_cereals_num$Y[, 1],
  TSNE2 = tsne_cereals_num$Y[, 2],
  label = cereals_num$label,
  classification = cereals_num$classification
)

ggplot(tsne_cereals_num, aes(
  x = TSNE1, y = TSNE2,
  label = label, col = classification
)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)

Once again, check which other cereals are similar to the fun cereals 🎒. Note that this time we get fewer cereals because the clustering is more pronounced than in the earlier dimension reductions. Even so, all three previously identified cereals appear once again.

cereals_num_int <- which(tsne_cereals_num$TSNE2 < -10 &
  tsne_cereals_num$TSNE2 > -45 &
  tsne_cereals_num$TSNE1 < 25 &
  tsne_cereals_num$TSNE1 > 10)

tsne_cereals_num$label2 <- ""
tsne_cereals_num$label2[cereals_num_int] <- rownames(cereals_num)[cereals_num_int]
tsne_cereals_num$label2[cereals_num_sub] <- ""

ggplot(tsne_cereals_num, aes(
  x = TSNE1, y = TSNE2,
  label = label2, col = classification
)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)

Strengths

πŸ₯° recovers well-separated clusters

πŸ’ƒ preserves local neighborhoods

πŸ‘Ό no crowding of observations

Weaknesses

πŸ™ stochastic, so unless you set a seed your results are not reproducible

πŸ˜’ computationally expensive with extremely large numbers of observations

😒 one parameter (the perplexity) that can be very influential (see [Pitfalls of dimension reduction])

πŸ˜’ can produce artifacts (see [Pitfalls of dimension reduction])

πŸ—― no preservation of global structures

Uniform manifold approximation and projection (UMAP)

Uniform Manifold Approximation and Projection (UMAP) produces a low-dimensional representation of high-dimensional data that preserves relevant structure. It searches for a low-dimensional projection of the data that has the closest possible equivalent fuzzy topological structure to the original high-dimensional data. UMAP is similar to t-SNE but preserves more global structure in the data. Its main use is the visualization of large numbers of observations, as it recovers well-separated clusters.

library(uwot)
umap_cereals_num <- umap(cereals_num[, 1:12],
  n_neighbors = 15,
  min_dist = 1, spread = 5
)

umap_cereals_num <- data.frame(
  UMAP1 = umap_cereals_num[, 1],
  UMAP2 = umap_cereals_num[, 2],
  label = cereals_num$label,
  classification = cereals_num$classification
)

ggplot(umap_cereals_num, aes(
  x = UMAP1, y = UMAP2,
  label = label, col = classification
)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)

Once again, check which other cereals are similar to the fun cereals 🎑. Note that the fun cereals in this representation are much more spread out, so we have a harder time identifying cereals similar to them. UMAP works best when we have a large number of variables. Clearly, that is not the case here, so we may want to reconsider using UMAP for this data set.

cereals_num_int <- which(umap_cereals_num$UMAP1 < 2 &
  umap_cereals_num$UMAP1 > -10 &
  umap_cereals_num$UMAP2 > 2 &
  umap_cereals_num$UMAP2 < 12)

umap_cereals_num$label2 <- ""
umap_cereals_num$label2[cereals_num_int] <- rownames(cereals_num)[cereals_num_int]
umap_cereals_num$label2[cereals_num_sub] <- ""

ggplot(umap_cereals_num, aes(
  x = UMAP1, y = UMAP2,
  label = label2, col = classification
)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)

Strengths

πŸ’ƒ recovers well-separated clusters

πŸ‘ computationally efficient

πŸ˜„ preserves global distances

Weaknesses

πŸ‘Ž somewhat stochastic

😱 parameters to optimize

⚑ can produce artifacts

Autoencoders

Autoencoders are neural networks that are trained to reconstruct their original inputs. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimension reduction. This is achieved by designing a neural network with a bottleneck, which forces a compressed representation, and training the network to ignore signal β€œnoise”. The idea of autoencoders has been popular in the field of neural networks for decades, and they are most commonly used for denoising pictures or sound. As with any neural network, there is a lot of flexibility in how autoencoders can be constructed, such as the number of hidden layers and the training strategy (e.g.Β variational autoencoders). Hence there are many different implementations of autoencoders for the purpose of dimension reduction. Here we will look at RUTA (the ruta package); however, if you want to learn how to construct your own autoencoder in R, please refer to this.

library(ruta)

x <- as.matrix(cereals_num[, 1:12])
ruta_cereals_num <- autoencode(scale(x), 2, type = "robust", 
                               activation = "tanh")

ruta_cereals_num <- data.frame(
  RUTA1 = ruta_cereals_num[, 1],
  RUTA2 = ruta_cereals_num[, 2],
  label = cereals_num$label,
  classification = cereals_num$classification
)

ggplot(ruta_cereals_num, aes(x = RUTA1, y = RUTA2, 
       label = label, col = classification)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)

We do not check the fun 🎑 cereals for RUTA, since its results appear unstable. Like UMAP, RUTA does best when there is a large number of variables. Furthermore, it requires a large number of observations to work effectively.

Strengths

πŸ˜„ preserves global data structures

😏 new points can be added without recomputing

Weaknesses

πŸ˜’ large number of observations required for effective training

😫 computationally more expensive

☠️ stochastic

😦 parameter selection, especially neural network selection and training

Potential of heat-diffusion for affinity-based trajectory embedding (PHATE)

Potential of heat-diffusion for affinity-based trajectory embedding (PHATE) is a non-linear embedding algorithm, like UMAP and autoencoders. PHATE simultaneously denoises the data and emphasizes the continuous nature of any underlying progressions and trajectories. It naturally uncovers branching progression structures in the data in very low dimensions to enable visualization. PHATE is typically used for visualization, but it can also be used to recover trajectories.

library(phateR)

x <- as.matrix(cereals_num[, 1:12])
phate_cereals_num <- phate(scale(x))

phate_cereals_num <- data.frame(
  PHATE1 = phate_cereals_num$embedding[, 1],
  PHATE2 = phate_cereals_num$embedding[, 2],
  label = cereals_num$label,
  classification = cereals_num$classification
)

ggplot(phate_cereals_num, aes(x = PHATE1, y = PHATE2, 
       label = label, col = classification)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)

Once again, check which other cereals are similar to the fun cereals 🎑. Note that this time we see an extreme separation between the fun 🎒 and the shredded πŸ’ͺ cereals. We again recover all of the previously identified cereals.

cereals_num_int <- which(phate_cereals_num$PHATE1 > 0)

phate_cereals_num$label2 <- ""
phate_cereals_num$label2[cereals_num_int] <- rownames(cereals_num)[cereals_num_int]
phate_cereals_num$label2[cereals_num_sub] <- ""

ggplot(phate_cereals_num, aes(
  x = PHATE1, y = PHATE2,
  label = label2, col = classification
)) +
  geom_point() +
  ggrepel::geom_text_repel(cex = 2.5)

Strengths

πŸ˜„ preserves global structures

πŸ˜„ computationally fast

Weaknesses

πŸ˜’ large number of observations required

😦 parameter selection

Some more considerations regarding dimension reductions

Choosing the right dimension reduction technique

The difficulty in applying dimension reduction is that each dimension reduction technique is designed to maintain certain aspects of the original data and therefore may be appropriate for one task and inappropriate for another. For many applications it can be difficult to objectively find the right method or parameterization for the dimension reduction task. To find the right dimension reduction technique for your data, consider the nature and resolution of your data. Are you interested in (or only able to capture) small-scale relationships, or are you focusing on long-range relationships? In the first case you may want to consider only dimension reduction techniques that preserve local structure, and in the second case you should focus on techniques that preserve global structure.

To understand whether your chosen algorithm distorts distances, you can make use of the dimRed package. In particular, this package provides a useful visualization showing the β€œcorrectness” of the dimension reduction (R_NX) for different numbers of nearest neighbors (K). You can use it to see whether distances get distorted as you vary the number of neighbors considered. Here you can see that for an example as simple as the cereal data set, straightforward approaches like PCA and ICA are much better at preserving distances than UMAP. Note that this package does not provide every dimension reduction technique, but it is excellent for developing an intuition for your data.

library(dimRed)
embed_methods <- c("PCA", "FastICA", "UMAP")

data_emb <- lapply(embed_methods, function(x)
  embed(cereals_num[, 1:12], x))
names(data_emb) <- embed_methods
plot_R_NX(data_emb)

Unsung uses of dimension reductions

Dimension reduction techniques can be used to create representations of data that are significantly smaller and thus use less memory. Simply put, by taking advantage of multicollinearity we can remove mostly redundant information. However, once removed, this information cannot be restored. Also, your new data will have variables that represent combinations of your original variables. So use with care πŸ’….
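As a rough illustration with the cereal data, storing only the first four principal components takes a fraction of the space of the full numeric matrix (exact sizes depend on your system, and four components is an arbitrary choice here):

# store only the first 4 principal components instead of all 12 numeric variables
full_data <- as.matrix(cereals_num[, 1:12])
reduced_data <- prcomp(full_data)$x[, 1:4]

c(
  full = format(object.size(full_data), units = "Kb"),
  reduced = format(object.size(reduced_data), units = "Kb")
)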

Pitfalls of dimension reductions

Although extremely useful, dimension reduction can sometimes be mysterious or misleading. In order to use it effectively, it is good to develop an intuition for it and to be aware of its pitfalls.

Parameter selection

You may have noticed that some of the dimension reduction techniques, especially the more complex ones, require parameters to be set. These parameters control important aspects of the dimension reduction and can strongly influence your resulting perception of the data. To demonstrate this, we will look at t-SNE and its tuneable parameter perplexity. Perplexity controls how to balance attention between local 🏠 and global 🌐 aspects of your data. In particular, notice that when the perplexity is low, arbitrary small clusters start forming. Also notice that the distance between the shredded πŸ’ͺ and fun cereals 🎒 changes as the perplexity changes. This points to the problem that distances, especially long distances, in t-SNE plots are meaningless and should not be interpreted.

perplexity <- seq(2, 20, 4)
tsne_perp_cereals_num <- lapply(perplexity, function(x)
  Rtsne(cereals_num[, 1:12], pca = FALSE, perplexity = x, theta = 0.0))

tsne_perp_cereals_num <- lapply(tsne_perp_cereals_num, function(x)
  data.frame(
    TSNE1 = x$Y[, 1],
    TSNE2 = x$Y[, 2],
    label = cereals_num$label,
    classification = cereals_num$classification
  ))

all_plots <- lapply(tsne_perp_cereals_num, function(x) {
  pp <- ggplot(x, aes(x = TSNE1, y = TSNE2, col = classification)) +
    geom_point() + guides(col = FALSE)
  pp
})
plot_grid(plotlist = all_plots, labels = perplexity, label_size = 12)

Note that other dimension reduction techniques, such as UMAP, also have parameters that have a strong influence on the results.
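As a sketch of this, we can vary UMAP’s n_neighbors parameter in the same way we varied the perplexity above; the chosen values are arbitrary, but the layout of the embedding changes noticeably with them.

n_neighbors <- c(5, 15, 30, 60)
umap_nn_cereals_num <- lapply(n_neighbors, function(nn)
  umap(cereals_num[, 1:12], n_neighbors = nn, min_dist = 1, spread = 5))

all_plots <- lapply(umap_nn_cereals_num, function(x) {
  pp <- ggplot(
    data.frame(
      UMAP1 = x[, 1],
      UMAP2 = x[, 2],
      classification = cereals_num$classification
    ),
    aes(x = UMAP1, y = UMAP2, col = classification)
  ) +
    geom_point() + guides(col = FALSE)
  pp
})
plot_grid(plotlist = all_plots, labels = n_neighbors, label_size = 12)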


πŸ‘©β€πŸ’» Advanced Exerice πŸ‘¨β€πŸ’»

Use the package dimRed in order to plot the distance distortion experienced for a range of perplexity values when applying t-SNE to the cereals data set.


Reproducibility

Some of the more complex dimension reduction techniques, in particular t-SNE, have random components 🚷. Thus, even when you run them twice with exactly the same parameters, you will obtain slightly different answers. Because of this inherent randomness, it is also important to run such dimension reduction techniques multiple times in order to assess the robustness of the results.

tsne_ran_cereals_num <- lapply(1:2, function(x)
  Rtsne(cereals_num[, 1:12], pca = FALSE, perplexity = 10, theta = 0.0))

tsne_ran_cereals_num <- lapply(tsne_ran_cereals_num, function(x)
  data.frame(
    TSNE1 = x$Y[, 1],
    TSNE2 = x$Y[, 2],
    label = cereals_num$label,
    classification = cereals_num$classification
  ))

all_plots <- lapply(tsne_ran_cereals_num, function(x) {
  pp <- ggplot(x, aes(x = TSNE1, y = TSNE2, col = classification)) +
    geom_point() + guides(col = FALSE)
  pp
})
plot_grid(plotlist = all_plots)

Unfortunately, the inherent randomness of some dimension reduction techniques makes them a real headache πŸ€’ for reproducible analysis. To ensure that you obtain the same results every time you will need to set a seed using the function set.seed.

tsne_ran_cereals_num <- lapply(1:2, function(x) {
  set.seed(1)
  Rtsne(cereals_num[, 1:12], pca = FALSE, perplexity = 10, theta = 0.0)
})

tsne_ran_cereals_num <- lapply(tsne_ran_cereals_num, function(x)
  data.frame(
    TSNE1 = x$Y[, 1],
    TSNE2 = x$Y[, 2],
    label = cereals_num$label,
    classification = cereals_num$classification
  ))

all_plots <- lapply(tsne_ran_cereals_num, function(x) {
  pp <- ggplot(x, aes(x = TSNE1, y = TSNE2, col = classification)) +
    geom_point() + guides(col = FALSE)
  pp
})
plot_grid(plotlist = all_plots)

Random noise can look like a pattern

Another common pitfall of dimension reduction is that random noise might look like a pattern. Recognizing noise when you see it is a critical skill, but it takes time to build up the right intuitions. A tricky thing about more complex dimension reductions is that they throw a lot of existing intuition out the window. We will demonstrate this by simulating 100 random Gaussian observations (mean 1, standard deviation 1) in 100 dimensions.

set.seed(10)
sim_data <- sapply(1:100, function(x) rnorm(100, 1, 1))

tsne_sim_data <- Rtsne(sim_data, pca = FALSE, perplexity = 2, theta = 0.0)

tsne_sim_data <- data.frame(
  TSNE1 = tsne_sim_data$Y[, 1],
  TSNE2 = tsne_sim_data$Y[, 2]
)

pca_sim_data <- prcomp(sim_data)

pca_sim_data <- data.frame(
  PCA1 = pca_sim_data$x[, 1],
  PCA2 = pca_sim_data$x[, 2]
)

p1 <- ggplot(tsne_sim_data, aes(x = TSNE1, y = TSNE2)) +
  geom_point() + ggtitle("TSNE")
p2 <- ggplot(pca_sim_data, aes(x = PCA1, y = PCA2)) +
  geom_point() + ggtitle("PCA")

plot_grid(p1, p2)


πŸ‘¨β€πŸ’» Advanced Exerice πŸ‘¨β€πŸ’»

Investigate this phenomenon with other types of non-Gaussian random noise and for other complex dimension reductions.


5 tips for effective use of dimension reductions

In this last section I will present some tips on how to work effectively with dimension reduction techniques.

Tip 1: Preprocess input data

We have already touched upon the need to scale the data in order to ensure that each variable contributes equally. However, depending on the nature of the data, other transformations may be required. For example, if changes in your data are multiplicative, e.g., your variables measure percent increase/decrease, you should consider using a log-transform before applying dimension reductions such as PCA.
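As a small, hypothetical sketch: suppose we regard changes in sodium as multiplicative; we could then log-transform that variable (log1p is used to cope with zero values) and standardise all variables before running PCA. Whether a log-transform is appropriate for this particular variable is an assumption made purely for illustration.

# hypothetical preprocessing: log-transform sodium, then centre and scale before PCA
cereals_pre <- as.matrix(cereals_num[, 1:12])
cereals_pre[, "sodium"] <- log1p(cereals_pre[, "sodium"])

prcomp_pre <- prcomp(cereals_pre, center = TRUE, scale. = TRUE)
summary(prcomp_pre)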

Tip 2: The right number of dimensions to retain

So far we have only looked at the first two dimensions obtained after dimension reduction. However, this was merely a choice. In fact, dimensions beyond the first two often contain vital information. One way to check this is to look at the variance explained by each principal component; it is good practice when using PCA to state the variance explained by every principal component plotted. That being said, it makes little sense to present dimension reduction plots with arbitrarily chosen dimensions, like dimension 2 versus dimension 5, if you are not also willing to present all preceding plots or to explain why those are not of interest.

prcomp_cereals_num <- prcomp(cereals_num[, 1:12])
# variance explained is based on the squared standard deviations (i.e. the variances)
var_exp <- round(prcomp_cereals_num$sdev[1:4]^2 / sum(prcomp_cereals_num$sdev^2) * 100, 2)
pca_cereals_num <- data.frame(
  PC1 = prcomp_cereals_num$x[, 1],
  PC2 = prcomp_cereals_num$x[, 2],
  PC3 = prcomp_cereals_num$x[, 3],
  PC4 = prcomp_cereals_num$x[, 4],
  label = cereals_num$label,
  classification = cereals_num$classification
)

comb <- list(
  first = c(1, 2), second = c(1, 3), third = c(1, 4), fourth = c(2, 3),
  fifth = c(2, 4), sixth = c(3, 4)
)

gg <- lapply(comb, function(x) {
  ggplot(pca_cereals_num, aes_string(
    x = paste0("PC", x[1]),
    y = paste0("PC", x[2]), col = "classification"
  )) +
    geom_point() + guides(col = FALSE) + ylab(paste0(
      "PC", x[2], ":(",
      var_exp[x[2]], "%)"
    )) + xlab(paste0("PC", x[1], ":(", var_exp[x[1]], "%)"))
})

plot_grid(plotlist = gg)

So how do you choose a suitable number of dimensions?

Obviously this depends on your data and will require you to look at it πŸ€“. For dimension reduction techniques based on spectral decomposition, such as PCA, you can use the distribution of the eigenvalues and find the point where the explained variance drops off sharply. Typically this is done via a scree plot. For the cereals data, this drop occurs after 2 dimensions.

var_exp <- data.frame(
  "Variance explained" =
    prcomp_cereals_num$sdev^2 / sum(prcomp_cereals_num$sdev^2) * 100,
  "Number of dimension" = 1:length(prcomp_cereals_num$sdev),
  check.names = FALSE
)

ggplot(var_exp, aes(x = `Number of dimension`, y = `Variance explained`)) +
  geom_col() + geom_line()

Note that this step is particularly important when dimension reduction is applied as a preprocessing step preceding statistical analyses or machine learning tasks (e.g., clustering). However, even when your primary goal is data visualization, in which only two or three axes can be displayed at a time, you still need to select a sufficient number of new features to generate.

Tip 3: Check the robustness of your results

We have already seen that parameters can seriously affect the results of dimension reduction techniques. If you are applying a dimension reduction technique with parameters, it is highly advisable to check how results change when parameters are varied. Only patterns that are stable across a range of parameters can be trusted.

We have also mentioned that outliers can have a big effect on linear dimension reduction techniques, such as PCA. You should be mindful of situations where all of the points in a plot are located close to the origin (the center of the plot), with only one or a few points lying very far away, as this indicates that the dimension reduction is dominated by outliers. Before continuing, you will need to reconsider your quality control metrics and possibly remove these points.
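One quick, heuristic way to spot such dominating points is to look at how far each observation lies from the origin in the first few principal components; the three-standard-deviation threshold used below is an arbitrary choice for illustration.

# distance of every cereal from the origin in the first two (scaled) principal components
pc_scores <- prcomp(cereals_num[, 1:12], scale. = TRUE)$x[, 1:2]
pc_dist <- sqrt(rowSums(pc_scores^2))

# flag observations lying unusually far out (threshold chosen for illustration)
rownames(cereals_num)[pc_dist > mean(pc_dist) + 3 * sd(pc_dist)]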

Tip 4: Find hidden signal

The primary objective of dimension reduction is to compress data in order to uncover patterns πŸ•΅. After identifying patterns, it is often of interest to explain their appearance, i.e.Β to find the β€œhidden signal”. During data collection, we often record additional information that is not included in the dimension reduction calculations but can hold the key to understanding these patterns. The simplest way to use these external covariates is to include them in dimension reduction visualizations, with their values encoded as the color, shape, size, or even transparency of the corresponding points on the plot.

pca_cereals_num$manufacturer <- cereals$mfr
pca_cereals_num$type <- cereals$type

ggplot(pca_cereals_num, aes(
  x = PC1, y = PC2, label = label, col = manufacturer,
  shape = type
)) +
  geom_point(alpha = 0.75) +
  ggrepel::geom_text_repel(cex = 2.5)

Another way to find the β€œhidden signal” is via biplots. These plots allow you to explore trends in the samples and the features simultaneously: looking at both at the same time, you might discover groups of similar (nearby) observations that have high or low values for certain measured variables. Here you can observe that the shredded cereals have lower sodium content than the fun cereals.

library(ggfortify)

pca_cereals_num <- prcomp(cereals_num[, 1:12])
autoplot(pca_cereals_num,
  data = cereals_num, colour = "classification",
  loadings = TRUE, loadings.colour = "blue",
  loadings.label = TRUE, loadings.label.size = 3
)

Tip 5: Handle categorical input appropriately

So far we have exclusively worked with numerical data, but what if your data is categorical?

If your data consists only of categorical variables, you might want to give correspondence analysis (CA) or multiple correspondence analysis (MCA) a try. If you have mixed data consisting of both categorical and numerical variables, an appropriate method is categorical PCA with optimal scaling (CATPCA), which replaces the original levels of the categorical variables with category quantifications such that the variance in the new variables is maximized. An advantage of optimal scaling is that it does not assume a linear relationship between variables. In fact, the ability of CATPCA to handle nonlinear relations between variables is useful even when the input data are all numeric.
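As a small sketch of the purely categorical case, MCA can be applied to the categorical cereal variables, for instance with the MCA function from the FactoMineR package (which is not used elsewhere in this tutorial, so treat the choice of package and settings as an assumption for illustration):

library(FactoMineR)

# multiple correspondence analysis on the categorical cereal variables
cereals_cat <- data.frame(
  manufacturer = factor(cereals$mfr),
  type = factor(cereals$type),
  shelf = factor(cereals$shelf)
)
mca_cereals_cat <- MCA(cereals_cat, graph = FALSE)
head(mca_cereals_cat$ind$coord)   # coordinates of each cereal on the MCA dimensions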

A simpler way to handle a small number of categorical variables is to visualize them on the dimension reduction plot, as done in tip 4.


πŸ‘¨β€πŸ’» Advanced Exerice πŸ‘¨β€πŸ’»

Try applying CATPCA to the entirety of the cereals data using the princals function (from the Gifi package).



πŸ“š General Exercise πŸ“š

Investigate the wine data set which contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample. The data set can be found in the package rattle.data.

  1. Apply a simple PCA to the data and visualize the result. Remember to include the explained variance for each dimension and make a decision on how many dimensions are required.

  2. Overlay the plot/plots with the information regarding the type of wine. What conclusions can you draw?

  3. Apply a more complex dimension reduction and investigate how your results change.


References

  • Nguyen, Lan Huong, and Susan Holmes. β€œTen quick tips for effective dimensionality reduction.” PLOS Computational Biology 15.6 (2019): e1006907.
  • Li, Lexin. β€œDimension reduction for high-dimensional data.” Statistical methods in molecular biology. Humana Press, Totowa, NJ, 2010. 417-434.

Session info

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggfortify_0.4.7 dimRed_0.2.3    DRR_0.0.3       CVST_0.2-2     
##  [5] kernlab_0.9-27  phateR_0.4.1    ruta_1.1.0      uwot_0.1.3     
##  [9] Matrix_1.2-17   Rtsne_0.15      ica_1.0-2       GGally_1.4.0   
## [13] plspm_0.4.9     cowplot_1.0.0   ggrepel_0.8.1   ggplot2_3.2.1  
## [17] emo_0.0.0.9000  here_0.1       
## 
## loaded via a namespace (and not attached):
##  [1] tidyr_0.8.3        jsonlite_1.6       R.utils_2.9.0     
##  [4] RcppParallel_4.4.3 assertthat_0.2.1   amap_0.8-17       
##  [7] yaml_2.2.0         diagram_1.6.4      pillar_1.4.2      
## [10] backports_1.1.4    lattice_0.20-38    glue_1.3.1        
## [13] reticulate_1.13    digest_0.6.20      RColorBrewer_1.1-2
## [16] colorspace_1.4-1   fastICA_1.2-2      htmltools_0.3.6   
## [19] R.oo_1.22.0        plyr_1.8.4         turner_0.1.7      
## [22] pkgconfig_2.0.2    purrr_0.3.2        scales_1.0.0      
## [25] whisker_0.3-2      RSpectra_0.15-0    tester_0.1.7      
## [28] tibble_2.1.3       generics_0.0.2     withr_2.1.2       
## [31] umap_0.2.2.0       keras_2.2.4.1      coRanking_0.1.4   
## [34] lazyeval_0.2.2     magrittr_1.5       crayon_1.3.4      
## [37] evaluate_0.14      R.methodsS3_1.7.1  FNN_1.1.3         
## [40] tools_3.6.1        stringr_1.4.0      munsell_0.5.0     
## [43] compiler_3.6.1     rlang_0.4.0        grid_3.6.1        
## [46] base64enc_0.1-3    labeling_0.3       rmarkdown_1.14    
## [49] gtable_0.3.0       reshape_0.8.8      reshape2_1.4.3    
## [52] R6_2.4.0           gridExtra_2.3      tfruns_1.4        
## [55] lubridate_1.7.4    knitr_1.24         dplyr_0.8.3       
## [58] tensorflow_1.14.0  zeallot_0.1.0      rprojroot_1.3-2   
## [61] shape_1.4.4        stringi_1.4.3      Rcpp_1.0.2        
## [64] tidyselect_0.2.5   xfun_0.8

License

This tutorial is distributed under an MIT license.

Acknowledgement

I would like to acknowledge Lutz Freytag, who edited this document and provided the RUTA section.