This tutorial relies on the following R packages for plotting:
library(ggplot2)
library(ggrepel)
library(cowplot)
Dimension reduction is the process of reducing the number of variables (also referred to as features or, of course, dimensions) to a set of values of variables called principal variables. The main property of principal variables is that they preserve, to some extent, the structure and information carried by the original variables. Principal variables are ordered by importance, with the first variable preserving the most structure and subsequent variables preserving successively less. In other words, applying dimension reduction to a data set creates a data set with the same number of observations but fewer variables, whilst maintaining the general "flavor" of the data set.
There are many different dimension reduction techniques that all try to find a representation of the data in a lower dimensional space that retains as much information as possible. In general there are two classes of dimension reduction:
Feature Elimination is removing some variables completely if they are redundant with some other variable or if they are not providing any new information about the data set. The advantage of feature elimination is that it is simple to implement and makes our data set smaller, keeping only the variables we are interested in. The disadvantage is that we might lose some information from the variables we dropped.
Feature Extraction, on the other hand, is the formation of new variables from the old variables. The advantage of feature extraction is that it preserves most of the structure of your original data set. The disadvantage is that the newly created variables lose their interpretability.
In this tutorial we will look at some of the most commonly applied dimension reduction methods of the class of feature extraction. We will discuss general pitfalls and effective use of these dimension reduction techniques. Feature elimination will not be discussed in this tutorial, but good discussions/tutorials can be found here and here.
Dimension reduction via feature extraction (hereafter referred to as simply dimension reduction) is an extremely powerful statistical tool and thus is frequently applied. Some of its main uses include data visualization, removing noise and redundancy, and serving as a preprocessing step for further statistical analyses or machine learning.
However, to really understand the power of even simple dimension reduction techniques, let us consider the real data set cereals. The data set contains 77 cereals with 15 variables:
library(plspm)
data("cereals")
head(cereals)
## mfr type calories protein fat sodium fiber carbo
## 100%_Bran N C 70 4 1 130 10.0 5.0
## 100%_Natural_Bran Q C 120 3 5 15 2.0 8.0
## All-Bran K C 70 4 1 260 9.0 7.0
## All-Bran_with_Extra_Fiber K C 50 4 0 140 14.0 8.0
## Almond_Delight R C 110 2 2 200 1.0 14.0
## Apple_Cinnamon_Cheerios G C 110 2 2 180 1.5 10.5
## sugars potass vitamins shelf weight cups
## 100%_Bran 6 280 25 3 1 0.33
## 100%_Natural_Bran 8 135 0 3 1 1.00
## All-Bran 5 320 25 3 1 0.33
## All-Bran_with_Extra_Fiber 0 330 25 3 1 0.50
## Almond_Delight 8 -1 25 3 1 0.75
## Apple_Cinnamon_Cheerios 10 70 25 1 1 0.75
## rating
## 100%_Bran 68.40297
## 100%_Natural_Bran 33.98368
## All-Bran 59.42551
## All-Bran_with_Extra_Fiber 93.70491
## Almond_Delight 34.38484
## Apple_Cinnamon_Cheerios 29.50954
For the moment let's discount the categorical variables. Even then we have 12 variables.
numeric_vars <- c(
"calories", "protein", "fat", "sodium", "fiber", "carbo",
"sugars", "potass", "vitamins", "weight", "cups", "rating"
)
cereals_num <- cereals[, numeric_vars]
ncol(cereals_num)
## [1] 12
(If you have problems with the dataset you can also download it here.)
Now let's imagine you are interested in two particular subsets of cereals: the fun ones and the shredded wheat ones.
fun_cereals <- c(
"Lucky_Charms", "Count_Chocula", "Froot_Loops",
"Frosted_Flakes", "Cocoa_Puffs", "Cinnamon_Toast_Crunch"
)
shredded_wheat_cereals <- c(
"Shredded_Wheat", "Shredded_Wheat_'n'Bran",
"Shredded_Wheat_spoon_size"
)
cereals_num_sub <- match(
c(fun_cereals, shredded_wheat_cereals),
rownames(cereals_num)
)
cereals_num$classification <- "normal"
cereals_num$classification[match(fun_cereals, rownames(cereals_num))] <- "fun"
cereals_num$classification[match(
shredded_wheat_cereals,
rownames(cereals_num)
)] <- "shredded"
cereals_num$label <- ""
cereals_num$label[cereals_num_sub] <- rownames(cereals_num)[cereals_num_sub]
Imagine you want to answer questions such as: Are the fun cereals similar to each other? Are they different from the shredded wheat cereals? And which other cereals resemble them?
Using simple visualization techniques this is incredibly hard and often leads to unsatisfactory answers. To demonstrate this we will just consider fat, calories, sodium and sugars. We can see that in terms of the amount of sugars the shredded cereals are alike and so are the fun cereals, but the two groups are very different from each other. However, in terms of the amount of fat they are more similar. So are they similar or different? And how would our assessment change if we added the other 8 numerical variables in the cereals data set, which we have currently discounted?
library(GGally)
ggpairs(cereals_num,
columns = c("fat", "calories", "sodium", "sugars"),
ggplot2::aes(colour = classification)
)
Clearly, we need a way to visualize most of the complexity represented in all dimensions in just two or three dimensions. Fortunately, that is exactly what dimension reduction does. Here, we apply a simple dimension reduction technique called principal component analysis (PCA) to all 12 numerical variables in the data set. This does indeed confirm that the shredded wheat cereals are very similar in the two most variable directions of the space created by the data (i.e. the directions where the data is most spread out). They are also very different from the fun cereals.
However, also note that the newly created variables have lost their interpretation. They each represent a combination of the 12 numerical variables. So we cannot know for what reasons (i.e. which variables precisely) the fun cereals are similar. We just know that, considering all 12 numerical variables, the fun cereals are similar.
prcomp_cereals_num <- prcomp(cereals_num[, 1:12])
pca_cereals_num <- data.frame(
PC1 = prcomp_cereals_num$x[, 1],
PC2 = prcomp_cereals_num$x[, 2],
label = cereals_num$label,
classification = cereals_num$classification
)
ggplot(pca_cereals_num, aes(x = PC1, y = PC2, label = label, col = classification)) +
geom_point() +
ggrepel::geom_text_repel(cex = 2.5)
Using the principal component representation, we can now easily find cereals that are similar to some of the fun cereals. These will have values of the principal components similar to those of the fun cereals. According to the first two principal components, other fun cereals could be Honeycomb, Cap'n'Crunch or Honey Graham Ohs. Notice that even the packaging of these cereals is much more similar to the packaging of the fun cereals than to that of the shredded cereals.
cereals_num_int <- which(pca_cereals_num$PC1 > 0 &
pca_cereals_num$PC1 < 75 &
pca_cereals_num$PC2 > 25)
pca_cereals_num$label2 <- ""
pca_cereals_num$label2[cereals_num_int] <- rownames(cereals_num)[cereals_num_int]
pca_cereals_num$label2[cereals_num_sub] <- ""
ggplot(pca_cereals_num, aes(x = PC1, y = PC2, label = label2, col = classification)) +
geom_point() +
ggrepel::geom_text_repel(cex = 2.5)
A quick word about the output of the function prcomp that calculates the principal component analysis. The output is an object that contains multiple results, but most notably:
$sdev: The standard deviation of every principal component.
$rotation: The relationship between the initial variables and the principal components. This may be of interest to you for feature selection purposes, etc.
$x: The values of the principal components for every sample. This will be the most interesting to you, as it can be used for visualization purposes, etc.
For more information on this, run ?prcomp. Also note that most other implementations of dimension reduction techniques in R output multiple results. In order to understand the output, use a combination of search and the R function str, as demonstrated below.
str(prcomp_cereals_num)
## List of 5
## $ sdev : num [1:12] 84.83 71.37 22.38 18.87 8.63 ...
## $ rotation: num [1:12, 1:12] 0.076192 -0.001398 -0.000117 0.983022 -0.004605 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:12] "calories" "protein" "fat" "sodium" ...
## .. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:12] 106.88 2.55 1.01 159.68 2.15 ...
## ..- attr(*, "names")= chr [1:12] "calories" "protein" "fat" "sodium" ...
## $ scale : logi FALSE
## $ x : num [1:77, 1:12] -54.2 -147.7 70 -53.2 50.6 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:77] "100%_Bran" "100%_Natural_Bran" "All-Bran" "All-Bran_with_Extra_Fiber" ...
## .. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
We will now go over some of the most common dimension reduction techniques. I will try to point out their common use cases, their strengths and their weaknesses. This is by no means an exhaustive list. This is a very active area of research, and its close ties with machine learning have meant an explosion of new methods.
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal rotation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. PCA is mostly used as a tool in exploratory data analysis and for making predictive models. It is often a vital step in the determination of more complex dimension reductions.
We have already applied PCA to the data in the earlier example.
Strengths
- can be efficiently applied to large data sets
- has many extensions that deal with special situations, such as sparse PCA for data with many missing values and kernel PCA for more robust, nonlinear solutions
- preserves global structures
Weaknesses
- may suffer from scale problems, i.e. when one variable dominates the variance simply because it is on a larger scale (for example if one variable is measured in mm instead of km); some of these scale problems can be dealt with simply by centering and scaling (see the sketch after this list)
- suffers from crowding in the presence of large numbers of observations, i.e. points do not form clusters but instead occupy the entire plotting space
- is susceptible to big outliers
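As a minimal sketch of how to deal with the scale problem (centering and scaling are chosen here purely for illustration, since the cereal variables are on quite different scales), PCA can be re-run on standardized variables:
# Re-run PCA on centered and scaled variables so that no single variable
# dominates simply because of its units
prcomp_scaled <- prcomp(cereals_num[, 1:12], center = TRUE, scale. = TRUE)
summary(prcomp_scaled)$importance[, 1:4]  # variance is typically spread more evenly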
Multidimensional scaling (MDS) is an algorithm that, given a distance matrix with the distances between each pair of objects in a set and a chosen number of dimensions N, places each object into N-dimensional space such that the between-object distances are preserved as well as possible. MDS is most often used as a visualization tool. It is best suited to problems that involve real distances, e.g. city distances or geometry. It yields the same result as PCA when Euclidean distances are used.
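Since prcomp centers the data by default, we can quickly sanity-check this equivalence before moving to the Canberra-distance example below. This is only a sketch; because the sign of each axis is arbitrary, the correlations are merely expected to be close to plus or minus one.
# Classical MDS on Euclidean distances recovers the PCA scores computed
# earlier (up to a sign flip per axis)
mds_euclid <- cmdscale(dist(cereals_num[, 1:12]), k = 2)
cor(mds_euclid[, 1], prcomp_cereals_num$x[, 1])
cor(mds_euclid[, 2], prcomp_cereals_num$x[, 2])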
dist_cereals_num <- dist(cereals_num[, 1:12], method = "canberra")
mds_cereals_num <- cmdscale(dist_cereals_num, eig = TRUE, k = 2)
mds_cereals_num <- data.frame(
MDS1 = mds_cereals_num$points[, 1],
MDS2 = mds_cereals_num$points[, 2],
label = cereals_num$label,
classification = cereals_num$classification
)
ggplot(mds_cereals_num, aes(
x = MDS1, y = MDS2,
label = label, col = classification
)) +
geom_point() +
ggrepel::geom_text_repel(cex = 2.5)
Let's again check what other cereals are similar to the fun cereals. We can see that all three cereals - Honeycomb, Cap'n'Crunch and Honey Graham Ohs - are classified as similar to the fun ones. This points to the inherent similarity of MDS and PCA in simple cases.
cereals_num_int <- which(mds_cereals_num$MDS1 < -0.5 &
mds_cereals_num$MDS2 < 0 &
mds_cereals_num$MDS2 > -2.5)
mds_cereals_num$label2 <- ""
mds_cereals_num$label2[cereals_num_int] <- rownames(cereals_num)[cereals_num_int]
mds_cereals_num$label2[cereals_num_sub] <- ""
ggplot(mds_cereals_num, aes(
x = MDS1, y = MDS2,
label = label2, col = classification
)) +
geom_point() +
ggrepel::geom_text_repel(cex = 2.5)
Strengths
- flexible, as any distance metric can be used
- preserves global structures
Weaknesses
- computationally demanding and extremely inefficient for large numbers of observations
- suffers from crowding in the presence of large numbers of observations
Independent component analysis (ICA) attempts to decompose a multivariate signal into statistically independent non-Gaussian signals. ICA is important to blind signal separation and has many practical applications, the most common of which involve sound signals. A simple application of ICA is the "cocktail party problem", where the underlying speech signals are separated from sample data consisting of people talking simultaneously in a room. Usually the problem is simplified by assuming no time delays or echoes. Note that a filtered and delayed signal is a copy of a dependent component, and thus the statistical independence assumption is not violated.
library(ica)
ica_cereals_num <- icafast(cereals_num[, 1:12], 2,
center = TRUE, maxit = 100,
tol = 1e-6
)
ica_cereals_num <- data.frame(
ICA1 = ica_cereals_num$Y[, 1],
ICA2 = ica_cereals_num$Y[, 2],
label = cereals_num$label,
classification = cereals_num$classification
)
ggplot(ica_cereals_num, aes(
x = ICA1, y = ICA2,
label = label, col = classification
)) +
geom_point() +
ggrepel::geom_text_repel(cex = 2.5)
Once again we check what other cereals are similar to the fun cereals. Like before, the selection of cereals similar to the fun cereals includes the three previously identified.
It should be noted that the application of ICA in this scenario is likely to violate one of the key assumptions. It is extremely plausible that at least one of the 12 variables has a normal distribution (a rough check is sketched after the plot below).
cereals_num_int <- which(ica_cereals_num$ICA1 > 0.25 &
ica_cereals_num$ICA1 < 1 &
ica_cereals_num$ICA2 < 2 &
ica_cereals_num$ICA2 > 0)
ica_cereals_num$label2 <- ""
ica_cereals_num$label2[cereals_num_int] <- rownames(cereals_num)[cereals_num_int]
ica_cereals_num$label2[cereals_num_sub] <- ""
ggplot(ica_cereals_num, aes(
x = ICA1, y = ICA2,
label = label2, col = classification
)) +
geom_point() +
ggrepel::geom_text_repel(cex = 2.5)
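As a rough, purely illustrative check of that assumption, one could run a normality test on each variable. Here the base R Shapiro-Wilk test is used; large p-values merely flag variables that are compatible with a normal distribution.
# Shapiro-Wilk p-value for each of the 12 numeric variables;
# this is only a crude screen for (non-)Gaussianity
round(sapply(cereals_num[, 1:12], function(v) shapiro.test(v)$p.value), 3)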
Strengths
- separates multiple independent sources of signal
- can be efficiently applied to large data sets
- preserves global structures
Weaknesses
- has very specific assumptions, especially that none of the independent sources are normally distributed
- suffers from crowding in the presence of large numbers of observations
T-distributed Stochastic Neighbor Embedding (t-SNE) is a nonlinear dimension reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. The main application of t-SNE is visualization of large amounts of data, as it can recover well-separated clusters.
library(Rtsne)
tsne_cereals_num <- Rtsne(cereals_num[, 1:12],
pca = FALSE, perplexity = 10,
theta = 0.0
)
tsne_cereals_num <- data.frame(
TSNE1 = tsne_cereals_num$Y[, 1],
TSNE2 = tsne_cereals_num$Y[, 2],
label = cereals_num$label,
classification = cereals_num$classification
)
ggplot(tsne_cereals_num, aes(
x = TSNE1, y = TSNE2,
label = label, col = classification
)) +
geom_point() +
ggrepel::geom_text_repel(cex = 2.5)
Once again we check what other cereals are similar to the fun cereals. Note that this time we get fewer cereals because the clustering is more pronounced than in the earlier dimension reductions. Even so, all three previously identified cereals appear once again.
cereals_num_int <- which(tsne_cereals_num$TSNE2 < -10 &
tsne_cereals_num$TSNE2 > -45 &
tsne_cereals_num$TSNE1 < 25 &
tsne_cereals_num$TSNE1 > 10)
tsne_cereals_num$label2 <- ""
tsne_cereals_num$label2[cereals_num_int] <- rownames(cereals_num)[cereals_num_int]
tsne_cereals_num$label2[cereals_num_sub] <- ""
ggplot(tsne_cereals_num, aes(
x = TSNE1, y = TSNE2,
label = label2, col = classification
)) +
geom_point() +
ggrepel::geom_text_repel(cex = 2.5)
Strengths
- recovers well-separated clusters
- preserves local neighborhoods
- no crowding of observations
Weaknesses
- stochastic, so unless you set a seed your results are not reproducible
- computationally expensive for extremely large numbers of observations
- has one parameter (perplexity) that can be very influential (see Pitfalls of dimension reduction below)
- can produce artifacts (see Pitfalls of dimension reduction below)
- no preservation of global structures
Uniform Manifold Approximation and Projection (UMAP) produces a low dimensional representation of high dimensional data that preserves relevant structure. It searches for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure to the original high dimensional data. UMAP is similar to t-SNE but preserves more global structure in the data. Its main use is visualization of a large number of observations, as it recovers well-separated clusters.
library(uwot)
umap_cereals_num <- umap(cereals_num[, 1:12],
n_neighbors = 15,
min_dist = 1, spread = 5
)
umap_cereals_num <- data.frame(
UMAP1 = umap_cereals_num[, 1],
UMAP2 = umap_cereals_num[, 2],
label = cereals_num$label,
classification = cereals_num$classification
)
ggplot(umap_cereals_num, aes(
x = UMAP1, y = UMAP2,
label = label, col = classification
)) +
geom_point() +
ggrepel::geom_text_repel(cex = 2.5)
Once again we check what other cereals are similar to the fun cereals. Note that the fun cereals in this representation are much more spread out. Thus we have a harder time identifying cereals similar to the fun cereals. UMAP works best when we have a large number of variables. Clearly, that is not the case here, and thus we may want to reconsider our use of UMAP in this situation.
cereals_num_int <- which(umap_cereals_num$UMAP1 < 2 &
umap_cereals_num$UMAP1 > -10 &
umap_cereals_num$UMAP2 > 2 &
umap_cereals_num$UMAP2 < 12)
umap_cereals_num$label2 <- ""
umap_cereals_num$label2[cereals_num_int] <- rownames(cereals_num)[cereals_num_int]
umap_cereals_num$label2[cereals_num_sub] <- ""
ggplot(umap_cereals_num, aes(
x = UMAP1, y = UMAP2,
label = label2, col = classification
)) +
geom_point() +
ggrepel::geom_text_repel(cex = 2.5)
Strengths
- recovers well-separated clusters
- computationally efficient
- preserves global distances
Weaknesses
- somewhat stochastic
- parameters to optimize (a sketch of varying n_neighbors follows this list)
- can produce artifacts
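As with t-SNE's perplexity (discussed under the pitfalls below), it is worth checking how sensitive the UMAP embedding is to its parameters. A minimal sketch, with arbitrarily chosen n_neighbors values, could look as follows; only patterns that persist across settings should be trusted.
# Re-run UMAP for a few n_neighbors settings and compare the embeddings
n_neighbors_vals <- c(5, 15, 30)
umap_runs <- lapply(n_neighbors_vals, function(nn)
umap(cereals_num[, 1:12], n_neighbors = nn, min_dist = 1, spread = 5))
umap_plots <- lapply(umap_runs, function(m) {
df <- data.frame(
UMAP1 = m[, 1], UMAP2 = m[, 2],
classification = cereals_num$classification
)
ggplot(df, aes(x = UMAP1, y = UMAP2, col = classification)) +
geom_point() + guides(col = FALSE)
})
plot_grid(plotlist = umap_plots, labels = n_neighbors_vals, label_size = 12)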
Autoencoders are neural networks that are trained to reconstruct their original inputs. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimension reduction. This is achieved by designing a neural network with a bottleneck (to avoid overfitting) and training the network to ignore signal "noise". The idea of autoencoders has been popular in the field of neural networks for decades, but they are most commonly used for denoising pictures or sound. As with any neural network, there is a lot of flexibility in how autoencoders can be constructed, such as the number of hidden layers and the training strategy (e.g. variational autoencoders). Hence there are many different implementations of autoencoders for the purpose of dimension reduction. Here we will look at RUTA; however, if you want to learn how to construct your own autoencoder in R, please refer to this.
library(ruta)
x <- as.matrix(cereals_num[, 1:12])
ruta_cereals_num <- autoencode(scale(x), 2, type = "robust",
activation = "tanh")
ruta_cereals_num <- data.frame(
RUTA1 = ruta_cereals_num[, 1],
RUTA2 = ruta_cereals_num[, 2],
label = cereals_num$label,
classification = cereals_num$classification
)
ggplot(ruta_cereals_num, aes(x = RUTA1, y = RUTA2,
label = label, col = classification)) +
geom_point() +
ggrepel::geom_text_repel(cex = 2.5)
We do not check the fun cereals for RUTA, since its results appear unstable. Like UMAP, RUTA does best when there is a large number of variables. Furthermore, to work effectively it requires a large number of observations.
Strengths
- preserves global data structures
- new points can be added without recomputing
Weaknesses
- large number of observations required for effective training
- computationally more expensive
- stochastic
- parameter selection, especially the choice of neural network architecture and training strategy
Potential of heat-diffusion for affinity-based trajectory embedding (PHATE) is a non-linear embedding algorithm, like UMAP and autoencoders. PHATE simultaneously denoises the data and emphasizes the continuous nature of any underlying progressions and trajectories. PHATE naturally uncovers branching progression structures in the data in very low dimensions to enable visualization. PHATE is typically used for visualizations, but can also be used to recover trajectories.
library(phateR)
x <- as.matrix(cereals_num[, 1:12])
phate_cereals_num <- phate(scale(x))
phate_cereals_num <- data.frame(
PHATE1 = phate_cereals_num$embedding[, 1],
PHATE2 = phate_cereals_num$embedding[, 2],
label = cereals_num$label,
classification = cereals_num$classification
)
ggplot(phate_cereals_num, aes(x = PHATE1, y = PHATE2,
label = label, col = classification)) +
geom_point() +
ggrepel::geom_text_repel(cex = 2.5)
Once again we check what other cereals are similar to the fun cereals. Note that this time we see an extreme separation between the fun and the shredded cereals. We again identify all previously identified cereals.
cereals_num_int <- which(phate_cereals_num$PHATE1 > 0)
phate_cereals_num$label2 <- ""
phate_cereals_num$label2[cereals_num_int] <- rownames(cereals_num)[cereals_num_int]
phate_cereals_num$label2[cereals_num_sub] <- ""
ggplot(phate_cereals_num, aes(
x = PHATE1, y = PHATE2,
label = label2, col = classification
)) +
geom_point() +
ggrepel::geom_text_repel(cex = 2.5)
Strengths
- preserves global structures
- computationally fast
Weaknesses
- large number of observations required
- parameter selection
The difficulty in applying dimension reduction is that each dimension reduction technique is designed to maintain certain aspects of the original data and therefore may be appropriate for one task and inappropriate for another. For many applications it can be difficult to objectively find the right method or parameterization for the dimension reduction task. To find the right dimension reduction technique for your data, consider the nature and resolution of your data. Are you interested in (or only able to capture) small-scale relationships, or are you focusing on long-range relationships? In the first case you may want to consider only dimension reduction techniques that preserve local structure, and in the second case you should focus on techniques that preserve global structure.
To understand whether your chosen algorithms distort distances you can make use of the dimRed package. In particular, this package provides a useful visualization that shows the "correctness" of the dimension reduction (R_NX) for different numbers of nearest neighbors (K). You can use this to see whether distances get distorted when you vary the number of neighbors considered. Here you can see that for an example as simple as the cereals data set, simple approaches like PCA and ICA are much better at preserving distances than UMAP. Note that this package does not necessarily provide every dimension reduction technique, but it is excellent for developing an intuition for your data.
library(dimRed)
embed_methods <- c("PCA", "FastICA", "UMAP")
data_emb <- lapply(embed_methods, function(x)
embed(cereals_num[, 1:12], x))
names(data_emb) <- embed_methods
plot_R_NX(data_emb)
Dimension reduction techniques can be used to create representations of data that are significantly smaller and thus use less memory. Simply put, by taking advantage of multicollinearity we can remove mostly redundant information. However, once removed this information cannot be restored. Also, your new data will have variables that represent combinations of your original variables. So use with care.
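As a rough illustration (exact sizes will vary between systems), compare the memory footprint of the full numeric data with that of the first two principal components computed earlier:
# The two-column PCA representation is considerably smaller than the
# original 12-column matrix
object.size(as.matrix(cereals_num[, 1:12]))
object.size(prcomp_cereals_num$x[, 1:2])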
Although extremely useful, dimension reduction can sometimes be mysterious or misleading. In order to learn how to use it effectively, it is good to develop an intuition for it and to be aware of its pitfalls.
You may have noticed that some of the dimension reduction techniques, especially the more complex ones, require parameters to be set. These parameters control important aspects of the dimension reduction and can really influence your resulting perception of the data. In order to demonstrate this, we will look at t-SNE and its tuneable parameter perplexity. Perplexity controls how to balance the attention between local and global aspects of your data. In particular, notice that when the perplexity is low, arbitrary small clusters start forming. Also notice that with changing perplexity the distance between the shredded and fun cereals changes. This points to the problem that distances, especially long distances, in t-SNE plots are meaningless and should not be interpreted.
perplexity <- seq(2, 20, 4)
tsne_perp_cereals_num <- lapply(perplexity, function(x)
Rtsne(cereals_num[, 1:12], pca = FALSE, perplexity = x, theta = 0.0))
tsne_perp_cereals_num <- lapply(tsne_perp_cereals_num, function(x)
data.frame(
TSNE1 = x$Y[, 1],
TSNE2 = x$Y[, 2],
label = cereals_num$label,
classification = cereals_num$classification
))
all_plots <- lapply(tsne_perp_cereals_num, function(x) {
pp <- ggplot(x, aes(x = TSNE1, y = TSNE2, col = classification)) +
geom_point() + guides(col = FALSE)
pp
})
plot_grid(plotlist = all_plots, labels = perplexity, label_size = 12)
Note that other dimension reduction techniques, such as UMAP, also have parameters that have a strong influence on the results.
Advanced Exercise
Use the package dimRed in order to plot the distance distortion experienced for a range of perplexity values when applying t-SNE to the cereals data set.
Some of the more complex dimension reduction techniques, and in particular t-SNE, have random components. Thus even when you run them twice with exactly the same parameters you will obtain slightly different answers. Because of this inherent randomness it is also important to run dimension reduction techniques such as t-SNE multiple times in order to assess the robustness of the results.
tsne_ran_cereals_num <- lapply(1:2, function(x)
Rtsne(cereals_num[, 1:12], pca = FALSE, perplexity = 10, theta = 0.0))
tsne_ran_cereals_num <- lapply(tsne_ran_cereals_num, function(x)
data.frame(
TSNE1 = x$Y[, 1],
TSNE2 = x$Y[, 2],
label = cereals_num$label,
classification = cereals_num$classification
))
all_plots <- lapply(tsne_ran_cereals_num, function(x) {
pp <- ggplot(x, aes(x = TSNE1, y = TSNE2, col = classification)) +
geom_point() + guides(col = FALSE)
pp
})
plot_grid(plotlist = all_plots)
Unfortunately, the inherent randomness of some dimension reduction techniques makes them a real headache for reproducible analysis. To ensure that you obtain the same results every time, you will need to set a seed using the function set.seed.
tsne_ran_cereals_num <- lapply(1:2, function(x) {
set.seed(1)
Rtsne(cereals_num[, 1:12], pca = FALSE, perplexity = 10, theta = 0.0)
})
tsne_ran_cereals_num <- lapply(tsne_ran_cereals_num, function(x)
data.frame(
TSNE1 = x$Y[, 1],
TSNE2 = x$Y[, 2],
label = cereals_num$label,
classification = cereals_num$classification
))
all_plots <- lapply(tsne_ran_cereals_num, function(x) {
pp <- ggplot(x, aes(x = TSNE1, y = TSNE2, col = classification)) +
geom_point() + guides(col = FALSE)
pp
})
plot_grid(plotlist = all_plots)
Another common pitfall of dimension reduction is that random noise might look like a pattern. Recognizing noise when you see it is a critical skill, but it takes time to build up the right intuitions. A tricky thing about more complex dimension reductions is that they throw a lot of existing intuition out the window. We will demonstrate this by simulating 100 random Gaussian observations in 100 dimensions.
set.seed(10)
sim_data <- sapply(1:100, function(x) rnorm(100, 1, 1))
tsne_sim_data <- Rtsne(sim_data, pca = FALSE, perplexity = 2, theta = 0.0)
tsne_sim_data <- data.frame(
TSNE1 = tsne_sim_data$Y[, 1],
TSNE2 = tsne_sim_data$Y[, 2]
)
pca_sim_data <- prcomp(sim_data)
pca_sim_data <- data.frame(
PCA1 = pca_sim_data$x[, 1],
PCA2 = pca_sim_data$x[, 2]
)
p1 <- ggplot(tsne_sim_data, aes(x = TSNE1, y = TSNE2)) +
geom_point() + ggtitle("TNSE")
p2 <- ggplot(pca_sim_data, aes(x = PCA1, y = PCA2)) +
geom_point() + ggtitle("PCA")
plot_grid(p1, p2)
Advanced Exercise
Investigate this phenomenon with other types of non-Gaussian random noise and for other complex dimension reductions.
In this last section I will present some tips of how to work effectively with dimension reduction techniques.
We have already touched upon the need to scale the data in order to ensure that each variable contributes equally. However, depending on the nature of the data other data transformations may be required. For example, if changes in your data are multiplicative, e.g., your variables measure percent increase/decrease, you should consider using a log-transform before applying dimension reductions such as PCA.
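A minimal sketch of this idea on simulated, strictly positive multiplicative data (the cereals data are not used here because some variables contain -1 placeholders, which cannot be log-transformed):
# Simulate log-normal (multiplicative) data, log-transform, then apply PCA;
# on the log scale fold changes, not absolute differences, drive the components
set.seed(1)
mult_data <- exp(matrix(rnorm(100 * 5), ncol = 5))
prcomp_log <- prcomp(log(mult_data), center = TRUE, scale. = TRUE)
summary(prcomp_log)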
So far we have really only looked at the first two dimensions obtained after dimension reduction. However, this was merely a choice. In fact, dimensions beyond the first two often contain vital information. One way to check this is to look at the variance explained by each principal component. In fact, it is good practice when using PCA to state the explained variance for every principal component plotted. That being said, it makes little sense to present dimension reduction plots with arbitrarily chosen dimensions, like dimension 2 versus dimension 5, if you are not also willing to present all preceding plots or have explanations for why these are not of interest.
prcomp_cereals_num <- prcomp(cereals_num[, 1:12])
var_exp <- round(prcomp_cereals_num$sdev[1:4]^2 / sum(prcomp_cereals_num$sdev^2) * 100, 2)  # explained variance uses squared standard deviations
pca_cereals_num <- data.frame(
PC1 = prcomp_cereals_num$x[, 1],
PC2 = prcomp_cereals_num$x[, 2],
PC3 = prcomp_cereals_num$x[, 3],
PC4 = prcomp_cereals_num$x[, 4],
label = cereals_num$label,
classification = cereals_num$classification
)
comb <- list(
first = c(1, 2), second = c(1, 3), third = c(1, 4), fourth = c(2, 3),
fifth = c(2, 4), sixth = c(3, 4)
)
gg <- lapply(comb, function(x) {
ggplot(pca_cereals_num, aes_string(
x = paste0("PC", x[1]),
y = paste0("PC", x[2]), col = "classification"
)) +
geom_point() + guides(col = FALSE) + ylab(paste0(
"PC", x[2], ":(",
var_exp[x[2]], "%)"
)) + xlab(paste0("PC", x[1], ":(", var_exp[x[1]], "%)"))
})
plot_grid(plotlist = gg)
So how do you choose a suitable number of dimensions?
Obviously this depends on your data and will require you to look at your data. For dimension reduction techniques based on spectral decomposition, such as PCA, you can use the distribution of the eigenvalues and find the point where the explained variance drops off sharply. Typically this is done via a scree plot. Here it turns out that the explained variance drops off after 2 dimensions.
var_exp <- data.frame(
"Variance explained" =
prcomp_cereals_num$sdev^2 / sum(prcomp_cereals_num$sdev^2) * 100,
"Number of dimension" = 1:length(prcomp_cereals_num$sdev),
check.names = FALSE
)
ggplot(var_exp, aes(x = `Number of dimension`, y = `Variance explained`)) +
geom_col() + geom_line()
Note that this step is particularly important when dimension reduction is applied as a preprocessing step preceding statistical analyses or machine learning tasks (e.g., clustering). However, even when your primary goal is data visualization, in which only two or three axes can be displayed at a time, you still need to select a sufficient number of new features to generate.
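One common heuristic (the 90% threshold here is arbitrary and purely illustrative) is to keep enough components to reach a chosen fraction of cumulative explained variance:
# Number of principal components needed to explain at least 90% of the variance
prop_var <- prcomp_cereals_num$sdev^2 / sum(prcomp_cereals_num$sdev^2)
which(cumsum(prop_var) >= 0.9)[1]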
We have already seen that parameters can seriously affect the results of dimension reduction techniques. If you are applying a dimension reduction technique with parameters, it is highly advisable to check how results change when parameters are varied. Only patterns that are stable across a range of parameters can be trusted.
We have also mentioned that outliers can have a big effect on linear dimension reduction techniques, such as PCA. You should be mindful of situations where all of the points in a plot are located close to the origin (the center of the plot), with only one or a few points lying very far away, as this points towards the dimension reduction being dominated by outlier points. To be able to continue you will need to reconsider your quality control metrics and possibly remove these points.
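A quick, admittedly crude way to spot such points (the cut-off of 3 standardized units is arbitrary and only meant as an illustration) is to look for observations with extreme scores on the first two principal components:
# Flag cereals whose first two PC scores lie far from the bulk of the data
pc_scores <- scale(prcomp_cereals_num$x[, 1:2])
pc_dist <- sqrt(rowSums(pc_scores^2))
rownames(cereals_num)[pc_dist > 3]  # candidate outliers to inspect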
So far we have exclusively worked with numerical data, but what if your data is categorical?
If your data consists only of categorical variables, you might want to give correspondence analysis (CA) or multiple correspondence analysis (MCA) a try. If you have mixed data that consists of both categorical and numerical variables, an appropriate method is optimal scaling categorical PCA (CATPCA), which replaces the original levels of the categorical variables with category quantifications such that the variance in the new variables is maximized. An advantage of optimal scaling is that it does not assume a linear relationship between variables. In fact, the ability of CATPCA to handle nonlinear relations between variables is important even when the input data are all numeric.
A simpler way to handle a small number of categorical variables is to visualize them on the dimension reduction plot, as done in tip 4.
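For the purely categorical case, a minimal sketch (assuming the FactoMineR package is installed; any MCA implementation would do) applied to the three categorical columns of cereals could look like this:
# Multiple correspondence analysis on the categorical cereal variables
library(FactoMineR)
cereals_cat <- data.frame(
mfr = factor(cereals$mfr),
type = factor(cereals$type),
shelf = factor(cereals$shelf)
)
mca_cereals <- MCA(cereals_cat, graph = FALSE)
head(mca_cereals$ind$coord[, 1:2])  # sample coordinates on the first two dimensions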
Advanced Exercise
Try applying CATPCA to the entirety of the cereals data using the princals function (for example from the Gifi package).
General Exercise
Investigate the wine data set, which contains the results of a chemical analysis of wines grown in a specific area of Italy. Three types of wine are represented in the 178 samples, with the results of 13 chemical analyses recorded for each sample. The data set can be found in the package rattle.data.
Apply a simple PCA to the data and visualize the result. Remember to include the explained variance for each dimension and make a decision on how many dimensions are required.
Overlay the plot/plots with the information regarding the type of wine. What conclusions can you draw?
Apply a more complex dimension reduction and investigate how your results change.
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggfortify_0.4.7 dimRed_0.2.3 DRR_0.0.3 CVST_0.2-2
## [5] kernlab_0.9-27 phateR_0.4.1 ruta_1.1.0 uwot_0.1.3
## [9] Matrix_1.2-17 Rtsne_0.15 ica_1.0-2 GGally_1.4.0
## [13] plspm_0.4.9 cowplot_1.0.0 ggrepel_0.8.1 ggplot2_3.2.1
## [17] emo_0.0.0.9000 here_0.1
##
## loaded via a namespace (and not attached):
## [1] tidyr_0.8.3 jsonlite_1.6 R.utils_2.9.0
## [4] RcppParallel_4.4.3 assertthat_0.2.1 amap_0.8-17
## [7] yaml_2.2.0 diagram_1.6.4 pillar_1.4.2
## [10] backports_1.1.4 lattice_0.20-38 glue_1.3.1
## [13] reticulate_1.13 digest_0.6.20 RColorBrewer_1.1-2
## [16] colorspace_1.4-1 fastICA_1.2-2 htmltools_0.3.6
## [19] R.oo_1.22.0 plyr_1.8.4 turner_0.1.7
## [22] pkgconfig_2.0.2 purrr_0.3.2 scales_1.0.0
## [25] whisker_0.3-2 RSpectra_0.15-0 tester_0.1.7
## [28] tibble_2.1.3 generics_0.0.2 withr_2.1.2
## [31] umap_0.2.2.0 keras_2.2.4.1 coRanking_0.1.4
## [34] lazyeval_0.2.2 magrittr_1.5 crayon_1.3.4
## [37] evaluate_0.14 R.methodsS3_1.7.1 FNN_1.1.3
## [40] tools_3.6.1 stringr_1.4.0 munsell_0.5.0
## [43] compiler_3.6.1 rlang_0.4.0 grid_3.6.1
## [46] base64enc_0.1-3 labeling_0.3 rmarkdown_1.14
## [49] gtable_0.3.0 reshape_0.8.8 reshape2_1.4.3
## [52] R6_2.4.0 gridExtra_2.3 tfruns_1.4
## [55] lubridate_1.7.4 knitr_1.24 dplyr_0.8.3
## [58] tensorflow_1.14.0 zeallot_0.1.0 rprojroot_1.3-2
## [61] shape_1.4.4 stringi_1.4.3 Rcpp_1.0.2
## [64] tidyselect_0.2.5 xfun_0.8
This tutorial is distributed under an MIT license.
I would like to acknowledge Lutz Freytag, who edited this document and provided the RUTA section.