Introduction

Principal component analysis (PCA) has become very popular in domains such as finance, biology, and the social sciences. However, PCA results can be sensitive to changes in the input data, so it is important to assess the stability and robustness of the resulting principal components. Sensitivity analysis identifies the factors that influence PCA results most strongly and helps researchers judge how reliable their conclusions are.

Factors to consider when applying sensitivity analysis

Data preprocessing:

Data preprocessing methods like centering, scaling, or transformation can significantly affect the PCA results. It is essential to perform sensitivity analysis to assess how changes in preprocessing can influence the principal components.

# Load required packages
library(tidyverse)
library(FactoMineR)
library(factoextra)

# Load example data
data(iris)
X <- iris[, -5]

# Apply PCA with and without scaling
pca_without_scaling <- PCA(X, scale.unit = FALSE)

pca_with_scaling <- PCA(X, scale.unit = TRUE)

# Visual comparison
fviz_pca_ind(pca_without_scaling, geom = "point", col.ind = "cos2") + ggtitle("PCA without scaling")

fviz_pca_ind(pca_with_scaling, geom = "point", col.ind = "cos2") + ggtitle("PCA with scaling")
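The plots above show the effect of scaling qualitatively. A numeric summary can make the comparison easier to report. The following is a minimal sketch, not the method used in this paper: it assumes base R's prcomp() in place of FactoMineR's PCA(), and the helper name prop_var is hypothetical.

```r
# Sketch: quantify how scaling changes the variance captured by each component.
# Assumption: base R prcomp() is used here instead of FactoMineR::PCA().
X <- iris[, -5]

pca_raw    <- prcomp(X, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(X, center = TRUE, scale. = TRUE)

# Hypothetical helper: proportion of total variance explained per component
prop_var <- function(fit) fit$sdev^2 / sum(fit$sdev^2)

comparison <- data.frame(
  component = paste0("PC", seq_along(pca_raw$sdev)),
  unscaled  = round(prop_var(pca_raw), 3),
  scaled    = round(prop_var(pca_scaled), 3)
)
print(comparison)

# Loadings of the first component under each preprocessing choice
print(round(cbind(unscaled = pca_raw$rotation[, 1],
                  scaled   = pca_scaled$rotation[, 1]), 3))
```

A large gap between the two columns suggests that variables with large raw variances dominate the unscaled solution.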

Outliers:

Outliers can distort the PCA results by exerting a disproportionate influence on the principal components. Sensitivity analysis should include the assessment of the impact of outliers on the results.

# Load required packages
library(tidyverse)
library(factoextra)

# Load example data
data(iris)
X <- iris[, -5]

# Apply PCA with scaling
pca_with_scaling <- prcomp(X, scale. = TRUE)

# Calculate scores for the first two principal components
scores <- pca_with_scaling$x[, 1:2]

# Identify outliers using Mahalanobis distance
mahalanobis_dist <- mahalanobis(scores, colMeans(scores), cov(scores))
outliers <- which(mahalanobis_dist > qchisq(0.975, df = 2))

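# Compare PCA with and without outliers
# setdiff() keeps every row when no outliers are flagged;
# X[-outliers, ] would drop all rows in that case
rows_to_keep <- setdiff(seq_len(nrow(X)), outliers)
pca_without_outliers <- prcomp(X[rows_to_keep, ], scale. = TRUE)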

# Visual comparison
fviz_pca_ind(pca_with_scaling, geom = "point", col.ind = "cos2") + ggtitle("PCA with outliers")

fviz_pca_ind(pca_without_outliers, geom = "point", col.ind = "cos2") + ggtitle("PCA without outliers")
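To complement the visual comparison, the influence of the flagged observations can be summarized by how far the loading vectors move once they are removed. This is a rough sketch under stated assumptions: the function name loading_similarity is hypothetical, and absolute cosine similarity is one possible agreement measure among several.

```r
# Sketch: measure how much the loadings change after removing flagged outliers.
X <- iris[, -5]

pca_full <- prcomp(X, scale. = TRUE)
scores   <- pca_full$x[, 1:2]

# Same rule as in the text: squared Mahalanobis distance vs. a chi-square cutoff
d2   <- mahalanobis(scores, colMeans(scores), cov(scores))
keep <- d2 <= qchisq(0.975, df = 2)

pca_trimmed <- prcomp(X[keep, ], scale. = TRUE)

# Hypothetical helper: absolute cosine similarity between matching loading
# vectors. Loadings have unit length, so the dot product is the cosine;
# the absolute value is taken because the sign of a component is arbitrary.
loading_similarity <- function(a, b, k = 2) {
  sapply(seq_len(k), function(j) abs(sum(a$rotation[, j] * b$rotation[, j])))
}

similarity <- loading_similarity(pca_full, pca_trimmed)
names(similarity) <- c("PC1", "PC2")
print(round(similarity, 3))
```

Values close to 1 indicate that the component directions are essentially unchanged, while lower values point to components that depend heavily on the removed observations.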

Variable selection:

The choice of variables included in the PCA can have a significant impact on the results. Sensitivity analysis should explore different variable combinations to evaluate their influence on the principal components.

# Perform PCA with different variable combinations
pca_comb1 <- PCA(X[, c(1, 2, 3)], scale.unit = TRUE)

pca_comb2 <- PCA(X[, c(1, 2, 4)], scale.unit = TRUE)

# Visual comparison
fviz_pca_ind(pca_comb1, geom = "point", col.ind = "cos2") + ggtitle("PCA with variables 1, 2, and 3")

fviz_pca_ind(pca_comb2, geom = "point", col.ind = "cos2") + ggtitle("PCA with variables 1, 2, and 4")
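The two combinations above can be extended to every three-variable subset. The sketch below is illustrative only: it assumes base R's prcomp() and combn(), and the object name subset_summary is hypothetical.

```r
# Sketch: run PCA on every 3-variable subset and tabulate the variance shares.
X <- iris[, -5]

# All combinations of 3 column indices out of the 4 available
subsets <- combn(ncol(X), 3, simplify = FALSE)

subset_summary <- do.call(rbind, lapply(subsets, function(idx) {
  fit <- prcomp(X[, idx], scale. = TRUE)
  var_share <- fit$sdev^2 / sum(fit$sdev^2)
  data.frame(
    variables = paste(names(X)[idx], collapse = " + "),
    PC1 = round(var_share[1], 3),
    PC2 = round(var_share[2], 3)
  )
}))
print(subset_summary)
```

If the share of variance on the first component shifts markedly between rows, the dropped variable plays a large role in defining that component.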

Sample size and data variability:

Small sample sizes and low data variability make PCA results more sensitive, because the covariance structure is then estimated less precisely. Researchers should examine the robustness of the PCA results by repeating the analysis on samples of different sizes and assessing the impact of data variability.

# Randomly select different sample sizes (both drawn without replacement,
# so the two subsets are comparable and contain no duplicated rows)
set.seed(42)
small_sample <- X[sample(nrow(X), 30), ]
large_sample <- X[sample(nrow(X), 120), ]

# Perform PCA with different sample sizes
pca_small_sample <- PCA(small_sample, scale.unit = TRUE)

pca_large_sample <- PCA(large_sample, scale.unit = TRUE)

# Visual comparison
fviz_pca_ind(pca_small_sample, geom = "point", col.ind = "cos2") + ggtitle("PCA with small sample size")

fviz_pca_ind(pca_large_sample, geom = "point", col.ind = "cos2") + ggtitle("PCA with large sample size")
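A single draw at each sample size says little about variability, so one option is to repeat the subsampling many times and look at the spread of a summary statistic. The sketch below is a minimal illustration: the helper pc1_share, the sample sizes, and the number of repetitions are all assumptions chosen for the example.

```r
# Sketch: repeated subsampling to see how stable PC1's variance share is.
X <- iris[, -5]
set.seed(42)

# Hypothetical helper: share of total variance explained by the first component
pc1_share <- function(data) {
  fit <- prcomp(data, scale. = TRUE)
  fit$sdev[1]^2 / sum(fit$sdev^2)
}

sizes <- c(30, 60, 120)  # illustrative sample sizes
n_rep <- 200             # illustrative number of repetitions

# For each size, draw n_rep subsamples without replacement and summarize
stability <- sapply(sizes, function(n) {
  shares <- replicate(n_rep, pc1_share(X[sample(nrow(X), n), ]))
  c(n = n, mean = mean(shares), sd = sd(shares))
})
print(round(t(stability), 3))
```

The sd column gives a rough sense of how much the result could change from one sample to another at each size.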

Data perturbation:

Sensitivity analysis should also evaluate the impact of small changes in the data values on PCA results. This can be achieved by perturbing the original data with random noise.

# Perturb the original data with random noise
set.seed(42)
noise <- rnorm(n = nrow(X) * ncol(X), mean = 0, sd = 0.1)
perturbed_data <- X + matrix(noise, nrow = nrow(X), ncol = ncol(X))

# Perform PCA on perturbed data
pca_perturbed_data <- PCA(perturbed_data, scale.unit = TRUE)

# Visual comparison
fviz_pca_ind(pca_with_scaling, geom = "point", col.ind = "cos2") + ggtitle("PCA with original data")

fviz_pca_ind(pca_perturbed_data, geom = "point", col.ind = "cos2") + ggtitle("PCA with perturbed data")
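One perturbed data set gives a single snapshot. Repeating the perturbation at several noise levels gives a rough picture of how quickly the first component drifts. This is a sketch under assumptions: the helper pc1_angle, the noise levels, and the use of the angle between loading vectors as the stability measure are illustrative choices, not part of the original analysis.

```r
# Sketch: average angle (in degrees) between the original PC1 loadings and
# the PC1 loadings obtained after adding Gaussian noise to the data.
X <- as.matrix(iris[, -5])
set.seed(42)

reference <- prcomp(X, scale. = TRUE)$rotation[, 1]

# Hypothetical helper: perturb the data once and return the PC1 angle
pc1_angle <- function(noise_sd) {
  noisy   <- X + matrix(rnorm(length(X), sd = noise_sd), nrow = nrow(X))
  loading <- prcomp(noisy, scale. = TRUE)$rotation[, 1]
  # abs() ignores arbitrary sign flips; min() guards against rounding above 1
  cosine <- min(1, abs(sum(reference * loading)))
  acos(cosine) * 180 / pi
}

noise_levels <- c(0.05, 0.1, 0.2)  # illustrative noise standard deviations
angles <- sapply(noise_levels, function(s) mean(replicate(100, pc1_angle(s))))
names(angles) <- paste0("sd=", noise_levels)
print(round(angles, 2))
```

Small angles suggest that the first component is robust to measurement noise of that magnitude.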

Conclusion

Sensitivity analysis is crucial for assessing the stability and robustness of PCA results. By considering factors such as data preprocessing, outliers, variable selection, sample size, data variability, and data perturbation, researchers can better understand the reliability of their PCA results and draw more accurate conclusions. Implementing sensitivity analysis in R Markdown, as demonstrated in this paper, lets researchers visualize and compare PCA results under different conditions.