Principal Component Analysis in R

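Principal Component Analysis (PCA) is a practical technique for exploratory data analysis. It helps you visualize the variation in a dataset that contains many variables.
It is particularly useful for "wide" datasets, where each sample has many variables.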

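In this document you will learn:

What PCA is, and how principal components relate to eigenvalues and eigenvectors;
How to run a PCA on a simple dataset;
How to plot the PCA results and make a first interpretation of the plots;
How to customize the plots with the ggbiplot package so that they are cleaner and more informative;
How to add a new sample to an existing PCA plot by projecting it onto the original principal component space;
Finally, a summary and review of the overall PCA workflow.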

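1. Introduction to PCA

As mentioned above, PCA is especially well suited to wide datasets.
When a dataset contains many variables, plotting the raw data directly is impractical, and it is hard to get an intuitive sense of how the data are distributed or how the samples differ.

PCA addresses this by applying a mathematical transformation that reduces high-dimensional data to a few composite indices (the principal components), so that the overall "shape" of the data becomes visible.
You can then see which samples are similar to one another and which differ, and go on to examine which variables drive those differences.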

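2. The basic idea of PCA

The mathematics behind PCA is fairly involved, but the core idea is intuitive:

We look for a new set of coordinate axes (the principal components) along which the variance of the data is as large as possible.

Specifically:

We apply a linear transformation to the original variables;
The transformation yields a new set of variables called principal components (PCs);
The first principal component (PC1) is the direction of greatest variance in the data, that is, the line along which the data are most spread out;
The second principal component (PC2) is orthogonal (perpendicular) to the first and is the direction of greatest remaining variance;
And so on: each new principal component is orthogonal to all of the previous ones and explains a progressively smaller share of the variance.

In other words, PCA is a linear transformation that turns a set of correlated variables into a set of uncorrelated principal components, each capturing variation along a different direction in the data.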
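
The properties listed above can be checked numerically. The following is a minimal sketch on simulated data; the variables `x` and `y` and the object name `toy.pca` are made up for illustration and are not part of the mtcars analysis later in this document.

```r
# Simulate two correlated variables (hypothetical data, for illustration only)
set.seed(42)
x <- rnorm(200)
y <- 0.8 * x + rnorm(200, sd = 0.4)
toy <- cbind(x = x, y = y)

toy.pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# The loadings (columns of $rotation) are orthonormal:
# their cross-product is the identity matrix
round(crossprod(toy.pca$rotation), 10)

# The original variables are correlated, but the PC scores are not
cor(toy)
round(cor(toy.pca$x), 10)

# Variances along the components come out in decreasing order
toy.pca$sdev^2
```

The off-diagonal entries of `cor(toy.pca$x)` should be zero up to rounding, which is what "uncorrelated principal components" means in practice.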

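3. How principal components relate to the variables

When the original variables are strongly correlated, they tend to contribute jointly to the same principal component.
Each principal component accounts for a certain proportion of the total variation in the data:

The first few principal components usually explain most of the variation;
As more principal components are retained, the approximation to the original data improves, but the result becomes harder to interpret.

For this reason, it is common to keep only the first few principal components that together explain roughly 80% to 95% of the variance, and to use them for visualization and pattern recognition.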
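
One way to apply this rule of thumb is to compute the cumulative proportion of variance and pick the first component that crosses a chosen threshold. The sketch below uses the same mtcars columns as the analysis later in this document; the names `var.explained`, `cum.explained`, `threshold`, and `n.keep` are illustrative, and the 0.80 cutoff is an arbitrary choice within the 80% to 95% range.

```r
# Fit the PCA on the numeric mtcars columns used later in this document
pca <- prcomp(mtcars[, c(1:7, 10, 11)], center = TRUE, scale. = TRUE)

# Proportion of variance per component, and the running total
var.explained <- pca$sdev^2 / sum(pca$sdev^2)
cum.explained <- cumsum(var.explained)

# Smallest number of components whose cumulative share reaches the threshold
# (0.80 is an assumed cutoff, not a fixed rule)
threshold <- 0.80
n.keep <- which(cum.explained >= threshold)[1]

n.keep
round(cum.explained[seq_len(n.keep)], 4)
```

The same numbers appear in the "Cumulative Proportion" row of `summary()` for a `prcomp` object, so this is mainly useful when the choice needs to be made programmatically.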

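4. Eigenvalues and eigenvectors

The key to understanding PCA is the concept of eigenvalue-eigenvector pairs.

Each principal component is defined by an eigenvector and an eigenvalue of the data's covariance matrix (or of its correlation matrix, when the variables are standardized):

Eigenvector: gives the direction (for example, "horizontal" or "a 45° diagonal");
Eigenvalue: gives the variance of the data along that direction, that is, its "importance".

The eigenvector of the first principal component has the largest eigenvalue;
The eigenvector of the second principal component has the next largest eigenvalue and is orthogonal to the first;
And so on.

A dataset has as many eigenvalue-eigenvector pairs as it has dimensions (variables).
For example, a two-dimensional dataset has two pairs, and a three-dimensional dataset has three.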
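
The link between eigenpairs and principal components can be made concrete by comparing `eigen()` with `prcomp()`. This is a sketch under the assumption that the variables are standardized, in which case PCA corresponds to an eigen-decomposition of the correlation matrix; the object names `X`, `pca`, and `eig` are illustrative.

```r
X <- mtcars[, c(1:7, 10, 11)]
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Eigen-decomposition of the correlation matrix of the same variables
eig <- eigen(cor(X))

# Eigenvalues equal the squared standard deviations reported by prcomp
all.equal(eig$values, pca$sdev^2)

# Eigenvectors match the loadings up to sign, so compare absolute values
all.equal(abs(eig$vectors), abs(unname(pca$rotation)))

# There is one eigenpair per variable
length(eig$values) == ncol(X)
```

The sign of an eigenvector is arbitrary, which is why two PCA implementations can produce plots that are mirror images of each other while describing the same components.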

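5. The geometric meaning of PCA

In essence, PCA does not change the data themselves; it changes the angle from which we look at them.
By using the eigenvectors to redefine the coordinate system, we express the original data from a more informative viewpoint, which makes their structure and differences easier to see.

This change of viewpoint is what lets us uncover hidden regularities and distribution patterns in complex multidimensional data.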
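
The claim that PCA only changes the viewpoint can be illustrated by rotating the data into the new coordinate system and back again. The sketch below works on standardized mtcars columns and keeps all components; `X`, `scores`, and `X.back` are illustrative names.

```r
# Standardize the variables first, then run PCA on the standardized matrix
X <- scale(mtcars[, c(1:7, 10, 11)])
pca <- prcomp(X)

# The scores are the standardized data expressed in the new coordinate system
scores <- X %*% pca$rotation
all.equal(scores, pca$x, check.attributes = FALSE)

# Rotating back with the transposed loadings recovers the standardized data
X.back <- pca$x %*% t(pca$rotation)
all.equal(X.back, X, check.attributes = FALSE)

# Pairwise distances between cars are unchanged by the rotation
all.equal(as.vector(dist(X)), as.vector(dist(pca$x)))
```

Information is lost only when some components are dropped; with all of them retained, the transformation is a pure rotation of the standardized data.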

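6. Summary

PCA is a powerful tool for dimensionality reduction and pattern recognition. It is particularly useful when:

The data are high-dimensional and the variables are correlated;
You need to explore how samples are distributed and identify clustering tendencies;
You want to visualize complex data structure in two or three dimensions.

The hands-on part that follows works through a PCA in R, from data preprocessing and principal component extraction to visualization of the results, covering both the analysis and its interpretation.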

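# Principal Component Analysis of the mtcars dataset

# Load ggbiplot, which provides the ggbiplot() function used below
# (assumes the package is already installed)
library(ggbiplot)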

# Select relevant variables for the PCA analysis
# Exclude vs (8) and am (9) columns, using columns 1:7, 10, 11
# These include metrics like mpg, cyl, disp, hp, drat, wt, qsec, gear, and carb
mtcars.pca <- prcomp(mtcars[,c(1:7,10,11)], center = TRUE, scale. = TRUE)

# Display summary statistics of the PCA
# This shows the importance of components, including standard deviation, proportion of variance, and cumulative proportion
summary(mtcars.pca)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.3782 1.4429 0.71008 0.51481 0.42797 0.35184 0.32413
## Proportion of Variance 0.6284 0.2313 0.05602 0.02945 0.02035 0.01375 0.01167
## Cumulative Proportion  0.6284 0.8598 0.91581 0.94525 0.96560 0.97936 0.99103
##                           PC8     PC9
## Standard deviation     0.2419 0.14896
## Proportion of Variance 0.0065 0.00247
## Cumulative Proportion  0.9975 1.00000

# Create a basic biplot visualization of the PCA results
ggbiplot(mtcars.pca)

# Add car model names as labels to the biplot
ggbiplot(mtcars.pca, labels=rownames(mtcars))

# Create a grouping variable based on car origin
mtcars.country <- c(rep("Japan", 3), rep("US", 4), rep("Europe", 7), rep("US",3), "Europe",
          rep("Japan", 3), rep("US", 4), rep("Europe", 3), "US", rep("Europe", 3))
          
# Create an enhanced biplot with confidence ellipses grouped by country of origin
ggbiplot(mtcars.pca, ellipse=TRUE, labels=rownames(mtcars), groups=mtcars.country)

# Demonstration: Adding a new hypothetical car "spacecar" with extreme values
spacecar <- c(1000,60,50,500,0,0.5,2.5,0,1,0,0)
mtcarsplus <- rbind(mtcars, spacecar)  # Add the new car to the dataset
mtcars.countryplus <- c(mtcars.country, "Jupiter")  # Add origin for the new car

# Perform PCA on the extended dataset (including spacecar)
mtcarsplus.pca <- prcomp(mtcarsplus[,c(1:7,10,11)], center = TRUE, scale. = TRUE)

# Visualize the new PCA with the additional sample included in the analysis
ggbiplot(mtcarsplus.pca, obs.scale = 1, var.scale = 1, ellipse = TRUE, 
     labels=c(rownames(mtcars), "spacecar"), groups=mtcars.countryplus) +
  scale_colour_manual(name="Origin", values= c("forest green", "red3", "violet", "dark blue")) +
  ggtitle("PCA of mtcars dataset, with extra sample added") +
  theme(legend.position = "bottom")

# Alternative approach: Project the new sample onto the original PCA space
# This preserves the original structure and shows where the new sample would fall

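# Center and scale the new sample with the original PCA's center and scale
# (the PCA was fitted with scale. = TRUE, so both must be reused here)
s.sc <- scale(t(spacecar[c(1:7,10,11)]), center = mtcars.pca$center, scale = mtcars.pca$scale)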
# Apply the original PCA's rotation matrix to project the new sample
s.pred <- s.sc %*% mtcars.pca$rotation

# Create a new PCA object with the projected values
mtcars.plusproj.pca <- mtcars.pca
mtcars.plusproj.pca$x <- rbind(mtcars.plusproj.pca$x, s.pred)

# Visualize the PCA with the projected sample
# Use this approach when the original components should stay fixed,
# so that the new sample does not influence the axes it is plotted on
ggbiplot(mtcars.plusproj.pca, obs.scale = 1, var.scale = 1, ellipse = TRUE, 
     labels=c(rownames(mtcars), "spacecar"), groups=mtcars.countryplus) +
  scale_colour_manual(name="Origin", values= c("forest green", "red3", "violet", "dark blue")) +
  ggtitle("PCA of mtcars dataset, with extra sample projected") +
  theme(legend.position = "bottom")
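
A quick sanity check for a projection like the one above is to project a car that was already in the data and confirm that its fitted score is reproduced. The sketch below does this both by hand and with `predict()`, which for `prcomp` objects applies the stored center, scale, and rotation. The names `vars`, `pca`, `new.car`, `manual`, and `auto` are illustrative, and "Mazda RX4" is simply the first row of mtcars.

```r
vars <- c(1:7, 10, 11)
pca <- prcomp(mtcars[, vars], center = TRUE, scale. = TRUE)

# Treat one existing car as if it were a new observation
new.car <- mtcars["Mazda RX4", vars]

# Manual projection: standardize with the stored center and scale, then rotate
manual <- scale(new.car, center = pca$center, scale = pca$scale) %*% pca$rotation

# predict() applies the same center, scale, and rotation
auto <- predict(pca, newdata = new.car)

# Both agree with each other and with the score computed during fitting
all.equal(manual, auto, check.attributes = FALSE)
all.equal(as.vector(auto), as.vector(pca$x["Mazda RX4", ]))
```

If the check fails for a known row, the standardization step is the usual suspect: a PCA fitted with `scale. = TRUE` needs new samples divided by the stored `$scale` as well as shifted by `$center`.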