Principal Component Analysis

Author

Roshan Saha

What is Principal component analysis?

Principal Component Analysis (PCA) is a widely used technique for reducing the dimensionality of multivariate data by transforming a large set of correlated variables into a smaller set of uncorrelated variables called principal components. The first few principal components are constructed to capture as much of the variance in the original data as possible, making subsequent analysis simpler and interpretation easier. While this involves some loss of detail, most of the important patterns are preserved, facilitating exploration, visualization, and further modeling.

Why do we need it?

PCA has been used extensively in economics and social sciences to construct comprehensive indices (Ghosh, Bhowmick, and Saha 2019).

Example using data on honey bee behavior

Inspired by recent studies on honey bee behavior and their social networks—such as those by Wild et al. (2021) and Smith et al. (2022), which provide replication packages—I decided to use a hypothetical dataset to demonstrate the concept of PCA. These bee studies show that individual bees exhibit distinct behaviors at different stages of their life. With multivariate recordings of bee behaviors (for example, the time spent on brood care, foraging activities, and honey processing), PCA helps uncover the main behavioral axes or trends in the colony.

The central idea is that, by applying PCA to these behavioral variables, we can calculate a principal component score for each bee. This score gives us an objective summary—a “location” or position in behavioral space—indicating where each bee stands in relation to the main behavioral patterns observed in the colony. Thus, PCA functions as a powerful tool to explore, summarize, and compare individual bee behavior over time.

Load sample data

In this case it is just a hypothetical dataset that captures behavior patterns of honey bee (N =3).

# Bee behavioral data (Brood Time, Dance Floor Time)
bee <- data.frame(
  Brood = c(0.8, 0.6, 0.2),
  Dance = c(0.2, 0.3, 0.7),
  Honey = c(0.1, 0.5, 0.8),
  row.names = c('A', 'B', 'C')
)
bee

  Brood Dance Honey
A   0.8   0.2   0.1
B   0.6   0.3   0.5
C   0.2   0.7   0.8

Manual calculation of PCA

Standardize it

bee_std <- scale(bee, center = TRUE, scale = TRUE)
bee_std

       Brood      Dance      Honey
A  0.8728716 -0.7559289 -1.0440738
B  0.2182179 -0.3779645  0.0949158
C -1.0910895  1.1338934  0.9491580
attr(,"scaled:center")
    Brood     Dance     Honey 
0.5333333 0.4000000 0.4666667 
attr(,"scaled:scale")
    Brood     Dance     Honey 
0.3055050 0.2645751 0.3511885

calculate the covariance matrix and eigen-decomposition

cov_mat <- cov(bee_std)
cov_mat

           Brood      Dance      Honey
Brood  1.0000000 -0.9897433 -0.9631231
Dance -0.9897433  1.0000000  0.9148074
Honey -0.9631231  0.9148074  1.0000000

# Eigen-decomposition
eig <- eigen(cov_mat)
eig$values      # eigenvalues

[1]  2.912116e+00  8.788435e-02 -1.848521e-16

eig$vectors     # eigenvectors

           [,1]       [,2]       [,3]
[1,]  0.5854870  0.1407577 -0.7983685
[2,] -0.5759886 -0.6207823 -0.5318518
[3,] -0.5704753  0.7712435 -0.2823852

The eig$values show how much variance each principal component captures, and eig$vectors are direction vectors. The first column is PC1- the direction along which the data varies the most.

Calculate PC1 score for each bee (manually). This step essentially conducting a matrix multiplication to get the PC1 scores. bee_std is the data matrix (standardized) of bee behavior, %*% is the matrix multiplication operator in R, and eig$vectors[,1] is the column vector (of PC1) from the eigen-decomposition. If we specified [..,2] or [..,3] we would get the PC2 and PC3 scores, respectively. We compute the dot product of each bee’s behavior with the PC1 weights.

pc1 <- bee_std %*% eig$vectors[,1]
pc1

        [,1]
A  1.5420797
B  0.2913199
C -1.8333996

The above code essentially performs the following mathematical steps.

Mathematical Steps

1. Data Matrix and Centering

Let $X$ be your data matrix of $n$ samples (bees) and $p$ variables (behaviors).

$X_{centered} = X - \text{mean}(X)$

2. Covariance Matrix

The covariance matrix SS summarizes how your behavioral variables co-vary across bees:

$S = \frac{1}{n-1} X_{centered}^T X_{centered}$

3. Eigen Decomposition

Solve for eigenvalues $\lambda$ eigenvectors $w$:

$Sw=λw$

Or equivalently,

$det⁡(S−λI)=0$

The solutions $\lambda_1, \lambda_2 ... \lambda_p$ (where $p$ is number of features/variables) are the eigenvalues of $S$. There are as many eigenvalues/eigenvectors as the number of features/variables. The first principal component (PC1) is the eigenvector associated with the largest eigenvalue.

Each eigenvector is a direction (principal axis); its eigenvalue is the variance explained by this direction.

4. Principal Component Scores

Project each bee onto the new axes (scores):

$t_k=X_{centered} w_k$

This gives the coordinates of each bee along principal component $k$.

Calculate PCA in R

Calculate PC1 score using inbuilt R functions

pca_res <- prcomp(bee, center = TRUE, scale. = TRUE)
summary(pca_res)      # variance explained

Importance of components:
                          PC1     PC2       PC3
Standard deviation     1.7065 0.29645 1.389e-16
Proportion of Variance 0.9707 0.02929 0.000e+00
Cumulative Proportion  0.9707 1.00000 1.000e+00

pca_res$rotation      # loadings (eigenvectors) positive values will indicate bees with higher PC1 values correspond to that behavior and loading with negative values imply that bees with lower PC1 values with correspond to that behavior.

             PC1        PC2       PC3
Brood -0.5854870 -0.1407577 0.7983685
Dance  0.5759886  0.6207823 0.5318518
Honey  0.5704753 -0.7712435 0.2823852

The loadings tell us how much each original feature contributes to each principal component. From above we can say that if PC1 scores are low then the bee most likely participates in Brooding and less in Dance/ Honey. Your PC scores for each observation (bee, individual, etc.) tell you where it sits along the new axes discovered by PCA.

Only by inspecting the loadings table can you tell, what a high PC score implies.

Calculate the PCA for each bee (the scores are the same as we got earlier).

pca_res$x             # projected coordinates (PC scores)

         PC1        PC2           PC3
A -1.5420797  0.2131044 -5.551115e-17
B -0.2913199 -0.3385527  5.551115e-17
C  1.8333996  0.1254483  2.775558e-16

Visually plotting it

plot(pca_res$x, col = "blue", pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Bee Behavior PCA")
text(pca_res$x, labels = rownames(bee), pos = 3)

biplot(pca_res)

Let’s repeat the exercise with another hypothetical data where the number of bees is 1000.

set.seed(42) # for reproducibility
n_bees <- 1000
bee2 <- data.frame(
  Brood = pmax(rnorm(n_bees, mean=0.5, sd=0.2), 0),
  Dance = pmax(rnorm(n_bees, mean=0.3, sd=0.15), 0),
  Honey = pmax(rnorm(n_bees, mean=0.4, sd=0.2), 0)
)
# Normalize so each bee's values sum to 1 (as fractions of time)
row_sums <- rowSums(bee2)
bee2 <- bee2 / row_sums

# See the first few rows
head(bee2)

      Brood     Dance      Honey
1 0.4133286 0.3463619 0.24030952
2 0.3486735 0.3410687 0.31025771
3 0.5335245 0.4151819 0.05129362
4 0.6373316 0.3626684 0.00000000
5 0.6652758 0.1724999 0.16222427
6 0.4119132 0.1809979 0.40708886

# PCA
pca_res2 <- prcomp(bee2, center = TRUE, scale. = TRUE)

# View variance explained
summary(pca_res2)

Importance of components:
                          PC1    PC2       PC3
Standard deviation     1.2866 1.1595 6.303e-16
Proportion of Variance 0.5518 0.4482 0.000e+00
Cumulative Proportion  0.5518 1.0000 1.000e+00

# View loadings (feature contributions to PC1/PC2)
round(pca_res2$rotation, 3)

         PC1    PC2   PC3
Brood  0.678 -0.422 0.602
Dance  0.083  0.858 0.508
Honey -0.731 -0.294 0.616

# projected coordinates
#pca_res2$x

# PCA plot
plot(pca_res2$x[,1:2], pch=19, col=rgb(0,0,1,0.3),
     xlab="PC1", ylab="PC2", main="PCA of Synthetic Bee Behavioral Data")
# Optionally, color by highest feature (likely == behavior type)
col_behavior <- apply(bee2, 1, function(x) which.max(x))
points(pca_res$x[,1:2], col=col_behavior, pch=19)
legend('topright', legend=c('Brood','Dance','Honey'), col=1:3, pch=19)

biplot(pca_res2)

# Show sample bees
bee2[sample(1:n_bees, 5), ]

        Brood     Dance     Honey
793 0.7408302 0.1554068 0.1037630
963 0.4226978 0.2661273 0.3111749
711 0.4933847 0.1672349 0.3393804
671 0.3749236 0.3043983 0.3206781
910 0.1953316 0.3444244 0.4602441

# For each bee, assign color: 1 = Brood (red), 2 = Dance (green), 3 = Honey (blue)
bee_group <- apply(bee2, 1, which.max)
bee_colors <- c('red', 'green', 'blue')[bee_group]

# Plot base PCA scores
plot(pca_res2$x[,1:2],
     xlab="PC1", ylab="PC2", main="PCA of Synthetic Bee Behavioral Data",
     pch=19, col=bee_colors)

legend('topright', legend=c('Brood','Dance','Honey'), col=c('red','green','blue'), pch=19)

References

Ghosh, Nilanjan, Soumya Bhowmick, and Roshan Saha. 2019. “SDG Index and Ease of Doing Business in India: A Sub-National Study.” ORF Occasional Paper 199 (June).

Smith, Michael L., Jacob D. Davidson, Benjamin Wild, David M. Dormagen, Tim Landgraf, and Iain D. Couzin. 2022. “Behavioral Variation Across the Days and Lives of Honey Bees.” iScience 25 (9): 104842. https://doi.org/10.1016/j.isci.2022.104842.

Wild, Benjamin, David M. Dormagen, Adrian Zachariae, Michael L. Smith, Kirsten S. Traynor, Dirk Brockmann, Iain D. Couzin, and Tim Landgraf. 2021. “Social Networks Predict the Life and Death of Honey Bees.” Nature Communications 12 (1): 1110. https://doi.org/10.1038/s41467-021-21212-5.