Principal Component Analysis (PCA)

dr. Annelies Agten

2026-04-27

Principal Components Analysis (PCA) is a widely used technique for reducing the dimensionality of a dataset while preserving as much variation as possible. When working with many variables, it can be difficult to identify patterns or relationships directly. PCA addresses this by transforming the original variables into a new set of uncorrelated variables called principal components.

Each principal component is a linear combination of the original variables and is constructed to capture the maximum possible variance. The first principal component explains the largest amount of variation in the data, the second explains the next largest amount (while being uncorrelated with the first), and so on.

By focusing on the first few principal components, we can often summarize the structure of the data in a lower-dimensional space. This makes PCA a powerful tool for visualization, noise reduction, and exploratory data analysis.

In the following section, we will learn how to perform PCA in R and how to interpret the results.

PCA Analysis in R

The prcomp function serves as a great tool for PCA performance and calculates the principal components which are those variables that account for the most variation in a dataset. The basic syntax of the prcomp function is:

Syntax:
prcomp(x, center=TRUE,scale.=FALSE,rank.=NULL)

Where:

Read data

First we need to read in our data into R.Throughtout this example we will use the wine data. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.

The attributes are:

The wine data is in a .txt format, so to read in the data we can use the read.table() function in R.

wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep=",")

colnames(wine) <- c("Cultivar","Alcohol","Malic acid","Ash","Alcalinity of ash","Magnesium","Total phenols","Flavanoids","Nonflavanoid phenols","Proanthocyanins","Color intensity","Hue","OD280/OD315 of diluted wines","Proline")

dim(wine)
#> [1] 178  14

head(wine, 5)
#>   Cultivar Alcohol Malic acid  Ash Alcalinity of ash Magnesium Total phenols
#> 1        1   14.23       1.71 2.43              15.6       127          2.80
#> 2        1   13.20       1.78 2.14              11.2       100          2.65
#> 3        1   13.16       2.36 2.67              18.6       101          2.80
#> 4        1   14.37       1.95 2.50              16.8       113          3.85
#> 5        1   13.24       2.59 2.87              21.0       118          2.80
#>   Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity  Hue
#> 1       3.06                 0.28            2.29            5.64 1.04
#> 2       2.76                 0.26            1.28            4.38 1.05
#> 3       3.24                 0.30            2.81            5.68 1.03
#> 4       3.49                 0.24            2.18            7.80 0.86
#> 5       2.69                 0.39            1.82            4.32 1.04
#>   OD280/OD315 of diluted wines Proline
#> 1                         3.92    1065
#> 2                         3.40    1050
#> 3                         3.17    1185
#> 4                         3.45    1480
#> 5                         2.93     735

The wine dataset contains 178 observations of 14 variables, including the 13 measured quantities of chemicals and the variable Cultivar, which indicates the type of grape from which the wine was produced. Since this is a categorical variable describing group membership rather than a measured chemical property, it is not included in the PCA. PCA is applied only to quantitative explanatory variables, as its goal is to summarize variation in the measured features. In this context, Cultivar can instead be used afterward to help interpret or visualize the resulting principal components.

Before performing PCA, it is important to check whether all variables are measured on the same scale. If not, the data should be standardized so that each variable contributes equally to the analysis.

wine_range <- data.frame(cbind(sapply(wine[,2:14],min),
                 sapply(wine[,2:14],max),
                 round(sapply(wine[,2:14],var),2))) 
names(wine_range) <- c("min value","max value","variance")

wine_range
#>                              min value max value variance
#> Alcohol                          11.03     14.83     0.66
#> Malic acid                        0.74      5.80     1.25
#> Ash                               1.36      3.23     0.08
#> Alcalinity of ash                10.60     30.00    11.15
#> Magnesium                        70.00    162.00   203.99
#> Total phenols                     0.98      3.88     0.39
#> Flavanoids                        0.34      5.08     1.00
#> Nonflavanoid phenols              0.13      0.66     0.02
#> Proanthocyanins                   0.41      3.58     0.33
#> Color intensity                   1.28     13.00     5.37
#> Hue                               0.48      1.71     0.05
#> OD280/OD315 of diluted wines      1.27      4.00     0.50
#> Proline                         278.00   1680.00 99166.72

The measured attributes have very different ranges, so the data should be standardized before performing PCA to ensure that all variables contribute equally to the analysis.

wine_stand <- as.data.frame(scale(wine)) # standardize data by subtracting the mean and deviding by the sd

wine_stand_range <- data.frame(cbind(sapply(wine_stand[,2:14],min),
                             sapply(wine_stand[,2:14],max),
                             round(sapply(wine_stand[,2:14],var),2)))
names(wine_stand_range) <- c("min value","max value","variance")

wine_stand_range
#>                              min value max value variance
#> Alcohol                      -2.427388  2.253415        1
#> Malic acid                   -1.428952  3.100446        1
#> Ash                          -3.668813  3.147447        1
#> Alcalinity of ash            -2.663505  3.145637        1
#> Magnesium                    -2.082381  4.359076        1
#> Total phenols                -2.101318  2.532372        1
#> Flavanoids                   -1.691200  3.054216        1
#> Nonflavanoid phenols         -1.862979  2.395645        1
#> Proanthocyanins              -2.063214  3.475269        1
#> Color intensity              -1.629691  3.425768        1
#> Hue                          -2.088840  3.292407        1
#> OD280/OD315 of diluted wines -1.889723  1.955399        1
#> Proline                      -1.488987  2.963114        1

Perform PCA analysis

The prcomp function is used to execute PCA on the wine data. Depending on whether you standaridized the data before analysis, you have to specify the right option for the scale. parameter.

wine.pca <- prcomp(wine_stand[,2:14]) # already standardized the data before
wine.pca <- prcomp(wine[,2:14],scale=T) # standardize data automatically

summary(wine.pca)
#> Importance of components:
#>                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
#> Standard deviation     2.169 1.5802 1.2025 0.95863 0.92370 0.80103 0.74231
#> Proportion of Variance 0.362 0.1921 0.1112 0.07069 0.06563 0.04936 0.04239
#> Cumulative Proportion  0.362 0.5541 0.6653 0.73599 0.80162 0.85098 0.89337
#>                            PC8     PC9   PC10    PC11    PC12    PC13
#> Standard deviation     0.59034 0.53748 0.5009 0.47517 0.41082 0.32152
#> Proportion of Variance 0.02681 0.02222 0.0193 0.01737 0.01298 0.00795
#> Cumulative Proportion  0.92018 0.94240 0.9617 0.97907 0.99205 1.00000

We can see that the first principal component explains 36.2% of the total variance, the second principal component explains 19.21% of the total variance, etc… The first five PC’s retain 80.16% of the total variance in the data.

The output of prcomp() is an object that contains several useful components. The most important ones are:

wine.pca$sdev
#>  [1] 2.1692972 1.5801816 1.2025273 0.9586313 0.9237035 0.8010350 0.7423128
#>  [8] 0.5903367 0.5374755 0.5009017 0.4751722 0.4108165 0.3215244

head(wine.pca$rotation)
#>                            PC1         PC2         PC3         PC4         PC5
#> Alcohol           -0.144329395 -0.48365155 -0.20738262 -0.01785630  0.26566365
#> Malic acid         0.245187580 -0.22493093  0.08901289  0.53689028 -0.03521363
#> Ash                0.002051061 -0.31606881  0.62622390 -0.21417556  0.14302547
#> Alcalinity of ash  0.239320405  0.01059050  0.61208035  0.06085941 -0.06610294
#> Magnesium         -0.141992042 -0.29963400  0.13075693 -0.35179658 -0.72704851
#> Total phenols     -0.394660845 -0.06503951  0.14617896  0.19806835  0.14931841
#>                           PC6         PC7         PC8         PC9        PC10
#> Alcohol           -0.21353865 -0.05639636 -0.39613926 -0.50861912 -0.21160473
#> Malic acid        -0.53681385  0.42052391 -0.06582674  0.07528304  0.30907994
#> Ash               -0.15447466 -0.14917061  0.17026002  0.30769445  0.02712539
#> Alcalinity of ash  0.10082451 -0.28696914 -0.42797018 -0.20044931 -0.05279942
#> Magnesium         -0.03814394  0.32288330  0.15636143 -0.27140257 -0.06787022
#> Total phenols      0.08412230 -0.02792498  0.40593409 -0.28603452  0.32013135
#>                          PC11        PC12        PC13
#> Alcohol            0.22591696  0.26628645 -0.01496997
#> Malic acid        -0.07648554 -0.12169604 -0.02596375
#> Ash                0.49869142  0.04962237  0.14121803
#> Alcalinity of ash -0.47931378  0.05574287 -0.09168285
#> Magnesium         -0.07128891 -0.06222011 -0.05677422
#> Total phenols     -0.30434119  0.30388245  0.46390791

head(wine.pca$x)
#>            PC1        PC2        PC3        PC4        PC5        PC6
#> [1,] -3.307421 -1.4394023 -0.1652728 -0.2150246 -0.6910933 -0.2232504
#> [2,] -2.203250  0.3324551 -2.0207571 -0.2905387  0.2569299 -0.9245123
#> [3,] -2.509661 -1.0282507  0.9800541  0.7228632  0.2503270  0.5477310
#> [4,] -3.746497 -2.7486184 -0.1756962  0.5663856  0.3109644  0.1141091
#> [5,] -1.006070 -0.8673840  2.0209873 -0.4086131 -0.2976180 -0.4053761
#> [6,] -3.041674 -2.1164309 -0.6276254 -0.5141870  0.6302409  0.1230834
#>              PC7         PC8         PC9        PC10       PC11          PC12
#> [1,]  0.59474883  0.06495586 -0.63963836 -1.01808396  0.4502932 -0.5392891439
#> [2,]  0.05362434  1.02153432  0.30797798 -0.15925214  0.1422560 -0.3871456499
#> [3,]  0.42301218 -0.34324787  1.17452129 -0.11304198  0.2858665 -0.0005819316
#> [4,] -0.38225899  0.64178311 -0.05239662 -0.23873915 -0.7574476  0.2413387757
#> [5,]  0.44282531  0.41552831 -0.32589984  0.07814604  0.5244656  0.2160546934
#> [6,]  0.40052393  0.39378261  0.15171810  0.10170891 -0.4044444  0.3783653606
#>              PC13
#> [1,]  0.066052305
#> [2,] -0.003626273
#> [3,] -0.021655423
#> [4,]  0.368444194
#> [5,]  0.079140320
#> [6,] -0.144747017

Visualize results of PCA

The x component of the prcomp() output contains the scores, i.e. the coordinates of each observation in the principal component space. These scores can be used to create scatterplots of the first few principal components (e.g., PC1 vs PC2), which often reveal patterns or structure in the data.

plot(wine.pca$x[,1],wine.pca$x[,2],pch=19, xlab="PC1", ylab='PC2', main='Scatterplot PC1 vs PC2')
plot(wine.pca$x[,1],wine.pca$x[,3],pch=19, xlab="PC1", ylab='PC3', main='Scatterplot PC1 vs PC3') 
plot(wine.pca$x[,2],wine.pca$x[,3],pch=19, xlab="PC2", ylab='PC3', main='Scatterplot PC2 vs PC3') 

By overlaying or coloring the points according to the Cultivar variable, we can visually assess whether the different wine types form distinct groups based on the principal components.

plot(wine.pca$x[,1],wine.pca$x[,2],pch=19, xlab="PC1", ylab='PC2', main='Scatterplot PC1 vs PC2',col=wine$Cultivar) # make a scatterplot
legend('topright',legend=c("Class 1", "Class 2", "Class 3"),col=1:3, pch=19)
plot(wine.pca$x[,1],wine.pca$x[,3],pch=19, xlab="PC1", ylab='PC3', main='Scatterplot PC1 vs PC3',col=wine$Cultivar) # make a scatterplot
legend('topright',legend=c("Class 1", "Class 2", "Class 3"),col=1:3, pch=19)
plot(wine.pca$x[,2],wine.pca$x[,3],pch=19, xlab="PC2", ylab='PC3', main='Scatterplot PC2 vs PC3',col=wine$Cultivar) # make a scatterplot
legend('topright',legend=c("Class 1", "Class 2", "Class 3"),col=1:3, pch=19)

How many components to retain?

To determine how many principal components to retain, a common approach is to examine a scree plot. This plot displays the amount of variance explained by each principal component, typically in decreasing order. By inspecting the plot, we look for an “elbow” point where the explained variance begins to level off. The components before this point capture most of the important structure in the data, while those after it contribute relatively little and can often be discarded.

There are several ways to produce a scree plot in R, the simplest being the plot() function of the prcomp output results.

plot(wine.pca) # scree plot

Another option is provided by the screeplot() function:

screeplot(wine.pca, type="lines", main='Scree plot',pch=19,cex=1.5)
abline(h=1,lty='dotted',col='red') # add a horizontal line for eigenvalue = 1

Biplot

A biplot is a graphical representation of PCA that displays both the observations and the variables in the same plot. The observations are shown as points using their scores (from the x matrix), while the original variables are represented as arrows based on their loadings.

This combined visualization allows us to interpret the principal components more easily. The direction and length of the arrows indicate how strongly each variable contributes to the components, while the position of the points shows similarities or differences between observations. Additionally, the angles between arrows provide insight into correlations between variables: small angles indicate strong positive correlation, angles close to 180° indicate negative correlation, and angles around 90° suggest little or no correlation.

There are several ways to create a biplot in R, some of which require functions from specific packages. In this section, we will illustrate three common approaches: the base R biplot() function, the autoplot() function (from the ggplot2 ecosystem via the ggfortify package), and the ggbiplot() function from the ggbiplot package. Each method offers different advantages in terms of customization and visualization style.

Note that if a package has not been used before, it must first be installed using the install.packages()function. This only needs to be done once on a given system. After installation, the package must be loaded into the R session using thelibrary()` function each time you start R or want to use the package. Only after loading the package will its functions be available for use.

install.packages('ggplot2')
install.packages('ggfortify')
install.packages('ggbiplot')
biplot(wine.pca,col=c('black','red'),pch=19)

wine$Cultivar <- as.factor(wine$Cultivar) #CUltivar is a categorical variable

# Create a biplot using ggplot2 and ggfortify

library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.5.3
library(ggfortify)
#> Warning: package 'ggfortify' was built under R version 4.5.3

autoplot(wine.pca, data = wine, colour = 'Cultivar',
         loadings = TRUE, loadings.label = TRUE, loadings.label.size = 4)
#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation idioms with `aes()`.
#> ℹ See also `vignette("ggplot2-in-packages")` for more information.
#> ℹ The deprecated feature was likely used in the ggfortify package.
#>   Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
#> This warning is displayed once per session.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.

# Create a biplot using ggbiplot

library(ggbiplot)
#> Warning: package 'ggbiplot' was built under R version 4.5.3
#> 
#> Attaching package: 'ggbiplot'
#> The following object is masked _by_ '.GlobalEnv':
#> 
#>     wine
#> The following object is masked from 'package:ggfortify':
#> 
#>     ggbiplot

ggbiplot(wine.pca,groups=wine$Class)