Principal Components Analysis (PCA) is a widely used technique for reducing the dimensionality of a dataset while preserving as much variation as possible. When working with many variables, it can be difficult to identify patterns or relationships directly. PCA addresses this by transforming the original variables into a new set of uncorrelated variables called principal components.
Each principal component is a linear combination of the original variables and is constructed to capture the maximum possible variance. The first principal component explains the largest amount of variation in the data, the second explains the next largest amount (while being uncorrelated with the first), and so on.
By focusing on the first few principal components, we can often summarize the structure of the data in a lower-dimensional space. This makes PCA a powerful tool for visualization, noise reduction, and exploratory data analysis.
In the following section, we will learn how to perform PCA in R and how to interpret the results.
The prcomp function serves as a great tool for PCA performance and calculates the principal components which are those variables that account for the most variation in a dataset. The basic syntax of the prcomp function is:
Syntax:
prcomp(x, center=TRUE,scale.=FALSE,rank.=NULL)Where:
- x: A numeric matrix or data frame to be analyzed.
- center: logical value indicating whether the variables should be centered to have mean zero. Default is TRUE.
- scale.: A logical value indicating whether the variables should be scaled to have unit variance. Default is FALSE.
- rank.: The number of principal components to retain. If NULL, all components are retained.
First we need to read in our data into R.Throughtout this example we
will use the wine data. These data are the results of a
chemical analysis of wines grown in the same region in Italy but derived
from three different cultivars. The analysis determined the quantities
of 13 constituents found in each of the three types of wines.
The attributes are:
The wine data is in a .txt format, so to read in the
data we can use the read.table() function in R.
wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data", sep=",")
colnames(wine) <- c("Cultivar","Alcohol","Malic acid","Ash","Alcalinity of ash","Magnesium","Total phenols","Flavanoids","Nonflavanoid phenols","Proanthocyanins","Color intensity","Hue","OD280/OD315 of diluted wines","Proline")
dim(wine)
#> [1] 178 14
head(wine, 5)
#> Cultivar Alcohol Malic acid Ash Alcalinity of ash Magnesium Total phenols
#> 1 1 14.23 1.71 2.43 15.6 127 2.80
#> 2 1 13.20 1.78 2.14 11.2 100 2.65
#> 3 1 13.16 2.36 2.67 18.6 101 2.80
#> 4 1 14.37 1.95 2.50 16.8 113 3.85
#> 5 1 13.24 2.59 2.87 21.0 118 2.80
#> Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity Hue
#> 1 3.06 0.28 2.29 5.64 1.04
#> 2 2.76 0.26 1.28 4.38 1.05
#> 3 3.24 0.30 2.81 5.68 1.03
#> 4 3.49 0.24 2.18 7.80 0.86
#> 5 2.69 0.39 1.82 4.32 1.04
#> OD280/OD315 of diluted wines Proline
#> 1 3.92 1065
#> 2 3.40 1050
#> 3 3.17 1185
#> 4 3.45 1480
#> 5 2.93 735The wine dataset contains 178 observations of 14
variables, including the 13 measured quantities of chemicals and the
variable Cultivar, which indicates the type of grape from which the wine
was produced. Since this is a categorical variable describing group
membership rather than a measured chemical property, it is not included
in the PCA. PCA is applied only to quantitative explanatory variables,
as its goal is to summarize variation in the measured features. In this
context, Cultivar can instead be used afterward to help interpret or
visualize the resulting principal components.
Before performing PCA, it is important to check whether all variables are measured on the same scale. If not, the data should be standardized so that each variable contributes equally to the analysis.
wine_range <- data.frame(cbind(sapply(wine[,2:14],min),
sapply(wine[,2:14],max),
round(sapply(wine[,2:14],var),2)))
names(wine_range) <- c("min value","max value","variance")
wine_range
#> min value max value variance
#> Alcohol 11.03 14.83 0.66
#> Malic acid 0.74 5.80 1.25
#> Ash 1.36 3.23 0.08
#> Alcalinity of ash 10.60 30.00 11.15
#> Magnesium 70.00 162.00 203.99
#> Total phenols 0.98 3.88 0.39
#> Flavanoids 0.34 5.08 1.00
#> Nonflavanoid phenols 0.13 0.66 0.02
#> Proanthocyanins 0.41 3.58 0.33
#> Color intensity 1.28 13.00 5.37
#> Hue 0.48 1.71 0.05
#> OD280/OD315 of diluted wines 1.27 4.00 0.50
#> Proline 278.00 1680.00 99166.72The measured attributes have very different ranges, so the data should be standardized before performing PCA to ensure that all variables contribute equally to the analysis.
wine_stand <- as.data.frame(scale(wine)) # standardize data by subtracting the mean and deviding by the sd
wine_stand_range <- data.frame(cbind(sapply(wine_stand[,2:14],min),
sapply(wine_stand[,2:14],max),
round(sapply(wine_stand[,2:14],var),2)))
names(wine_stand_range) <- c("min value","max value","variance")
wine_stand_range
#> min value max value variance
#> Alcohol -2.427388 2.253415 1
#> Malic acid -1.428952 3.100446 1
#> Ash -3.668813 3.147447 1
#> Alcalinity of ash -2.663505 3.145637 1
#> Magnesium -2.082381 4.359076 1
#> Total phenols -2.101318 2.532372 1
#> Flavanoids -1.691200 3.054216 1
#> Nonflavanoid phenols -1.862979 2.395645 1
#> Proanthocyanins -2.063214 3.475269 1
#> Color intensity -1.629691 3.425768 1
#> Hue -2.088840 3.292407 1
#> OD280/OD315 of diluted wines -1.889723 1.955399 1
#> Proline -1.488987 2.963114 1The prcomp function is used to execute PCA on the
wine data. Depending on whether you standaridized the data
before analysis, you have to specify the right option for the
scale. parameter.
wine.pca <- prcomp(wine_stand[,2:14]) # already standardized the data before
wine.pca <- prcomp(wine[,2:14],scale=T) # standardize data automatically
summary(wine.pca)
#> Importance of components:
#> PC1 PC2 PC3 PC4 PC5 PC6 PC7
#> Standard deviation 2.169 1.5802 1.2025 0.95863 0.92370 0.80103 0.74231
#> Proportion of Variance 0.362 0.1921 0.1112 0.07069 0.06563 0.04936 0.04239
#> Cumulative Proportion 0.362 0.5541 0.6653 0.73599 0.80162 0.85098 0.89337
#> PC8 PC9 PC10 PC11 PC12 PC13
#> Standard deviation 0.59034 0.53748 0.5009 0.47517 0.41082 0.32152
#> Proportion of Variance 0.02681 0.02222 0.0193 0.01737 0.01298 0.00795
#> Cumulative Proportion 0.92018 0.94240 0.9617 0.97907 0.99205 1.00000We can see that the first principal component explains 36.2% of the total variance, the second principal component explains 19.21% of the total variance, etc… The first five PC’s retain 80.16% of the total variance in the data.
The output of prcomp() is an object that contains
several useful components. The most important ones are:
wine.pca$sdev
#> [1] 2.1692972 1.5801816 1.2025273 0.9586313 0.9237035 0.8010350 0.7423128
#> [8] 0.5903367 0.5374755 0.5009017 0.4751722 0.4108165 0.3215244
head(wine.pca$rotation)
#> PC1 PC2 PC3 PC4 PC5
#> Alcohol -0.144329395 -0.48365155 -0.20738262 -0.01785630 0.26566365
#> Malic acid 0.245187580 -0.22493093 0.08901289 0.53689028 -0.03521363
#> Ash 0.002051061 -0.31606881 0.62622390 -0.21417556 0.14302547
#> Alcalinity of ash 0.239320405 0.01059050 0.61208035 0.06085941 -0.06610294
#> Magnesium -0.141992042 -0.29963400 0.13075693 -0.35179658 -0.72704851
#> Total phenols -0.394660845 -0.06503951 0.14617896 0.19806835 0.14931841
#> PC6 PC7 PC8 PC9 PC10
#> Alcohol -0.21353865 -0.05639636 -0.39613926 -0.50861912 -0.21160473
#> Malic acid -0.53681385 0.42052391 -0.06582674 0.07528304 0.30907994
#> Ash -0.15447466 -0.14917061 0.17026002 0.30769445 0.02712539
#> Alcalinity of ash 0.10082451 -0.28696914 -0.42797018 -0.20044931 -0.05279942
#> Magnesium -0.03814394 0.32288330 0.15636143 -0.27140257 -0.06787022
#> Total phenols 0.08412230 -0.02792498 0.40593409 -0.28603452 0.32013135
#> PC11 PC12 PC13
#> Alcohol 0.22591696 0.26628645 -0.01496997
#> Malic acid -0.07648554 -0.12169604 -0.02596375
#> Ash 0.49869142 0.04962237 0.14121803
#> Alcalinity of ash -0.47931378 0.05574287 -0.09168285
#> Magnesium -0.07128891 -0.06222011 -0.05677422
#> Total phenols -0.30434119 0.30388245 0.46390791
head(wine.pca$x)
#> PC1 PC2 PC3 PC4 PC5 PC6
#> [1,] -3.307421 -1.4394023 -0.1652728 -0.2150246 -0.6910933 -0.2232504
#> [2,] -2.203250 0.3324551 -2.0207571 -0.2905387 0.2569299 -0.9245123
#> [3,] -2.509661 -1.0282507 0.9800541 0.7228632 0.2503270 0.5477310
#> [4,] -3.746497 -2.7486184 -0.1756962 0.5663856 0.3109644 0.1141091
#> [5,] -1.006070 -0.8673840 2.0209873 -0.4086131 -0.2976180 -0.4053761
#> [6,] -3.041674 -2.1164309 -0.6276254 -0.5141870 0.6302409 0.1230834
#> PC7 PC8 PC9 PC10 PC11 PC12
#> [1,] 0.59474883 0.06495586 -0.63963836 -1.01808396 0.4502932 -0.5392891439
#> [2,] 0.05362434 1.02153432 0.30797798 -0.15925214 0.1422560 -0.3871456499
#> [3,] 0.42301218 -0.34324787 1.17452129 -0.11304198 0.2858665 -0.0005819316
#> [4,] -0.38225899 0.64178311 -0.05239662 -0.23873915 -0.7574476 0.2413387757
#> [5,] 0.44282531 0.41552831 -0.32589984 0.07814604 0.5244656 0.2160546934
#> [6,] 0.40052393 0.39378261 0.15171810 0.10170891 -0.4044444 0.3783653606
#> PC13
#> [1,] 0.066052305
#> [2,] -0.003626273
#> [3,] -0.021655423
#> [4,] 0.368444194
#> [5,] 0.079140320
#> [6,] -0.144747017The x component of the prcomp() output
contains the scores, i.e. the coordinates of each observation in the
principal component space. These scores can be used to create
scatterplots of the first few principal components (e.g., PC1 vs PC2),
which often reveal patterns or structure in the data.
plot(wine.pca$x[,1],wine.pca$x[,2],pch=19, xlab="PC1", ylab='PC2', main='Scatterplot PC1 vs PC2')
plot(wine.pca$x[,1],wine.pca$x[,3],pch=19, xlab="PC1", ylab='PC3', main='Scatterplot PC1 vs PC3')
plot(wine.pca$x[,2],wine.pca$x[,3],pch=19, xlab="PC2", ylab='PC3', main='Scatterplot PC2 vs PC3') By overlaying or coloring the points according to the Cultivar variable, we can visually assess whether the different wine types form distinct groups based on the principal components.
plot(wine.pca$x[,1],wine.pca$x[,2],pch=19, xlab="PC1", ylab='PC2', main='Scatterplot PC1 vs PC2',col=wine$Cultivar) # make a scatterplot
legend('topright',legend=c("Class 1", "Class 2", "Class 3"),col=1:3, pch=19)
plot(wine.pca$x[,1],wine.pca$x[,3],pch=19, xlab="PC1", ylab='PC3', main='Scatterplot PC1 vs PC3',col=wine$Cultivar) # make a scatterplot
legend('topright',legend=c("Class 1", "Class 2", "Class 3"),col=1:3, pch=19)
plot(wine.pca$x[,2],wine.pca$x[,3],pch=19, xlab="PC2", ylab='PC3', main='Scatterplot PC2 vs PC3',col=wine$Cultivar) # make a scatterplot
legend('topright',legend=c("Class 1", "Class 2", "Class 3"),col=1:3, pch=19)To determine how many principal components to retain, a common approach is to examine a scree plot. This plot displays the amount of variance explained by each principal component, typically in decreasing order. By inspecting the plot, we look for an “elbow” point where the explained variance begins to level off. The components before this point capture most of the important structure in the data, while those after it contribute relatively little and can often be discarded.
There are several ways to produce a scree plot in R, the simplest
being the plot() function of the prcomp output
results.
Another option is provided by the screeplot()
function:
screeplot(wine.pca, type="lines", main='Scree plot',pch=19,cex=1.5)
abline(h=1,lty='dotted',col='red') # add a horizontal line for eigenvalue = 1A biplot is a graphical representation of PCA that displays both the
observations and the variables in the same plot. The observations are
shown as points using their scores (from the x matrix),
while the original variables are represented as arrows based on their
loadings.
This combined visualization allows us to interpret the principal components more easily. The direction and length of the arrows indicate how strongly each variable contributes to the components, while the position of the points shows similarities or differences between observations. Additionally, the angles between arrows provide insight into correlations between variables: small angles indicate strong positive correlation, angles close to 180° indicate negative correlation, and angles around 90° suggest little or no correlation.
There are several ways to create a biplot in R, some of which require functions from specific packages. In this section, we will illustrate three common approaches: the base R biplot() function, the autoplot() function (from the ggplot2 ecosystem via the ggfortify package), and the ggbiplot() function from the ggbiplot package. Each method offers different advantages in terms of customization and visualization style.
Note that if a package has not been used before, it must first be
installed using the
install.packages()function. This only needs to be done once on a given system. After installation, the package must be loaded into the R session using thelibrary()`
function each time you start R or want to use the package. Only after
loading the package will its functions be available for use.
wine$Cultivar <- as.factor(wine$Cultivar) #CUltivar is a categorical variable
# Create a biplot using ggplot2 and ggfortify
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.5.3
library(ggfortify)
#> Warning: package 'ggfortify' was built under R version 4.5.3
autoplot(wine.pca, data = wine, colour = 'Cultivar',
loadings = TRUE, loadings.label = TRUE, loadings.label.size = 4)
#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation idioms with `aes()`.
#> ℹ See also `vignette("ggplot2-in-packages")` for more information.
#> ℹ The deprecated feature was likely used in the ggfortify package.
#> Please report the issue at <https://github.com/sinhrks/ggfortify/issues>.
#> This warning is displayed once per session.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.# Create a biplot using ggbiplot
library(ggbiplot)
#> Warning: package 'ggbiplot' was built under R version 4.5.3
#>
#> Attaching package: 'ggbiplot'
#> The following object is masked _by_ '.GlobalEnv':
#>
#> wine
#> The following object is masked from 'package:ggfortify':
#>
#> ggbiplot
ggbiplot(wine.pca,groups=wine$Class)