With the flood of available information today, data sets often comprise many variables. Although more information may allow for a better overall understanding of a data set, it often leads to complications. For example, an analyst may have difficulty identifying insignificant variables, avoiding multicollinearity, capturing relationships between more than two or three variables and representing those relationships pictorially. Fortunately, Principal Component Analysis can be used to address these potential problems.
The role of Principal Component Analysis (PCA) is to address the above concerns by identifying the directions of greatest variance in the data. The strongest relationships among variables, captured by the principal components, are those with the greatest variance. The selected principal components produce a redefined set of axes that capture information more efficiently than the original variables and their axes. Relationships with little variance do not enhance the understanding of the data set, and so their exclusion results in a limited loss of information.
In essence, PCA compresses the data set through dimension (or variable) reduction, while maintaining pertinent information.
In order to visualize PCA, consider the simple data set of two variables shown below in Figure 1.
As previously mentioned, greater variance coincides with more information. The greatest variance in this data occurs along the diagonal blue line pictured in Figure 2. To confirm this, Figure 3 shows each data point mapped onto this line, revealing a well-spread data set. In fact, no other line would generate a greater spread, making this line the first principal component.
The principal components must span the entire data space, requiring the second principal component to be perpendicular to the first. The red line, pictured in Figure 4, fulfills this requirement, and Figure 5 reveals that projecting the data points onto this line results in less variance than is represented in the first principal component.
As seen in Figure 6, the two principal components are perpendicular, span the whole data set and account for the greatest amount of variance.
The principal components can now form a new set of axes that better fit the data, thus facilitating visualization of the variance among the data points. The plot below shows Figure 6 rotated so that the principal components form the new axes. If the second principal component, represented by the red line, does not account for much variation, then this data set could be reduced to a one-dimensional set, with each point mapped onto the first principal component.
Whereas the previous example relied on visualization to determine the principal components, calculations are necessary when performing real analyses. Before continuing, some terminology will be introduced.
Principal components are also referred to as eigenvectors, as they indicate a direction. Each eigenvector is paired with an eigenvalue, a measurement of the variance along the eigenvector. Eigenvectors with the greatest eigenvalues are considered significant and are labeled as principal components.
A data set of m dimensions yields m principal components, which is why the previous example, with two dimensions, required only two principal components.
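To make this concrete, the following short sketch (base R, using a made-up two-variable data set; not part of the original example) shows that a two-dimensional data set yields exactly two principal components, with the first capturing most of the variance.

# Hypothetical two-variable data set with a strong linear relationship
set.seed(1)
x <- rnorm(100)
toy <- data.frame(x = x, y = 2 * x + rnorm(100, sd = 0.5))
# prcomp returns one component per dimension: two variables, two PCs
toy_pca <- prcomp(toy, scale. = TRUE)
summary(toy_pca) # PC1 explains most of the variance, PC2 very little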
A data set from the car-review television show Top Gear will be used to go through the steps of PCA. The source code for the data set can be found here.
The following packages are required to reproduce the results in this analysis.
# Load packages
library(robustHD) # To load the TopGear data set
library(tidyverse) # For data manipulation and organization
library(grid) # For plot layouts
library(gridExtra) # For plot layouts
library(plotly) # For interactive and 3-dimensional graphs
library(gplots) # For heatmap
The `TopGear` data set contains 297 different cars and 32 attributes. For the purpose of this analysis, only `Maker`, `Model`, `MPG` and the dimensional attributes (`Weight`, `Height`, `Length` and `Width`) will be selected. `Weight` is measured in kilograms; `Height`, `Length` and `Width` are measured in millimeters. Cars with any of these values missing, or with infeasible MPG values (due to entry error), are removed. A new variable, `Car`, is created to combine `Maker` and `Model`.
# Load data set
data(TopGear)
# Create data frame with specified variables
TG_Complete <- TopGear %>%
select(Maker, Model, Weight, Length, Width, Height, MPG) %>%
na.omit() %>%
filter(MPG < 235) %>%
mutate(Car = paste(Maker, Model, sep = "_"))
# Create data frame with only numeric variables
TG_Data <- select(TG_Complete, Weight, Length, Width, Height, MPG)
# View `Car` and numeric variables
head(TG_Complete[,c(8, 3:7)])
## Car Weight Length Width Height MPG
## 1 Alfa Romeo_Giulietta 1385 4351 1798 1465 64
## 2 Alfa Romeo_MiTo 1090 4063 1720 1446 49
## 3 Aston Martin_Cygnet 988 3078 1680 1500 56
## 4 Aston Martin_V12 Zagato 1680 4385 1865 1250 17
## 5 Aston Martin_Vanquish 1739 4720 1910 1294 19
## 6 Aston Martin_Vantage 1630 4385 1865 1260 20
The first step in PCA is to standardize the data: the column mean is subtracted from each value, and the result is divided by the column standard deviation. This ensures that all attributes are on the same scale. The `scale` function is applied over the columns, and the values are stored in a matrix. The first six rows of the scaled entries are displayed below.
# Standardize variables (center and scale each column)
scaled_TG <- apply(TG_Data, 2, scale)
head(scaled_TG)
## Weight Length Width Height MPG
## [1,] -0.4046200 -0.2288006 -0.2012746 -0.3378505 1.1714914
## [2,] -1.1334710 -0.8805554 -1.0442430 -0.4618773 0.2325394
## [3,] -1.3854805 -3.1096474 -1.4765346 -0.1093799 0.6707170
## [4,] 0.3242310 -0.1518573 0.5228137 -1.7413124 -1.7705584
## [5,] 0.4700012 0.6062603 1.0091417 -1.4540923 -1.6453647
## [6,] 0.2006969 -0.1518573 0.5228137 -1.6760351 -1.5827679
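As a quick aside (a sketch, not part of the original workflow), the same result can be obtained by calling `scale` directly on the data frame, or by standardizing a column manually:

# scale() on the whole data frame produces the same values
scaled_alt <- scale(TG_Data)
all.equal(scaled_TG, scaled_alt, check.attributes = FALSE)
# Manual standardization of one column, for comparison
head((TG_Data$Weight - mean(TG_Data$Weight)) / sd(TG_Data$Weight))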
The covariance matrix is calculated next. Because the data have been standardized, this covariance matrix is identical to the correlation matrix of the original variables. `Weight`, `Length`, `Width` and `Height` are all positively related to one another. `Weight`, `Length` and `Width` are each negatively related to `MPG`, whereas `Height` and `MPG` have a positive relationship.
# Calculate and display covariance matrix
(cov_TG <- cov(scaled_TG))
## Weight Length Width Height MPG
## Weight 1.0000000 0.8651127 0.8199954 0.3258058 -0.5741026
## Length 0.8651127 1.0000000 0.8375446 0.1557690 -0.5206565
## Width 0.8199954 0.8375446 1.0000000 0.1070175 -0.5854857
## Height 0.3258058 0.1557690 0.1070175 1.0000000 0.1374961
## MPG -0.5741026 -0.5206565 -0.5854857 0.1374961 1.0000000
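Because the variables were standardized, this covariance matrix should equal the correlation matrix of the raw data. A quick sanity check:

# Covariance of standardized data equals correlation of raw data
all.equal(cov_TG, cor(TG_Data), check.attributes = FALSE)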
The eigenvectors and eigenvalues are now calculated. A nice explanation about covariance matrices and eigenvectors can be found here.
# Calculate eigenvalues and eigenvectors
eigen_TG <- eigen(cov_TG)
# View eigenvalues and eigenvectors
head(eigen_TG)
## $values
## [1] 3.1548078 1.1214118 0.4509107 0.1691783 0.1036914
##
## $vectors
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.5319255 0.15612322 0.0351903 0.3445670 0.75678006
## [2,] 0.5211818 0.02535945 0.3900024 0.4634432 -0.60070372
## [3,] 0.5199298 -0.05610290 0.2712467 -0.8080524 0.00142446
## [4,] 0.1183135 0.88766514 -0.3826166 -0.1142862 -0.19645795
## [5,] -0.4013724 0.42881874 0.7916372 -0.0219995 0.16685689
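As a sanity check (a brief sketch), each eigenvector v and its eigenvalue d should satisfy the defining relation cov_TG %*% v = d * v:

# Verify the defining property for the first eigenpair
v1 <- eigen_TG$vectors[, 1]
d1 <- eigen_TG$values[1]
all.equal(as.vector(cov_TG %*% v1), d1 * v1)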
The sign of an eigenvector is arbitrary, and here `R` has returned the eigenvectors pointing in the negative direction. This is undone by multiplying the values by -1, which makes the loadings easier to interpret. The matrix `loadings_TG` is constructed to store these loadings, with rows labeled by variable and columns by principal component.
# Multiply by -1 and extract the eigenvectors
loadings_TG <- -eigen_TG$vectors
row.names(loadings_TG) <- c("Weight", "Length", "Width", "Height", "MPG")
colnames(loadings_TG) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
loadings_TG
## PC1 PC2 PC3 PC4 PC5
## Weight -0.5319255 -0.15612322 -0.0351903 -0.3445670 -0.75678006
## Length -0.5211818 -0.02535945 -0.3900024 -0.4634432 0.60070372
## Width -0.5199298 0.05610290 -0.2712467 0.8080524 -0.00142446
## Height -0.1183135 -0.88766514 0.3826166 0.1142862 0.19645795
## MPG 0.4013724 -0.42881874 -0.7916372 0.0219995 -0.16685689
Looking at the first column shows that the first principal component is dominated mainly by `Weight`, `Length` and `Width`, with some influence attributed to `MPG`.
These results reveal that the first principal component decreases as `Weight`, `Length` and `Width` increase, yet increases as `MPG` increases. The relationships among these variables within the first principal component support the results from the covariance matrix.
The second principal component is mainly influenced by `Height`, although `MPG` contributes as well. The third principal component is dominated by `MPG`; the fourth, by `Width`; and the fifth, by `Weight` and `Length`.
It would be risky to rename PC1 after its dominant attributes, because every variable contributes to each principal component. Nevertheless, it is useful to determine which variables are the most influential.
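One way to surface the most influential variables programmatically (a brief sketch) is to take the largest absolute loading in each column:

# Variable with the largest absolute loading in each principal component
rownames(loadings_TG)[apply(abs(loadings_TG), 2, which.max)]
## [1] "Weight" "Height" "MPG"    "Width"  "Weight"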
The following heatmap gives another way to picture these values, where influence increases as the color lightens. The heatmap confirms that PC1 is largely influenced by `Weight`, `Length` and `Width`; PC2 by `Height`; PC3 by `MPG`; PC4 by `Width`; and PC5 by `Weight` and `Length`.
# Create data frame of loadings after taking absolute value
loadings_TG_Abs <- abs(loadings_TG)
# Create heatmap of loadings
loadings_heatmap <- heatmap.2(loadings_TG_Abs,
dendrogram='none',
Rowv = FALSE,
Colv = FALSE,
trace = 'none',
main = "Principal Component Values",
margins = c(5,10),
cexRow = 1,
cexCol = 1)
The principal component scores are now calculated for each car. A score is the linear combination of a car's scaled values with a column of loadings. The purpose of these scores is to map the original data onto the newly defined axes.
The PC1 and PC2 score computations for Alfa Romeo Giulietta are demonstrated below.
scaled_TG[1,]
## Weight Length Width Height MPG
## -0.4046200 -0.2288006 -0.2012746 -0.3378505 1.1714914
loadings_TG[,1:2]
## PC1 PC2
## Weight -0.5319255 -0.15612322
## Length -0.5211818 -0.02535945
## Width -0.5199298 0.05610290
## Height -0.1183135 -0.88766514
## MPG 0.4013724 -0.42881874
\[
\begin{aligned}
PC1 &= (-0.4046200)(-0.5319255) + (-0.2288006)(-0.5211818) \\
&\quad + (-0.2012746)(-0.5199298) + (-0.3378505)(-0.1183135) \\
&\quad + (1.1714914)(0.4013724) \\
&\approx 0.9492996
\end{aligned}
\]
\[
\begin{aligned}
PC2 &= (-0.4046200)(-0.15612322) + (-0.2288006)(-0.02535945) \\
&\quad + (-0.2012746)(0.05610290) + (-0.3378505)(-0.88766514) \\
&\quad + (1.1714914)(-0.42881874) \\
&\approx -0.1447786
\end{aligned}
\]
The Alfa Romeo Giulietta has a PC1 score of about 0.949 and a PC2 score of about -0.145. These values are confirmed below in the PC score matrix. Likewise, the Giulietta is located at these coordinates in the PC 1 and 2 scatter plot towards the end of the tutorial.
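For readers following along in R, the same dot products can be computed directly (a quick check):

# PC1 and PC2 scores for the first car as dot products
sum(scaled_TG[1, ] * loadings_TG[, 1]) # approximately 0.9492996
sum(scaled_TG[1, ] * loadings_TG[, 2]) # approximately -0.1447786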
# Create matrix with calculated PC scores
PC_Matrix <- as.matrix(scaled_TG) %*% loadings_TG
row.names(PC_Matrix) <- TG_Complete$Car
colnames(PC_Matrix) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
# Create data frame with PC scores
PC <- as.data.frame(PC_Matrix)
head(PC)
## PC1 PC2 PC3 PC4
## Alfa Romeo_Giulietta 0.9492996 -0.1447786 -0.8985969 0.06997496
## Alfa Romeo_MiTo 1.7527658 0.4509817 0.3057446 -0.09282942
## Aston Martin_Cygnet 3.4075066 0.0218035 1.0892155 0.72767335
## Aston Martin_V12 Zagato -0.8697803 2.2875133 0.6413883 0.14315947
## Aston Martin_Vanquish -1.5790254 1.9641735 0.2194633 0.17014539
## Aston Martin_Vantage -0.7364187 2.1683274 0.5220498 0.19731682
## PC5
## Alfa Romeo_Giulietta -0.092791117
## Alfa Romeo_MiTo 0.200782593
## Aston Martin_Cygnet -0.950771737
## Aston Martin_V12 Zagato -0.384002312
## Aston Martin_Vanquish -0.004069745
## Aston Martin_Vantage -0.309024071
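A useful property of these scores, verified below as a brief check, is that the variance of each score column equals the corresponding eigenvalue:

# Variance of each PC score column matches the eigenvalues
round(apply(PC_Matrix, 2, var), 4) # 3.1548 1.1214 0.4509 0.1692 0.1037
round(eigen_TG$values, 4)          # the same values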
In order to determine which principal components should be used, the proportion of variance explained (PVE) is calculated by dividing each eigenvalue by the sum of all eigenvalues. Principal components with the highest PVE are considered the most significant.
# Calculate explained variance
PVE <- eigen_TG$values / sum(eigen_TG$values)
round(PVE, 2)
## [1] 0.63 0.22 0.09 0.03 0.02
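The cumulative PVE, added here as a short supplement, shows how quickly the variance accumulates across the components:

# Cumulative proportion of variance explained
round(cumsum(PVE), 3)
## [1] 0.631 0.855 0.945 0.979 1.000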
Based on the above results, the first and second principal components are the most significant, together explaining 85% of the total variance. Although the third component only explains 9% of the variance, it will be plotted along with the others for another perspective.
The following code shows how the first and second principal components are plotted in Figure 8. Similar code produces Figures 9 and 10. These graphs are created with the `plotly` package, making them interactive. Note that the color bar measures `MPG`.
# Font style
f <- list(family = "Courier New, monospace", size = 18, color = "black")
# Axes Specifications
x1 <- list(title = "PC1", titlefont = list(size = 10), range = c(-5,5))
y1 <- list(title = "PC2", titlefont = list(size = 10), range = c(-5,5))
# Title Specifications
a <- list(text = "PC 1 and 2",
font = f,
xref = "paper",
yref = "paper",
yanchor = "bottom",
xanchor = "center",
align = "center",
x = 0.5,
y = 1,
showarrow = FALSE)
# Create graph
pc1_2 <- plot_ly(PC,
x = ~PC1,
y = ~PC2,
type = 'scatter',
mode = 'markers',
color = TG_Complete$MPG,
colors = "BuGn",
hoverinfo = 'text',
text = ~paste(TG_Complete$Car)) %>%
layout(annotations = a,
xaxis = x1,
yaxis = y1,
showlegend = FALSE)
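# NOTE (sketch): pc2_3 and pc1_3, used by subplot() below, come from the
# "similar code" mentioned above; a minimal assumed version is given here
# so the call can run. Panel title annotations are omitted for brevity.
x2 <- list(title = "PC2", titlefont = list(size = 10), range = c(-5,5))
y3 <- list(title = "PC3", titlefont = list(size = 10), range = c(-5,5))
pc2_3 <- plot_ly(PC, x = ~PC2, y = ~PC3, type = 'scatter', mode = 'markers',
color = TG_Complete$MPG, colors = "BuGn",
hoverinfo = 'text', text = ~paste(TG_Complete$Car)) %>%
layout(xaxis = x2, yaxis = y3, showlegend = FALSE)
pc1_3 <- plot_ly(PC, x = ~PC1, y = ~PC3, type = 'scatter', mode = 'markers',
color = TG_Complete$MPG, colors = "BuGn",
hoverinfo = 'text', text = ~paste(TG_Complete$Car)) %>%
layout(xaxis = x1, yaxis = y3, showlegend = FALSE)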
# Display plots
subplot(pc1_2, pc2_3, pc1_3, titleX = TRUE, titleY = TRUE)
Consider the first graph, "PC 1 and 2". Interpretation requires an understanding of each principal component. As previously determined, the first principal component is mainly influenced by `Weight`, `Length` and `Width`. Because these loadings are negative, each increase in x (PC1) coincides with a decrease in these attributes. The second principal component is dominated by `Height`, and its negative loading means that an increase in y (PC2) indicates a decrease in height.
The first two principal components were also influenced considerably by `MPG`. This is verified by the coloring of the data points, as there is an evident relationship between the x and y values and fuel efficiency.
Hovering over specific data points gives more insight into their attributes. The two cars furthest to the left of the graph are the Rolls Royce Phantom and the Rolls Royce Phantom Coupe. These cars tend to be boxier, so a low PC1 value is logical. On the other hand, the Peugeot 107, a rather compact car, has the greatest PC1 score, consistent with its small weight, length and width.
The upper left-hand corner of this graph contains many of the high performance cars, such as the Lamborghini Aventador, Pagani Huayra, Audi R8, Aston Martin Vanquish and Porsche 911. These cars tend to be lower to the ground and so their PC2 scores are higher. They also have poor fuel efficiencies, but why complain?
Each of the above graphs reduced the five-variable data set to two variables, allowing for graphical representations. These plots are now combined into an interactive three-dimensional graph, shown below. As with the two-dimensional plots, the color bar measures `MPG`. Together, these three principal components explain 94% of the total variance.
# Plot interactive 3D graph
plot_ly(PC,
x = ~PC1,
y = ~PC2,
z = ~PC3,
color = TG_Complete$MPG) %>%
layout(title = "Principal Components 1,2 and 3",
scene = list(xaxis = list(title = 'PC1'),
yaxis = list(title = 'PC2'),
zaxis = list(title = 'PC3')))
The `TopGear` data set includes the following performance measures: `Acceleration`, `BHP`, `MPG`, `TopSpeed` and `Torque`. Using computations similar to those used for the dimensional attributes and miles per gallon, determine the principal components for this subset of variables. The following code will get you started, and the specific steps are displayed below.
# Load data
data(TopGear)
# Create data frame with attributes
TG_DF <- TopGear %>%
select(Maker, Model, Acceleration, BHP, MPG, TopSpeed, Torque) %>%
na.omit() %>%
filter(MPG < 235) %>%
mutate(Car = paste(Maker, Model, sep = "_"))
# Create data frame with only numeric variables
TG_Performance <- select(TG_DF, Acceleration, BHP, MPG, TopSpeed, Torque)
# View first Car attribute and performance measures
head(TG_DF[, c(8, 3:7)])
## Car Acceleration BHP MPG TopSpeed Torque
## 1 Alfa Romeo_Giulietta 11.3 105 64 115 236
## 2 Alfa Romeo_MiTo 10.7 105 49 116 95
## 3 Aston Martin_Cygnet 11.8 98 56 106 92
## 4 Aston Martin_DB9 4.6 517 19 183 457
## 5 Aston Martin_DB9 Volante 4.6 517 19 183 457
## 6 Aston Martin_V12 Zagato 4.2 510 17 190 420
1. Scale the data with the `scale` function applied over the `TG_Performance` data frame.
2. Compute the covariance matrix with the `cov` function, and the eigenvalues and eigenvectors with the `eigen` function.
3. Calculate the PC scores with the `%*%` operator. Save these values into a matrix and then convert them into a data frame.
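As a hint for checking your work, the built-in `prcomp` function bundles the centering, scaling, eigendecomposition and scoring into a single call (loading signs may differ from the manual results, since eigenvector signs are arbitrary):

# prcomp performs all of the above steps at once
pca_perf <- prcomp(TG_Performance, center = TRUE, scale. = TRUE)
summary(pca_perf) # proportion of variance explained per component
pca_perf$rotation # loadings
head(pca_perf$x)  # principal component scores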
Dallas, George. “Principal Component Analysis 4 Dummies: Eigenvectors, Eigenvalues and Dimension Reduction.” George Dallas, 27 Jan. 2015, georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/.
“Principal Components Analysis.” AFIT Data Science Lab R Programming Guide, Air Force Institute of Technology, afit-r.github.io/pca.