Introduction

Motivation

With the flood of information available today, data sets often comprise many variables. Although more information may allow for a better overall understanding of a data set, it often leads to complications. For example, an analyst may have difficulty identifying insignificant variables, avoiding multicollinearity, capturing relationships among more than two or three variables, and representing those relationships pictorially. Fortunately, Principal Component Analysis can be used to address these potential problems.

Principal Component Analysis

The role of Principal Component Analysis (PCA) is to address the above concerns by

  1. Determining the strongest relationships among variables in a data set
  2. Removing variables that add little information to the structure of the data
  3. Modifying the data set based on the strongest relationships

The strongest patterns in a data set, called principal components, are the directions along which the variance is greatest. The selected principal components produce a redefined set of axes that capture information more efficiently than the original variables and their axes. Directions with little variance do not enhance the understanding of the data set, so their exclusion results in a limited loss of information.

In essence, PCA compresses the data set through dimension (or variable) reduction, while maintaining pertinent information.

A Graphical Explanation

In order to visualize PCA, consider the simple data set of two variables shown below in Figure 1.

As previously mentioned, greater variance coincides with more information. The greatest variance in this data occurs along the diagonal blue line pictured in Figure 2. To confirm this, Figure 3 shows each data point projected onto this line, revealing a wide spread of points. In fact, no other line would generate a greater spread, making this line the first principal component.

The principal components must span the entire data space, requiring the second principal component to be perpendicular to the first. The red line, pictured in Figure 4, fulfills this requirement, and Figure 5 reveals that projecting the data points onto this line results in less variance than is represented in the first principal component.

As seen in Figure 6, the two principal components are perpendicular, span the whole data set and account for the greatest amount of variance.

The principal components can now form a new set of axes that better fit the data, thus facilitating visualization of the variance among the data points. The plot below shows Figure 6 rotated so that the principal components form the new axes. If the second principal component, represented by the red line, does not account for much variation, then this data set could be reduced to a one-dimensional set in which each point is mapped onto the first principal component.
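This rotation can be reproduced in R. The sketch below simulates a two-variable data set of the kind pictured above (the data are invented purely for illustration, not the data behind the figures) and re-plots the points on the axes defined by their principal components.

# Simulated two-variable data set (illustration only)
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.4)   # strong diagonal relationship, as in Figure 2
toy <- scale(cbind(x, y))       # standardize both variables

# The eigenvectors of the covariance matrix are the two principal components
eig_toy <- eigen(cov(toy))
toy_rotated <- toy %*% eig_toy$vectors   # map each point onto the new axes

plot(toy_rotated, asp = 1, xlab = "PC1", ylab = "PC2")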

Terminology

Whereas the previous example relied on visualization to determine the principal components, calculations are necessary when performing real analyses. Before continuing, some terminology will be introduced.

Principal components are also referred to as eigenvectors, as they indicate a direction. Each eigenvector is paired with an eigenvalue, which measures the variance along that eigenvector. The eigenvectors with the greatest eigenvalues are considered significant and are labeled as principal components.

A data set of m dimensions has m principal components, which is why the previous example, with two dimensions, required only two principal components.
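As a quick illustration of this pairing, R's eigen function returns both pieces at once. For a small symmetric matrix (chosen arbitrarily here), each column of $vectors satisfies the defining relationship with the matching entry of $values.

# Arbitrary 2x2 symmetric matrix, for illustration only
A <- matrix(c(2, 1, 1, 2), nrow = 2)
eig_A <- eigen(A)
eig_A$values    # eigenvalues, in decreasing order
eig_A$vectors   # matching eigenvectors, one per column

# Defining property: multiplying by A scales an eigenvector by its eigenvalue
A %*% eig_A$vectors[, 1]               # same as ...
eig_A$values[1] * eig_A$vectors[, 1]   # ... this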

Top Gear Data Set

A data set from the car-review television show, Top Gear, will be used to go through the steps for PCA. The data set is provided by the robustHD package, loaded below.

Required Packages

The following packages are required to reproduce the results in this analysis.

# Load packages
library(robustHD) # To load the TopGear data set
library(tidyverse) # For data manipulation and organization
library(grid) # For plot layouts 
library(gridExtra) # For plot layouts
library(plotly) # For interactive and 3-dimensional graphs
library(gplots) # For heatmap

Data Preparation

The TopGear data set contains 297 different cars and 32 attributes. For the purpose of this analysis, only Maker, Model, MPG and dimensional attributes (Weight, Height, Length and Width) will be selected. Weight is measured in kilograms and Height, Length and Width are measured in millimeters. Cars with any of these values missing or with implausible MPG values (due to entry errors) are removed. A new variable, Car, is created in order to combine Maker and Model.

# Load data set
data(TopGear)

# Create data frame with specified variables
TG_Complete <- TopGear %>%
  select(Maker, Model, Weight, Length, Width, Height, MPG) %>%
  na.omit() %>%
  filter(MPG < 235) %>%
  mutate(Car = paste(Maker, Model, sep = "_"))

# Create data frame with only numeric variables
TG_Data <- select(TG_Complete, Weight, Length, Width, Height, MPG)

# View `Car` and numeric variables
head(TG_Complete[,c(8, 3:7)])
##                       Car Weight Length Width Height MPG
## 1    Alfa Romeo_Giulietta   1385   4351  1798   1465  64
## 2         Alfa Romeo_MiTo   1090   4063  1720   1446  49
## 3     Aston Martin_Cygnet    988   3078  1680   1500  56
## 4 Aston Martin_V12 Zagato   1680   4385  1865   1250  17
## 5   Aston Martin_Vanquish   1739   4720  1910   1294  19
## 6    Aston Martin_Vantage   1630   4385  1865   1260  20

Step 1. Standardize

The first step in PCA is to standardize the data by subtracting each column's mean from its values and dividing by the column's standard deviation. This ensures that all attributes are on the same scale. The scale function is applied over the columns and the result is stored in a matrix. The first six rows of the scaled entries are displayed below.
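For each column, the scale function computes the familiar z-score, where each value is centered at the column mean and divided by the column standard deviation:

\[ z_i = \frac{x_i - \bar{x}}{s_x} \]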

# Create data frame with centered variables
scaled_TG <- apply(TG_Data, 2, scale)
head(scaled_TG)
##          Weight     Length      Width     Height        MPG
## [1,] -0.4046200 -0.2288006 -0.2012746 -0.3378505  1.1714914
## [2,] -1.1334710 -0.8805554 -1.0442430 -0.4618773  0.2325394
## [3,] -1.3854805 -3.1096474 -1.4765346 -0.1093799  0.6707170
## [4,]  0.3242310 -0.1518573  0.5228137 -1.7413124 -1.7705584
## [5,]  0.4700012  0.6062603  1.0091417 -1.4540923 -1.6453647
## [6,]  0.2006969 -0.1518573  0.5228137 -1.6760351 -1.5827679

Step 2. Construct Covariance Matrix and Compute Eigenvectors and Eigenvalues

Covariance Matrix

The signs of the values in the covariance matrix indicate whether paired attributes are positively related, negatively related or unrelated. Because the data were standardized, this covariance matrix is also the correlation matrix, which is why its diagonal entries are all 1. The results below show that every paired combination of Weight, Length, Width and Height is positively related. Weight, Length and Width are each negatively related to MPG, whereas Height and MPG have a positive relationship.
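For reference, the sample covariance of two variables x and y is

\[ \operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]

so its sign records whether the two variables tend to move together or in opposite directions.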

# Calculate and display covariance matrix
(cov_TG <- cov(scaled_TG))
##            Weight     Length      Width    Height        MPG
## Weight  1.0000000  0.8651127  0.8199954 0.3258058 -0.5741026
## Length  0.8651127  1.0000000  0.8375446 0.1557690 -0.5206565
## Width   0.8199954  0.8375446  1.0000000 0.1070175 -0.5854857
## Height  0.3258058  0.1557690  0.1070175 1.0000000  0.1374961
## MPG    -0.5741026 -0.5206565 -0.5854857 0.1374961  1.0000000

Eigenvectors and Eigenvalues

The eigenvectors and eigenvalues are now calculated. A nice explanation of covariance matrices and eigenvectors can be found in the references at the end of this tutorial.

# Calculate eigenvalues and eigenvectors
eigen_TG <- eigen(cov_TG)
# View the eigenvalues and eigenvectors
head(eigen_TG)
## $values
## [1] 3.1548078 1.1214118 0.4509107 0.1691783 0.1036914
## 
## $vectors
##            [,1]        [,2]       [,3]       [,4]        [,5]
## [1,]  0.5319255  0.15612322  0.0351903  0.3445670  0.75678006
## [2,]  0.5211818  0.02535945  0.3900024  0.4634432 -0.60070372
## [3,]  0.5199298 -0.05610290  0.2712467 -0.8080524  0.00142446
## [4,]  0.1183135  0.88766514 -0.3826166 -0.1142862 -0.19645795
## [5,] -0.4013724  0.42881874  0.7916372 -0.0219995  0.16685689

Step 3. Determine Principal Components

Loadings

The signs of the eigenvectors returned by eigen are arbitrary; here they happen to point in the negative direction, so they are multiplied by -1 to make the loadings easier to interpret. The matrix loadings_TG is constructed to store these loadings, with the variables as row names and the principal components as column names.

# Multiply by -1 and extract the eigenvectors 
loadings_TG <- -eigen_TG$vectors
row.names(loadings_TG) <- c("Weight", "Length", "Width", "Height", "MPG")
colnames(loadings_TG) <- c("PC1", "PC2", "PC3", "PC4", "PC5")
loadings_TG
##               PC1         PC2        PC3        PC4         PC5
## Weight -0.5319255 -0.15612322 -0.0351903 -0.3445670 -0.75678006
## Length -0.5211818 -0.02535945 -0.3900024 -0.4634432  0.60070372
## Width  -0.5199298  0.05610290 -0.2712467  0.8080524 -0.00142446
## Height -0.1183135 -0.88766514  0.3826166  0.1142862  0.19645795
## MPG     0.4013724 -0.42881874 -0.7916372  0.0219995 -0.16685689

Looking at the first column shows that the first principal component is dominated mainly by Weight, Length and Width, with some influence attributed to MPG.

These results reveal that the first principal component decreases as Weight, Length and Width increase, yet increases as MPG increases. The relationships among these variables within the first principal component support the results from the covariance matrix.

The second principal component is mainly influenced by Height, although MPG seems to have some contribution as well. The third principal component is dominated by MPG; the fourth, by Width; and the fifth, by Weight and Length.

It would be misleading to relabel PC1 with these attributes alone, because every attribute plays some role in the principal component. Nevertheless, it is useful to determine which variables are the most influential.

The following heatmap gives another way to picture these values, where influence increases as the color lightens. The heatmap confirms that PC1 is largely influenced by Weight, Length and Width; PC2 by Height; PC3 by MPG; PC4 by Width; and PC5 by Weight and Length.

# Create data frame of loadings after taking absolute value
loadings_TG_Abs <- abs(loadings_TG)
# Create heatmap of loadings
loadings_heatmap <- heatmap.2(loadings_TG_Abs, 
                              dendrogram='none', 
                              Rowv = FALSE, 
                              Colv = FALSE, 
                              trace = 'none',
                              main = "Principal Component Values",
                              margins = c(5,10),
                              cexRow = 1, 
                              cexCol = 1)

Principal Component Scores

The principal component scores are now calculated for each car. Scores are computed by forming a linear combination of the scaled values and the loadings. The purpose of these scores is to map the original data set onto the newly defined axes.
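In matrix form, if Z denotes the n × m matrix of standardized values and V the m × m matrix of loadings, the full score matrix is

\[ \text{Scores} = ZV \]

which is exactly the computation performed further below with the %*% operator.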

The PC1 and PC2 score computations for Alfa Romeo Giulietta are demonstrated below.

scaled_TG[1,]
##     Weight     Length      Width     Height        MPG 
## -0.4046200 -0.2288006 -0.2012746 -0.3378505  1.1714914
loadings_TG[,1:2]
##               PC1         PC2
## Weight -0.5319255 -0.15612322
## Length -0.5211818 -0.02535945
## Width  -0.5199298  0.05610290
## Height -0.1183135 -0.88766514
## MPG     0.4013724 -0.42881874

\[ \begin{aligned} PC1 &= (-0.4046200)(-0.5319255) \\ &\quad + (-0.2288006)(-0.5211818) \\ &\quad + (-0.2012746)(-0.5199298) \\ &\quad + (-0.3378505)(-0.1183135) \\ &\quad + (1.1714914)(0.4013724) \\ &\approx 0.9492996 \end{aligned} \]


\[ \begin{aligned} PC2 &= (-0.4046200)(-0.15612322) \\ &\quad + (-0.2288006)(-0.02535945) \\ &\quad + (-0.2012746)(0.05610290) \\ &\quad + (-0.3378505)(-0.88766514) \\ &\quad + (1.1714914)(-0.42881874) \\ &\approx -0.1447786 \end{aligned} \]

The Alfa Romeo Giulietta has a PC1 score of about 0.949 and a PC2 score of about -0.145. These values are confirmed below in the PC score matrix. Likewise, the Giulietta is located at these coordinates in the PC 1 and 2 scatter plot towards the end of the tutorial.

# Create matrix with calculated PC scores
PC_Matrix <- as.matrix(scaled_TG) %*% loadings_TG
row.names(PC_Matrix) <- TG_Complete$Car
colnames(PC_Matrix) <- c("PC1", "PC2", "PC3", "PC4", "PC5")

# Create data frame with PC scores
PC <- as.data.frame(PC_Matrix)
head(PC)
##                                PC1        PC2        PC3         PC4
## Alfa Romeo_Giulietta     0.9492996 -0.1447786 -0.8985969  0.06997496
## Alfa Romeo_MiTo          1.7527658  0.4509817  0.3057446 -0.09282942
## Aston Martin_Cygnet      3.4075066  0.0218035  1.0892155  0.72767335
## Aston Martin_V12 Zagato -0.8697803  2.2875133  0.6413883  0.14315947
## Aston Martin_Vanquish   -1.5790254  1.9641735  0.2194633  0.17014539
## Aston Martin_Vantage    -0.7364187  2.1683274  0.5220498  0.19731682
##                                  PC5
## Alfa Romeo_Giulietta    -0.092791117
## Alfa Romeo_MiTo          0.200782593
## Aston Martin_Cygnet     -0.950771737
## Aston Martin_V12 Zagato -0.384002312
## Aston Martin_Vanquish   -0.004069745
## Aston Martin_Vantage    -0.309024071
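As an optional sanity check (not part of the tutorial's step-by-step workflow), R's built-in prcomp function reproduces these scores in a single call. Note that the signs of its loadings may be flipped relative to loadings_TG, since the sign of an eigenvector is arbitrary.

# Optional cross-check with R's built-in PCA
pc_check <- prcomp(TG_Data, scale. = TRUE)
head(pc_check$x)    # scores; should match PC above up to sign
pc_check$rotation   # loadings; should match loadings_TG up to sign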

Explained Variance

In order to determine which principal components should be used, the explained variance is calculated. Principal components with the highest explained variance are then considered to be significant.
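Concretely, the proportion of variance explained (PVE) by the ith principal component is its eigenvalue divided by the sum of all the eigenvalues:

\[ PVE_i = \frac{\lambda_i}{\sum_{j=1}^{m} \lambda_j} \]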

# Calculate explained variance
PVE <- eigen_TG$values / sum(eigen_TG$values)
round(PVE, 2)
## [1] 0.63 0.22 0.09 0.03 0.02

Based on the above results, the first and second principal components are the most significant, together explaining 85% of the total variance. Although the third component only explains 9% of the variance, it will be plotted along with the others for another perspective.

The following code shows how the first and second principal components are plotted in Figure 8. Similar code produces Figures 9 and 10; a sketch of those panels is included in the code below. These graphs are created with the plotly package, making them interactive. Note that the color bar measures MPG.

Plots

# Font style
f <- list(family = "Courier New, monospace", size = 18, color = "black")

# Axes Specifications 
x1 <- list(title = "PC1", titlefont = list(size = 10), range = c(-5,5))
y1 <- list(title = "PC2", titlefont = list(size = 10), range = c(-5,5))

# Title Specifications
a <- list(text = "PC 1 and 2",
          font = f,
          xref = "paper",
          yref = "paper",
          yanchor = "bottom",
          xanchor = "center",
          align = "center",
          x = 0.5,
          y = 1,
          showarrow = FALSE)

# Create graph
pc1_2 <- plot_ly(PC, 
                 x = ~PC1, 
                 y = ~PC2, 
                 type = 'scatter', 
                 mode = 'markers', 
                 color = TG_Complete$MPG, 
                 colors = "BuGn", 
                 hoverinfo = 'text', 
                 text = ~paste(TG_Complete$Car)) %>%
                layout(annotations = a, 
                       xaxis = x1, 
                       yaxis = y1, 
                       showlegend = FALSE)
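# The remaining two panels are built the same way (a sketch; only the plotted
# components and axis titles change, and the panel-title annotations are
# omitted here for brevity)
x2 <- list(title = "PC2", titlefont = list(size = 10), range = c(-5,5))
y2 <- list(title = "PC3", titlefont = list(size = 10), range = c(-5,5))
pc2_3 <- plot_ly(PC, x = ~PC2, y = ~PC3, type = 'scatter', mode = 'markers',
                 color = TG_Complete$MPG, colors = "BuGn",
                 hoverinfo = 'text', text = ~paste(TG_Complete$Car)) %>%
         layout(xaxis = x2, yaxis = y2, showlegend = FALSE)
pc1_3 <- plot_ly(PC, x = ~PC1, y = ~PC3, type = 'scatter', mode = 'markers',
                 color = TG_Complete$MPG, colors = "BuGn",
                 hoverinfo = 'text', text = ~paste(TG_Complete$Car)) %>%
         layout(xaxis = x1, yaxis = y2, showlegend = FALSE)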
# Display plots
subplot(pc1_2, pc2_3, pc1_3, titleX = TRUE, titleY = TRUE)

PC1 and 2 Interpretation

Consider the first graph, “PC 1 and 2”. Interpretation requires an understanding of each principal component. As previously determined, the first principal component is mainly influenced by Weight, Length and Width. Because these loadings are negative, each increase in x (PC1) coincides with a decrease in these attributes. The second principal component is dominated by Height, also with a negative loading, so an increase in y (PC2) indicates a decrease in Height.

The first two principal components were also influenced considerably by MPG. This is verified by the colorings of the data points, as there is an evident relationship between the x and y values and the fuel efficiency.

Hovering over specific data points gives more insight into their attributes. The two cars furthest to the left of the graph are the Rolls Royce Phantom and the Rolls Royce Phantom Coupe. These cars tend to be boxier, so a low PC1 score is logical. On the other hand, the Peugeot 107, a rather compact car, has the greatest PC1 score, indicating a small weight, length and width.

The upper left-hand corner of this graph contains many of the high performance cars, such as the Lamborghini Aventador, Pagani Huayra, Audi R8, Aston Martin Vanquish and Porsche 911. These cars tend to be lower to the ground and so their PC2 scores are higher. They also have poor fuel efficiencies, but why complain?

3D Graph

Each of the above graphs reduced the five-variable data set to two dimensions, allowing for graphical representations. These plots are now combined into an interactive three-dimensional graph, shown below. As with the two-dimensional plots, the color bar measures MPG. Together, these three principal components explain 94% of the total variance.

# Plot interactive 3D graph
plot_ly(PC, 
        x = ~PC1, 
        y = ~PC2, 
        z = ~PC3, 
        color = TG_Complete$MPG) %>%
  layout(title = "Principal Components 1,2 and 3",
         scene = list(xaxis = list(title = 'PC1'),
                      yaxis = list(title = 'PC2'),
                      zaxis = list(title = 'PC3')))

Exercises

The TopGear data set includes the following performance measures: Acceleration, BHP, MPG, TopSpeed and Torque. Using computations similar to those used for the dimensional attributes and miles per gallon, determine the principal components for this subset of variables. The following code will get you started; the specific steps are listed below it.

# Load data
data(TopGear)

# Create data frame with attributes
TG_DF <- TopGear %>%
  select(Maker, Model, Acceleration, BHP, MPG, TopSpeed, Torque) %>%
  na.omit() %>%
  filter(MPG < 235) %>%
  mutate(Car = paste(Maker, Model, sep = "_"))

# Create data frame with only numeric variables
TG_Performance <- select(TG_DF, Acceleration, BHP, MPG, TopSpeed, Torque)

# View first Car attribute and performance measures
head(TG_DF[, c(8, 3:7)])
##                        Car Acceleration BHP MPG TopSpeed Torque
## 1     Alfa Romeo_Giulietta         11.3 105  64      115    236
## 2          Alfa Romeo_MiTo         10.7 105  49      116     95
## 3      Aston Martin_Cygnet         11.8  98  56      106     92
## 4         Aston Martin_DB9          4.6 517  19      183    457
## 5 Aston Martin_DB9 Volante          4.6 517  19      183    457
## 6  Aston Martin_V12 Zagato          4.2 510  17      190    420

  1. Copy the code above to get started.
  2. Standardize the data using the scale function applied over the TG_Performance data frame.
  3. Compute the covariance matrix with the cov function and the eigenvectors and eigenvalues with the eigen function.
  4. Extract the eigenvectors. Remember to multiply the values by -1.
  5. Calculate the principal component scores using the scaled values, the loadings and the %*% operator. Save these values into a matrix and then convert them into a data frame.
  6. Calculate the explained variance.
  7. Determine the most significant principal components and plot them.

References

  • Dallas, George. “Principal Component Analysis 4 Dummies: Eigenvectors, Eigenvalues and Dimension Reduction.” George Dallas, 27 Jan. 2015, georgemdallas.wordpress.com/2013/10/30/principal-component-analysis-4-dummies-eigenvectors-eigenvalues-and-dimension-reduction/.

  • “Principal Components Analysis.” Principal Components Analysis · AFIT Data Science Lab R Programming Guide, Air Force Institute of Technology, afit-r.github.io/pca.