2023-04-14

History and Mathematical Definition

  • Long-standing statistical method, invented by Karl Pearson in 1901
  • Defined as an orthogonal linear transformation of the data into a new basis (coordinate system) whose first axis maximizes the variance of the data
  • Each subsequent component then captures the maximum possible remaining variance

Mathematics Behind PCA

  • The first Principal Component is found by
    \[w_1 = \underset{\|w\|=1}{\arg\max} \left\{ \sum_{i} (t_1)_i^2 \right\} = \underset{\|w\|=1}{\arg\max} \left\{ \sum_{i} (x_{(i)} \cdot w)^2 \right\}\]
  • Further components are found by first removing the variance already captured by the first \(k-1\) components (deflation),
    \[\hat{X}_k = X - \sum_{s=1}^{k-1} X w_{(s)} w_{(s)}^T\]
    and then applying the same maximization to \(\hat{X}_k\) to obtain \(w_k\); a small R sketch of these steps follows below
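
To connect the formulas to code, here is a minimal R sketch (not part of the original analysis; the toy matrix X is an assumption for illustration). It computes the first loading vector as the leading eigenvector of the covariance matrix and then performs the deflation step from the equation above.

X <- scale(matrix(rnorm(200), ncol = 4), center = TRUE, scale = FALSE)  # toy data: 50 rows, 4 centered variables
eig <- eigen(cov(X))              # eigenvectors of the covariance matrix are the loading vectors
w1 <- eig$vectors[, 1]            # first loading vector, with ||w1|| = 1
t1 <- X %*% w1                    # scores on the first principal component
X_hat2 <- X - X %*% w1 %*% t(w1)  # deflated matrix used to find w2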

Uses of Principal Component Analysis

  • Used to find clusters or patterns in high-dimensional data sets (many variables) by projecting the data into a lower-dimensional space with minimal loss of information
  • Useful for data visualization, because PCA can reduce data down to two or three dimensions
  • It can be applied to any data set with many variables, but it has proven especially useful in population genetics, microbiome studies, and atmospheric science

Example Using a Real Data Set

df1 <- seeds_dataset
df1 <- na.omit(df1)

After importing the seeds data set from the UC Irvine Machine Learning Repository, the data is saved under an easy-to-use variable name. Next, the data is cleaned by removing any rows that contain missing values.
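
For readers who want to reproduce the import, one way is to pull the file directly from the repository, as sketched below; the exact URL is an assumption based on the repository's usual layout, and the original import may have been done differently.

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt"  # assumed location
seeds_dataset <- read.table(url, header = FALSE)  # whitespace-separated, no header row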

PCA <- prcomp(df1[, 1:7], scale. = TRUE)  # scale each variable to unit variance
df1 <- cbind(df1, PCA$x[, 1:3])           # append the first three PC scores

The PCA is then carried out on the data frame with a single command using the function prcomp(). Following that, the function cbind() is used to bind the first three principal components onto the data frame as new column variables.
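
A quick way to inspect the result (a supplementary sketch, not part of the original code):

summary(PCA)        # proportion of variance explained by each component
head(PCA$x[, 1:3])  # scores on the first three principal components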

Adjusting the Variables

library(dplyr)  # provides rename()
df1 <- rename(df1, Area = V1, Perimeter = V2, Compactness = V3,
              KernelLength = V4, KernelWidth = V5, Asymmetry = V6,
              GrooveLength = V7, Type = V8)
df1$Type[which(df1$Type == 1)] <- 'Type 1'
df1$Type[which(df1$Type == 2)] <- 'Type 2'
df1$Type[which(df1$Type == 3)] <- 'Type 3'
df1$Type <- as.factor(df1$Type)
  • The data set variables are renamed in accordance with the attribute information on the repository website
  • Variable 8 (“V8”), now called “Type”, identifies the three varieties of wheat seed used in this data set
  • “Type” is converted to a factor so that it can be used as a categorical grouping variable when creating graphs later

Plot of Variance between Principal Components

- The variance of a sample is defined as \(s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x})^2\) and is a measure of dispersion that shows the spread of a data set
- From this graph you can see that the majority of the spread in the data (the variance) is explained by just the first two principal components; a sketch of how such a plot can be drawn follows below
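
The original figure is not reproduced here. One way to draw it (ggplot2 is an assumption; the original may have used a different plotting approach):

library(ggplot2)
var_explained <- PCA$sdev^2 / sum(PCA$sdev^2)  # share of total variance per component
scree <- data.frame(PC = factor(1:7), Variance = var_explained)
ggplot(scree, aes(x = PC, y = Variance)) +
  geom_col() +
  labs(x = "Principal Component", y = "Proportion of Variance Explained")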

3D Plot of the First Three Principal Components
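
The 3D figure is not reproduced here. A sketch of how such a plot can be made (plotly is an assumption; the original may have used another package):

library(plotly)
plot_ly(df1, x = ~PC1, y = ~PC2, z = ~PC3, color = ~Type,
        type = "scatter3d", mode = "markers")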

Scatter Plot of PC2 by PC1

This plot helps to illustrate the power of PCA. Almost 90% of the spread in this data set lies in PC1 and PC2, so this single 2D graph shows almost all of the variation in the data and how cleanly the seven variables cluster to separate the three types of wheat seed.
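
A sketch of one way to produce this plot (the original plotting code is not shown in the text):

ggplot(df1, aes(x = PC1, y = PC2, color = Type)) +
  geom_point() +
  labs(title = "PC2 by PC1")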

Faceting and Density Plots of PC1 and PC2

Two more styles of graph help to visualize the spread and separation of the types, and also show where there is some overlap.
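
Sketches of the two graph styles (the faceting variable and density settings are assumptions; the original code is not shown):

ggplot(df1, aes(x = PC1, y = PC2, color = Type)) +
  geom_point() +
  facet_wrap(~ Type)
ggplot(df1, aes(x = PC1, fill = Type)) +
  geom_density(alpha = 0.5)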

Comparison of Other Principal Components

- These two graphs again show the power of PCA and how the spread really is concentrated in the first two principal components (sketches of both plots follow below)
- In the first graph, comparing PC1 and PC3, the clustering is still visible but there is far more overlap
- In the second graph, comparing PC2 and PC3, the overlap is so great that the data points no longer separate the three types well
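
Sketches of the two comparison plots (the original plotting code is not shown in the text):

ggplot(df1, aes(x = PC1, y = PC3, color = Type)) + geom_point()  # PC1 vs. PC3: clusters with more overlap
ggplot(df1, aes(x = PC2, y = PC3, color = Type)) + geom_point()  # PC2 vs. PC3: heavy overlap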