pca-simplified-vignette

library(pcasimple)
#> 
#> Attaching package: 'pcasimple'
#> The following object is masked from 'package:stats':
#> 
#>     screeplot

Introduction

Group 2 Members

Danielle Cortez, Daniel Cho, David Tu, Angelica Rivera

Goal of pcasimple

Principal Component Analysis (PCA), a concept covered in STAT 580, lets users greatly simplify high-dimensional data without losing much information. However, the concept is complex and hard to learn, especially for students who have never seen it before. This R package is meant to simplify and streamline the PCA process to assist in teaching and learning PCA. The functions of this package translate PCA into an ordered process that is easier to follow.

Functions

  • 2D Scatter Plots
  • 3D Scatter Plots
  • PCA Values
  • Scree Plots

We intend for users to apply these functions in the order listed. First, users can use the plot2D and plot3D functions to visualize a subset of the data with its corresponding principal components to help with visual understanding. Second, users can use pcvalues to calculate the PCA loadings (components) and eigenvalues on their full data frame, to see how the variables contribute to each principal component and how the eigenvalues determine the variance explained. Finally, the user can use a scree plot to visualize the variance explained by each principal component or the cumulative proportion of variance explained by consecutive principal components.

Included Data

  • Pima Indians Diabetes Database: Available on Kaggle, this is a dataset from the National Institute of Diabetes and Digestive and Kidney Diseases, collected to help predict whether a patient has diabetes based on diagnostic measurements. Columns include Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, and Outcome.

Since Outcome is the response variable for this dataset, we will create a subset of the dataset with just the independent variables. The response variable should be removed in the same way from any dataset before performing PCA.

pima_diabetes <- pima_diabetes[,-9]

2D and 3D Principal Component Plots

Easier Method for Visualizing Principal Components

Motivation The functions plot2D() and plot3D() were created to give users a more approachable way to visualize principal components alongside a scatter plot of the data. Principal components are vectors that point in the directions of largest variability in the data. In other words, when plotted, the principal component arrows point in the directions where the data is most spread out. Visualizing the components is extremely helpful in understanding the idea of PCA. It is important to note that the principal components are ordered by the amount of variance they capture. PC1 points in the direction of greatest variance, PC2 is orthogonal to PC1 and points in the direction of second greatest variance, and so on.
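The ordering and orthogonality described above can be checked with base R. The following is a minimal sketch that uses stats::prcomp() on the built-in USArrests data as a stand-in (an assumption made so the example runs without pcasimple; it is not how plot2D() or plot3D() are implemented).

```r
# Stand-in data (assumption): built-in USArrests, four numeric columns
x <- scale(USArrests)      # standardize each column
pc <- prcomp(x)            # PCA on the standardized data

# Variance captured by each component, from PC1 to the last PC
pc_var <- pc$sdev^2
pc_var

# The component directions are mutually orthogonal unit vectors,
# so t(V) %*% V is numerically the identity matrix
ortho <- crossprod(pc$rotation)
round(ortho, 10)
```

The variances come out in decreasing order, which is what "PC1 first, PC2 second, and so on" refers to.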

Typically, multiple lines of code would be necessary to accomplish this. However, plot2D() and plot3D() can produce these visuals from a single function call, where the plot can be customized through arguments. It is important to note that the data frame must be standardized before using these functions.

2D Plots

plot2D() produces a 2D scatter plot of the data, which can be customized via its arguments.

The function plot2D() has three arguments:

  1. data: a dataframe containing two variables
  2. type: a character argument changing the orientation of the plot
    • “original”: the axes of the scatter plot are the original variables
    • “PC”: the axes of the scatter plot are the principal components
  3. arrows: a logical argument indicating whether to include the principal components as arrows on the plot

If the requirements of these arguments are not met, then an error message will print to indicate what needs to be changed.

Data

The function plot2D() requires that the data frame:

  • Is standardized
  • Contains only two variables (two columns)

pima_scale <- scale(pima_diabetes)
plot2D(pima_scale)
#> [1] "Data should contain two variables."
#> NULL
pima_subset <- subset(pima_scale, select = c("Glucose", "BloodPressure"))
plot2D(pima_subset)
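Both requirements can be checked by hand before calling plot2D(). This is a small sketch in base R, using two columns of the built-in USArrests data as a stand-in for the Pima subset (an assumption so that the example runs on its own).

```r
# Stand-in for a two-variable subset (assumption): two columns of USArrests
x_raw <- USArrests[, c("Murder", "Assault")]
x_std <- scale(x_raw)            # center each column to mean 0, scale to sd 1

col_means <- colMeans(x_std)     # should be numerically zero
col_sds   <- apply(x_std, 2, sd) # should be exactly one
n_vars    <- ncol(x_std)         # should be two for plot2D()

round(col_means, 10)
col_sds
n_vars
```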

Type

The argument type allows users to change the orientation of the scatter plot. If “original” is inputted, then the axes will be in terms of the original variables. If “PC” is inputted, then the axes will be in terms of the principal components of that subset. The default value is “original”.

plot2D(pima_subset, type = "PC")
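To see what "in terms of the principal components" means numerically, the sketch below rotates a standardized two-variable data set onto its principal components with base R. It uses USArrests as a stand-in and shows one common way to obtain such coordinates; we do not claim this is what plot2D() does internally.

```r
# Stand-in data (assumption): two standardized columns of USArrests
x  <- scale(USArrests[, c("Murder", "Assault")])
pc <- prcomp(x)

# Coordinates of every observation on the PC axes:
# project the standardized data onto the eigenvectors
scores <- x %*% pc$rotation

# prcomp() stores the same coordinates in pc$x
same <- isTRUE(all.equal(scores, pc$x, check.attributes = FALSE))

# In PC coordinates the two axes are uncorrelated
cor_pc <- cor(scores)[1, 2]
same
round(cor_pc, 10)
```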

Principal Component Arrows

Arrows representing the principal components can be included by changing the logical value of the argument arrows. If TRUE is supplied, then arrows representing the principal components will be included. If FALSE is supplied, then arrows will not be shown. The default value is FALSE.

plot2D(pima_subset, arrows = TRUE)


plot2D(pima_subset, type = "PC", arrows = TRUE)
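For comparison, here is roughly what the "multiple lines of code" look like when a similar picture is drawn by hand in base R. This is a sketch under our own assumptions (USArrests as stand-in data, arrows scaled by each component's standard deviation); the styling and scaling used by plot2D() may differ.

```r
# Stand-in data (assumption): two standardized columns of USArrests
x  <- scale(USArrests[, c("Murder", "Assault")])
pc <- prcomp(x)

# Arrow tips: each unit-length direction scaled by that component's
# standard deviation, so the longer arrow marks the larger spread
tips <- pc$rotation %*% diag(pc$sdev)

plot(x, asp = 1, pch = 16, col = "grey40",
     main = "Standardized data with principal component arrows")
arrows(0, 0, tips[1, ], tips[2, ], col = c("red", "blue"), lwd = 2)
legend("topleft", legend = c("PC1", "PC2"),
       col = c("red", "blue"), lwd = 2)
```

Setting asp = 1 keeps the two axes on the same scale, so the arrows appear at right angles.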

3D Plots

plot3D() produces a 3D interactive scatter plot of the data using the rgl package, which can be customized via its arguments.

The function plot3D() has three arguments:

  1. data: a dataframe containing three variables
  2. type: a character argument changing the orientation of the plot
    • “original”: the axes of the scatter plot are the original variables
    • “PC”: the axes of the scatter plot are the principal components
  3. arrows: a logical argument indicating whether to include the principal components as arrows on the plot

If the requirements of these arguments are not met, then an error message will print to indicate what needs to be changed.

Data

The function plot3D() requires that the data frame:

  • Is standardized
  • Contains only three variables (three columns)

However, because the vignette does not support interactive plots, only example code is shown here, without output.

pima_subset_3 <- subset(pima_scale, select = c("Glucose", "BloodPressure", "BMI"))

plot3D(pima_subset_3)

Type

The argument type allows users to change the axes of the scatter plot. If “original” is inputted, then the axes will be in terms of the original variables. If “PC” is inputted, then the axes will be in terms of the principal components of that subset. The default value is “original”.

plot3D(pima_subset_3, type = "PC")

Principal Component Arrows

Arrows representing the principal components can be included by changing the logical value of the argument arrows. If TRUE is supplied, then arrows representing the principal components will be included. If FALSE is supplied, then arrows will not be shown. The default value is FALSE.

plot3D(pima_subset_3, arrows = TRUE)
plot3D(pima_subset_3, type = "PC", arrows = TRUE)

Limitations of 2D and 3D Plots

2D Plots

  1. Data frame must be standardized and contain two variables.
  2. Data must be numeric.
  3. Because the function only plots a subset of the data, the true principal components of the entire data frame are not shown. As such, this function should only be used to learn and understand principal components.
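The third limitation can be made concrete with a short base R sketch. It compares PC1 computed from a two-variable subset with PC1 computed from the full data, using USArrests as a stand-in (an assumption; the same comparison could be run on the Pima data).

```r
# Stand-in data (assumption): built-in USArrests
x_full <- scale(USArrests)
x_sub  <- x_full[, c("Murder", "Assault")]   # the kind of subset plot2D() needs

pc_full <- prcomp(x_full)
pc_sub  <- prcomp(x_sub)

# PC1 weights of the two shared variables, full data vs. subset
w_full <- pc_full$rotation[c("Murder", "Assault"), "PC1"]
w_sub  <- pc_sub$rotation[, "PC1"]
rbind(full = w_full, subset = w_sub)
```

For two standardized variables, PC1 always weights both equally in magnitude, so the subset's components say little about the components of the full data set.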

3D Plots

  1. Data frame must be standardized and contain three variables.
  2. Data must be numeric.
  3. Because the function only plots a subset of the data, the true principal components of the entire data frame are not shown. As such, this function should only be used to learn and understand principal components.
  4. The package rgl can be incompatible with macOS and other operating systems. If the functions do not display interactive plots, it may be necessary to download XQuartz or perform other fixes.

Similar Functions from STAT 580

PCA Values

Motivation The function pcvalues() was created to give users a straightforward approach to performing principal component analysis and placing the results into a comprehensible format. These results include matrices of eigenvectors, loadings, eigenvalues, and the accumulated proportion of variance explained by the principal components.

pcvalues() produces a matrix of values used for principal component analysis which can be customized via its arguments.

The function pcvalues() has four arguments:

  1. data: a dataframe with quantitative variables
  2. scale: a logical argument indicating whether to standardize the data or not
  3. digits: an integer argument to change the decimal digits
  4. value: a character argument that changes the output of information in the matrix
    • “eigenvectors”: matrix of eigenvectors of the principal components
    • “loadings”: matrix of loadings of the principal components
    • “eigenvalues”: matrix of eigenvalues of the principal components
    • “variance”: matrix of total variance explained of the principal components

If the requirements of these arguments are not met, then an error message will print to indicate what needs to be changed.

Data and Scale

The function pcvalues() requires that the data frame is standardized. This function has a scale argument that allows users to standardize their data by setting scale = TRUE, where the default setting is set to scale = FALSE for users who have already standardized their data.
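Why standardization matters can be illustrated with stats::prcomp(), whose scale. argument plays a role comparable to scale = TRUE here. This is a sketch on the built-in USArrests data (an assumption), not a description of how pcvalues() is implemented.

```r
# Standardizing beforehand and asking prcomp() to standardize are equivalent
pc_a <- prcomp(scale(USArrests))
pc_b <- prcomp(USArrests, scale. = TRUE)
same_sdev <- isTRUE(all.equal(pc_a$sdev, pc_b$sdev))

# Without standardization, the variable with the largest raw variance
# dominates the first component
pc_raw    <- prcomp(USArrests)
share_raw <- pc_raw$sdev[1]^2 / sum(pc_raw$sdev^2)
share_std <- pc_a$sdev[1]^2 / sum(pc_a$sdev^2)

same_sdev
c(unscaled = share_raw, standardized = share_std)
```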

Digits

The digits argument changes the decimal digits of the output values.

Value

The value argument allows users to change the output of values as eigenvectors, loadings, eigenvalues, or accumulated proportion of variance explained of the principal components. A matrix of eigenvectors can be produced with the argument value = “eigenvectors”.

pcvalues(data = pima_diabetes, scale = TRUE, value = "eigenvectors")
Matrix of Eigenvectors

                             PC 1     PC 2     PC 3     PC 4     PC 5     PC 6     PC 7     PC 8
Pregnancies               -0.1284   0.5938   0.0131  -0.0807   0.4756  -0.1936   0.5888   0.1178
Glucose                   -0.3931   0.1740  -0.4679   0.4043  -0.4663  -0.0942   0.0602   0.4504
BloodPressure             -0.3600   0.1839   0.5355  -0.0560  -0.3280   0.6341   0.1921  -0.0113
SkinThickness             -0.4398  -0.3320   0.2377  -0.0380   0.4879  -0.0096  -0.2822   0.5663
Insulin                   -0.4350  -0.2508  -0.3367   0.3499   0.3469   0.2707   0.1320  -0.5486
BMI                       -0.4519  -0.1010   0.3619  -0.0536  -0.2532  -0.6854   0.0354  -0.3415
DiabetesPedigreeFunction  -0.2706  -0.1221  -0.4332  -0.8337  -0.1198   0.0858   0.0861  -0.0083
Age                       -0.1980   0.6206  -0.0752  -0.0712   0.1093   0.0334  -0.7121  -0.2117

prcomp(scale(pima_diabetes))
#> Standard deviations (1, .., p=8):
#> [1] 1.4471973 1.3157546 1.0147068 0.9356971 0.8731234 0.8262133 0.6479322
#> [8] 0.6359733
#> 
#> Rotation (n x k) = (8 x 8):
#>                                 PC1        PC2         PC3         PC4
#> Pregnancies              -0.1284321  0.5937858 -0.01308692  0.08069115
#> Glucose                  -0.3930826  0.1740291  0.46792282 -0.40432871
#> BloodPressure            -0.3600026  0.1838921 -0.53549442  0.05598649
#> SkinThickness            -0.4398243 -0.3319653 -0.23767380  0.03797608
#> Insulin                  -0.4350262 -0.2507811  0.33670893 -0.34994376
#> BMI                      -0.4519413 -0.1009598 -0.36186463  0.05364595
#> DiabetesPedigreeFunction -0.2706114 -0.1220690  0.43318905  0.83368010
#> Age                      -0.1980271  0.6205885  0.07524755  0.07120060
#>                                 PC5          PC6         PC7          PC8
#> Pregnancies              -0.4756057  0.193598168  0.58879003  0.117840984
#> Glucose                   0.4663280  0.094161756  0.06015291  0.450355256
#> BloodPressure             0.3279531 -0.634115895  0.19211793 -0.011295538
#> SkinThickness            -0.4878621  0.009589438 -0.28221253  0.566283799
#> Insulin                  -0.3469348 -0.270650609  0.13200992 -0.548621381
#> BMI                       0.2532038  0.685372179  0.03536644 -0.341517637
#> DiabetesPedigreeFunction  0.1198105 -0.085784088  0.08609107 -0.008258731
#> Age                      -0.1092900 -0.033357170 -0.71208542 -0.211661979

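We can observe that the row names are our variables and the column names are the principal components. Each column is an eigenvector of the covariance matrix of the standardized data (equivalently, the correlation matrix) and gives the direction onto which the data are projected to form that principal component. The sign of an eigenvector is arbitrary, which is why some columns above have the opposite sign in the prcomp() output while the magnitudes agree.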
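The relationship between the eigenvectors and prcomp() can be reproduced with base R. The sketch below uses the built-in USArrests data as a stand-in (an assumption) and compares stats::eigen() on the correlation matrix with the rotation matrix from prcomp(), ignoring signs.

```r
# Stand-in data (assumption): built-in USArrests, standardized
x   <- scale(USArrests)
eig <- eigen(cor(x))   # eigen-decomposition of the correlation matrix
pc  <- prcomp(x)

# Columns agree up to sign, so compare absolute values
same_up_to_sign <- isTRUE(all.equal(abs(eig$vectors),
                                    abs(unname(pc$rotation))))

# Eigenvalues equal the squared standard deviations reported by prcomp()
same_values <- isTRUE(all.equal(eig$values, pc$sdev^2))

same_up_to_sign
same_values
```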

\[\text{Loadings} = \text{eigenvector} \times \sqrt{\text{eigenvalue}}\]

The loading values can be obtained by multiplying each eigenvector by the square root of the eigenvalue of its respective principal component. This process “loads” the eigenvector with magnitude or variance, where the loading value represents the covariability of each variable with the principal component. This process is simplified with the argument value = “loadings”.
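The loading formula can be worked through by hand in base R. This sketch uses USArrests as a stand-in (an assumption) and also checks that, for standardized data, each loading equals the correlation between a variable and a principal component.

```r
# Stand-in data (assumption): built-in USArrests, standardized
x  <- scale(USArrests)
pc <- prcomp(x)

eigenvectors <- pc$rotation   # one column per principal component
eigenvalues  <- pc$sdev^2     # variance of each principal component

# Multiply every column (eigenvector) by the square root of its eigenvalue
loadings <- sweep(eigenvectors, 2, sqrt(eigenvalues), FUN = "*")
round(loadings, 4)

# For standardized data, loadings are the variable-component correlations
cors <- cor(x, pc$x)
```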

pcvalues(data = pima_diabetes, scale = TRUE, value = "loadings")
Component Matrix i.e. loadings

                             PC 1     PC 2     PC 3     PC 4     PC 5     PC 6     PC 7     PC 8
Pregnancies               -0.1859   0.7813   0.0133  -0.0755   0.4153  -0.1600   0.3815   0.0749
Glucose                   -0.5689   0.2290  -0.4748   0.3783  -0.4072  -0.0778   0.0390   0.2864
BloodPressure             -0.5210   0.2420   0.5434  -0.0524  -0.2863   0.5239   0.1245  -0.0072
SkinThickness             -0.6365  -0.4368   0.2412  -0.0355   0.4260  -0.0079  -0.1829   0.3601
Insulin                   -0.6296  -0.3300  -0.3417   0.3274   0.3029   0.2236   0.0855  -0.3489
BMI                       -0.6540  -0.1328   0.3672  -0.0502  -0.2211  -0.5663   0.0229  -0.2172
DiabetesPedigreeFunction  -0.3916  -0.1606  -0.4396  -0.7801  -0.1046   0.0709   0.0558  -0.0053
Age                       -0.2866   0.8165  -0.0764  -0.0666   0.0954   0.0276  -0.4614  -0.1346

An eigenvalue is obtained by summing the squared loadings of its respective component. These values represent the amount of variance that can be explained by each principal component. The eigenvalues can be produced with the argument value = “eigenvalues”.
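As a quick check of this rule, the sketch below sums the squared PC 1 loadings copied from the loadings output above. Because the inputs are rounded to four digits, the result matches the reported eigenvalue only up to rounding.

```r
# PC 1 loadings copied from the pcvalues() loadings output above
pc1_loadings <- c(
  Pregnancies = -0.1859, Glucose = -0.5689, BloodPressure = -0.5210,
  SkinThickness = -0.6365, Insulin = -0.6296, BMI = -0.6540,
  DiabetesPedigreeFunction = -0.3916, Age = -0.2866
)

# Eigenvalue of PC 1: sum of its squared loadings
eigenvalue_pc1 <- sum(pc1_loadings^2)
round(eigenvalue_pc1, 4)
```

The rounded result agrees with the 2.0944 shown for PC 1 in the eigenvalue output.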

pcvalues(data = pima_diabetes, scale = TRUE, value = "eigenvalues")
Eigenvalues

      Eigenvalue
PC 1      2.0944
PC 2      1.7312
PC 3      1.0296
PC 4      0.8755
PC 5      0.7623
PC 6      0.6826
PC 7      0.4198
PC 8      0.4045

However, these eigenvalues become more meaningful when each is divided by the sum of all eigenvalues. This gives the proportion of the total variance explained by each principal component. The argument value = “variance” outputs this proportion together with the cumulative proportion of variance explained.
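The ratio described above takes only a few lines of base R. This sketch starts from the eigenvalues printed earlier (rounded to four digits), so the proportions it produces should agree with the output of value = “variance” up to rounding.

```r
# Eigenvalues copied from the pcvalues() eigenvalue output above
eigenvalues <- c(2.0944, 1.7312, 1.0296, 0.8755,
                 0.7623, 0.6826, 0.4198, 0.4045)
names(eigenvalues) <- paste("PC", seq_along(eigenvalues))

# Proportion of total variance explained by each component
var_explained <- eigenvalues / sum(eigenvalues)

# Running total over consecutive components
cum_var_explained <- cumsum(var_explained)

round(cbind(var_explained, cum_var_explained), 4)
```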

pcvalues(data = pima_diabetes, scale = TRUE, value = "variance")
Variance

      Var. Explained  Cumul. Var. Explained
PC 1          0.2618                 0.2618
PC 2          0.2164                 0.4782
PC 3          0.1287                 0.6069
PC 4          0.1094                 0.7163
PC 5          0.0953                 0.8116
PC 6          0.0853                 0.8970
PC 7          0.0525                 0.9494
PC 8          0.0506                 1.0000

Scree Plots

Motivation The function screeplot() was created to give users an easy way to visualize the individual and cumulative variance explained by the principal components. Since the goal of PCA is to reduce dimensionality by choosing the smallest number of principal components that explain most of the variance in our original data, this function will aid us in this decision.

Similar to plot2D() and plot3D(), screeplot() can produce these visuals from a simple function call, where the plot can be customized through arguments. It is important to note that the data frame must be standardized, either beforehand or through the scale argument, when using this function.

screeplot()

screeplot() produces a plot, which can be customized via its arguments.

The function screeplot() has four arguments:

  1. data: a dataframe to perform principal component analysis.
  2. type: a character argument changing the type of information to be displayed on the screeplot
    • “eigenvalue”: eigenvalues of the principal components
    • “cumulative”: accumulated proportion of variance explained of the principal components
  3. varexplain: a numeric argument that changes the threshold of total variance explained
  4. scale: a logical argument indicating whether to standardize the data or not

If the requirements of these arguments are not met, then an error message will print to indicate what needs to be changed.

Data and Scale

The function screeplot() requires that the data frame is standardized. This function has a scale argument that allows users to standardize their data by setting scale = TRUE, where the default setting is set to scale = FALSE for users who have already standardized their data.

Type

The type argument allows users to change the type of scree plot to display either the eigenvalues or the accumulated proportion of variance explained by the principal components. The user can display eigenvalues by setting type = “eigenvalue” or display the accumulated proportion of variance explained by setting type = “cumulative”. The default is “cumulative” since that is more intuitively understood.

screeplot(pima_diabetes, type = "eigenvalue", scale = TRUE)

The eigenvalues shown in the plot represent the amount of variance explained by each principal component. We can observe that the first three components have the highest eigenvalues, and that the largest drop occurs between the second and third components. This “elbow” typically marks where selecting more principal components gives diminishing returns. Let’s take a look at the accumulated proportion of variance explained to see if there is anything different.

screeplot(pima_diabetes, type = "cumulative", scale = TRUE)

We can see that if we selected only the first three components, the accumulated proportion of variance explained would be only about 61%. There are many different opinions on the criteria for choosing the number of components, as it depends on the context of the research. However, this function gives users the flexibility to set a threshold for their own needs with the varexplain argument.
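Both readings of the scree plots can be reproduced from the eigenvalues printed earlier. The sketch below is a base R approximation of an eigenvalue scree plot under our own assumptions about styling; it is not the code behind screeplot().

```r
# Eigenvalues copied from the pcvalues() eigenvalue output above
eigenvalues <- c(2.0944, 1.7312, 1.0296, 0.8755,
                 0.7623, 0.6826, 0.4198, 0.4045)

# Drop in eigenvalue from each component to the next;
# the largest drop is one simple way to locate the elbow
drops <- -diff(eigenvalues)
elbow_after <- which.max(drops)   # largest drop is between PC i and PC i + 1

# Cumulative proportion of variance explained by the first three components
cum_var   <- cumsum(eigenvalues) / sum(eigenvalues)
cum_three <- cum_var[3]

# Base R scree plot of the eigenvalues
plot(seq_along(eigenvalues), eigenvalues, type = "b", pch = 16,
     xlab = "Principal component", ylab = "Eigenvalue",
     main = "Scree plot (eigenvalues)")

elbow_after
round(cum_three, 4)
```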

Varexplain

The varexplain argument is supplementary to the type = “cumulative” argument as it allows users to change the threshold of the total variance explained that is shown by a dashed, red line. This will aid in visualizing the number of principal components that are below and above the threshold. The default setting is set to 0.80, and is bounded between 0 and 1.

screeplot(pima_diabetes, varexplain = 0.90, type = "cumulative", scale = TRUE)
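The threshold logic can also be expressed directly in base R. This sketch takes the cumulative proportions from the variance output above and counts how many components are needed to reach a given threshold. The helper n_components() is a hypothetical name introduced here for illustration and is not part of pcasimple.

```r
# Cumulative proportions copied from the pcvalues() variance output above
cum_var <- c(0.2618, 0.4782, 0.6069, 0.7163,
             0.8116, 0.8970, 0.9494, 1.0000)

# Hypothetical helper: smallest number of components whose cumulative
# proportion of variance explained reaches the threshold
n_components <- function(cum_var, varexplain = 0.80) {
  stopifnot(varexplain > 0, varexplain <= 1)
  which(cum_var >= varexplain)[1]
}

n_default <- n_components(cum_var)         # default threshold of 0.80
n_ninety  <- n_components(cum_var, 0.90)   # stricter threshold of 0.90

# Cumulative scree plot with the threshold as a dashed, red line
plot(seq_along(cum_var), cum_var, type = "b", pch = 16, ylim = c(0, 1),
     xlab = "Principal component",
     ylab = "Cumulative proportion of variance explained")
abline(h = 0.90, col = "red", lty = 2)

c(threshold_0.80 = n_default, threshold_0.90 = n_ninety)
```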

After selecting components

Based on the 90 percent threshold, we would select the first six principal components and remove the last two. Note that the first six explain 89.7% of the variance, just under the threshold; meeting it strictly would require the first seven. This process reduces dimensionality by discarding the principal components that explain the least variance. It is important to note that we are not eliminating variables as a whole, but extracting the more important parts of the variables into a smaller space with fewer dimensions, as defined by the number of principal components selected.
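What "selecting components" does to the data can be sketched in base R. The example keeps the first k principal components of the built-in USArrests data (a stand-in, and k = 2 is an arbitrary choice for illustration) and shows that the reduced data still draws on every original variable.

```r
# Stand-in data (assumption): built-in USArrests, standardized
x  <- scale(USArrests)
pc <- prcomp(x)

k <- 2                                  # number of components kept (illustrative)
v_k <- pc$rotation[, 1:k]               # eigenvectors of the kept components

# Reduced data: n observations described by k scores instead of p variables
scores_k <- x %*% v_k
dim(x)
dim(scores_k)

# Every original variable has a nonzero weight in each kept component
uses_all_variables <- all(v_k != 0)

# Share of total variance retained by the k components
retained <- sum(pc$sdev[1:k]^2) / sum(pc$sdev^2)

# Best rank-k approximation of the standardized data
x_approx <- scores_k %*% t(v_k)
```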

Users can go further by using their principal components as predictors in linear regression models, also known as principal component regression. We will not cover how to perform principal component regression here, but this link will provide more information.

Principal Component Analysis is a very useful tool in reducing dimensionality but the topic can get very complex. Our goal for this package was to streamline the process to assist users in gaining a deeper understanding of the topic with simplified functions. This article is a “One-Stop Shop” for PCA that provides additional information on the topic.