library(pcasimple)
#>
#> Attaching package: 'pcasimple'
#> The following object is masked from 'package:stats':
#>
#> screeplot

Danielle Cortez, Daniel Cho, David Tu, Angelica Rivera
Principal Component Analysis (PCA) is an extremely useful concept covered in STAT 580 that allows users to greatly simplify high-dimensional data without losing much information. However, the concept is complex and hard to learn, especially for students who have never seen it before. This R package is meant to simplify and streamline the PCA process to assist in the teaching and learning of PCA. The functions of this package translate PCA into an ordered process that is easier to follow.
We intend for users to work through these functions in the order presented below.
First, users can use the plot2D and plot3D
functions to visualize a subset of the data with its corresponding
principal components to help with visual understanding. Second, users
can use pcvalues to calculate the PCA loadings (components)
and eigenvalues on their full data frame to show how the data frame
affects principal components and how the eigenvalues can help determine
variance explained. Finally, the user can use a scree plot to visualize
the variance explained by each principal component or the total
proportion of variance explained by consecutive principal
components.
Since Outcome is the response variable for this dataset, we will create a subset of the dataset with just the independent variables. This same process should be done for any dataset before performing PCA.
pima_diabetes <- pima_diabetes[,-9]

Easier Method for Visualizing Principal Components
Motivation
The functions plot2D() and plot3D() were created to give users a more approachable way to visualize principal components alongside a scatter plot of the data. Principal components are vectors that point in the directions of largest variability in the data. In other words, when plotted, the principal component arrows point in the directions where the data is most spread out. Visualizing the components is extremely helpful in understanding the idea of PCA. It is important to note that the principal components are ordered by the amount of variance they explain. PC1 points in the direction of greatest variance, PC2 is the principal component orthogonal to PC1 that points in the direction of second-greatest variance, and so on.
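
The ordering and orthogonality described above can be checked numerically. The following is a minimal sketch in base R, assuming the built-in USArrests data as a stand-in for the package data and using prcomp() from the stats package rather than any pcasimple function.

```r
# Stand-in data: two standardized columns of the built-in USArrests dataset
x <- scale(USArrests[, c("Murder", "Assault")])

pca <- prcomp(x)

# Columns of the rotation matrix are the principal component directions
pc1 <- pca$rotation[, 1]
pc2 <- pca$rotation[, 2]

# Orthogonality: the dot product of PC1 and PC2 is (numerically) zero
dot_product <- sum(pc1 * pc2)

# Ordering: the variance along PC1 is at least as large as along PC2
pc_variances <- pca$sdev^2

print(round(dot_product, 10))
print(pc_variances)
```

With two standardized variables the two variances add up to 2, the total variance in the data.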
Typically, multiple lines of code would be necessary to accomplish
this. However, plot2D() and plot3D() can
produce these visuals from a function call, where the plot can be
customized through arguments. It is important to note that the
dataframe must be standardized before using these
functions.
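
To give a sense of the "multiple lines of code" a manual version involves, here is a rough base R sketch of a scatter plot with principal component arrows. It is not the package's implementation: the stand-in data (USArrests), the arrow scaling, and the colors are all assumptions made for illustration.

```r
# Standardize a two-variable stand-in dataset first
x <- scale(USArrests[, c("Murder", "Assault")])
pca <- prcomp(x)

# Scale each unit-length direction by the standard deviation along it,
# so that the longer arrow is the direction of greater spread
arrow_ends <- sweep(pca$rotation, 2, pca$sdev, "*")

# Draw to a null device so the sketch runs non-interactively;
# remove pdf(NULL) and dev.off() to see the plot in an interactive session
pdf(NULL)
plot(x, asp = 1, pch = 19, col = "grey40",
     xlab = "Murder (standardized)", ylab = "Assault (standardized)")
arrows(0, 0, arrow_ends[1, ], arrow_ends[2, ],
       col = c("red", "blue"), lwd = 2)
dev.off()
```

Setting asp = 1 keeps the two axes on the same scale, so the arrows stay visibly perpendicular.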
plot2D() produces a 2D scatter plot of the data, which
can be customized via its arguments.
The function plot2D() has three arguments:
- a data frame containing two variables
- type: character argument changing the orientation of the plot
- arrows: logical argument indicating whether to include the principal components as arrows on the plot

If the requirements of these arguments are not met, then an error message will print to indicate what needs to be changed.
The function plot2D() requires that the data frame is standardized and contains two variables:
pima_scale <- scale(pima_diabetes)
plot2D(pima_scale)
#> [1] "Data should contain two variables."
#> NULL
pima_subset <- subset(pima_scale, select = c("Glucose", "BloodPressure"))
plot2D(pima_subset)

The argument type allows users to change the orientation of the scatter plot. If "original" is supplied, then the axes will be in terms of the original variables. If "PC" is supplied, then the axes will be in terms of the principal components of that subset. The default value is "original".
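
Plotting "in terms of the principal components" amounts to re-expressing each observation in the coordinate system defined by the components. A small base R sketch of that change of axes, assuming USArrests as stand-in data and prcomp() in place of the package's internals:

```r
x <- scale(USArrests[, c("Murder", "Assault")])
pca <- prcomp(x)

# Coordinates of each observation on the principal component axes
scores <- x %*% pca$rotation

# These match the scores that prcomp() stores in pca$x
max_diff <- max(abs(scores - pca$x))

# In principal component coordinates the two axes are uncorrelated
pc_correlation <- cor(scores[, 1], scores[, 2])

print(max_diff)
print(round(pc_correlation, 10))
```

The point cloud itself is unchanged; only the axes it is described by are rotated.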
plot2D(pima_subset, type = "PC")

Arrows representing the principal components can be included by changing the logical value of the argument arrows. If TRUE is supplied, then arrows representing the principal components will be included. If FALSE is supplied, then arrows will not be shown. The default value is FALSE.
plot2D(pima_subset, arrows = TRUE)
plot2D(pima_subset, type = "PC", arrows = TRUE)

plot3D() produces a 3D interactive scatter plot of the data using the rgl package, which can be customized via its arguments.
The function plot3D() has three arguments:
- a data frame containing three variables
- type: character argument changing the orientation of the plot
- arrows: logical argument indicating whether to include the principal components as arrows on the plot

If the requirements of these arguments are not met, then an error message will print to indicate what needs to be changed.
The function plot3D() requires that the data frame is standardized and contains three variables. However, because the vignette does not support interactive plots, only the example code is shown here, without its output.
pima_subset_3 <- subset(pima_scale, select = c("Glucose", "BloodPressure", "BMI"))
plot3D(pima_subset_3)

The argument type allows users to change the axes of the scatter plot. If "original" is supplied, then the axes will be in terms of the original variables. If "PC" is supplied, then the axes will be in terms of the principal components of that subset. The default value is "original".
plot3D(pima_subset_3, type = "PC")

Arrows representing the principal components can be included by changing the logical value of the argument arrows. If TRUE is supplied, then arrows representing the principal components will be included. If FALSE is supplied, then arrows will not be shown. The default value is FALSE.
plot3D(pima_subset_3, arrows = TRUE)
plot3D(pima_subset_3, type = "PC", arrows = TRUE)

rgl can be incompatible with macOS and other operating systems. If the functions do not display interactive plots, it may be necessary to install XQuartz or perform other fixes.

Motivation
The function pcvalues() was created to give users a straightforward way to perform principal component analysis and to place the results in a comprehensible format. These results include matrices of eigenvectors, loadings, eigenvalues, and the cumulative proportion of variance explained by the principal components.
pcvalues() produces a matrix of values used for
principal component analysis which can be customized via its
arguments.
The function pcvalues() has four arguments:
- data: data frame with quantitative variables
- scale: logical argument indicating whether to standardize the data or not
- digits: integer argument to change the decimal digits
- value: character argument that changes the output of information in the matrix

If the requirements of these arguments are not met, then an error message will print to indicate what needs to be changed.
The function pcvalues() requires that the data frame is
standardized. This function has a scale argument that allows users to
standardize their data by setting scale = TRUE, where the default
setting is set to scale = FALSE for users who have already standardized
their data.
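
Why standardization matters can be seen in a short base R sketch. It assumes the built-in USArrests data as a stand-in (its variables are on very different scales) and uses scale() and prcomp() from base R; it does not show how pcvalues() standardizes internally.

```r
x <- USArrests  # stand-in data with very different variable scales

# Center each column to mean 0 and rescale it to standard deviation 1
x_std <- scale(x)
col_means <- colMeans(x_std)
col_sds <- apply(x_std, 2, sd)

# PCA on pre-standardized data matches prcomp() with scale. = TRUE
sdev_manual <- prcomp(x_std)$sdev
sdev_builtin <- prcomp(x, scale. = TRUE)$sdev

# Without standardizing, the variable with the largest variance dominates PC1
pc1_unscaled <- prcomp(x)$rotation[, 1]

print(round(pc1_unscaled, 3))
```

In the unstandardized fit, PC1 is almost entirely the Assault variable, simply because Assault is measured on the largest scale.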
The digits argument changes the decimal digits of the output values.
The value argument allows users to change the output of values as eigenvectors, loadings, eigenvalues, or accumulated proportion of variance explained of the principal components. A matrix of eigenvectors can be produced with the argument value = “eigenvectors”.
pcvalues(data = pima_diabetes, scale = TRUE, value = "eigenvectors")

| | PC 1 | PC 2 | PC 3 | PC 4 | PC 5 | PC 6 | PC 7 | PC 8 |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | -0.1284 | 0.5938 | 0.0131 | -0.0807 | 0.4756 | -0.1936 | 0.5888 | 0.1178 |
| Glucose | -0.3931 | 0.1740 | -0.4679 | 0.4043 | -0.4663 | -0.0942 | 0.0602 | 0.4504 |
| BloodPressure | -0.3600 | 0.1839 | 0.5355 | -0.0560 | -0.3280 | 0.6341 | 0.1921 | -0.0113 |
| SkinThickness | -0.4398 | -0.3320 | 0.2377 | -0.0380 | 0.4879 | -0.0096 | -0.2822 | 0.5663 |
| Insulin | -0.4350 | -0.2508 | -0.3367 | 0.3499 | 0.3469 | 0.2707 | 0.1320 | -0.5486 |
| BMI | -0.4519 | -0.1010 | 0.3619 | -0.0536 | -0.2532 | -0.6854 | 0.0354 | -0.3415 |
| DiabetesPedigreeFunction | -0.2706 | -0.1221 | -0.4332 | -0.8337 | -0.1198 | 0.0858 | 0.0861 | -0.0083 |
| Age | -0.1980 | 0.6206 | -0.0752 | -0.0712 | 0.1093 | 0.0334 | -0.7121 | -0.2117 |
prcomp(scale(pima_diabetes))
#> Standard deviations (1, .., p=8):
#> [1] 1.4471973 1.3157546 1.0147068 0.9356971 0.8731234 0.8262133 0.6479322
#> [8] 0.6359733
#>
#> Rotation (n x k) = (8 x 8):
#> PC1 PC2 PC3 PC4
#> Pregnancies -0.1284321 0.5937858 -0.01308692 0.08069115
#> Glucose -0.3930826 0.1740291 0.46792282 -0.40432871
#> BloodPressure -0.3600026 0.1838921 -0.53549442 0.05598649
#> SkinThickness -0.4398243 -0.3319653 -0.23767380 0.03797608
#> Insulin -0.4350262 -0.2507811 0.33670893 -0.34994376
#> BMI -0.4519413 -0.1009598 -0.36186463 0.05364595
#> DiabetesPedigreeFunction -0.2706114 -0.1220690 0.43318905 0.83368010
#> Age -0.1980271 0.6205885 0.07524755 0.07120060
#> PC5 PC6 PC7 PC8
#> Pregnancies -0.4756057 0.193598168 0.58879003 0.117840984
#> Glucose 0.4663280 0.094161756 0.06015291 0.450355256
#> BloodPressure 0.3279531 -0.634115895 0.19211793 -0.011295538
#> SkinThickness -0.4878621 0.009589438 -0.28221253 0.566283799
#> Insulin -0.3469348 -0.270650609 0.13200992 -0.548621381
#> BMI 0.2532038 0.685372179 0.03536644 -0.341517637
#> DiabetesPedigreeFunction 0.1198105 -0.085784088 0.08609107 -0.008258731
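#> Age -0.1092900 -0.033357170 -0.71208542 -0.211661979

We can observe that the row names are our variables and the column names are the principal components. The prcomp() output agrees with pcvalues() up to the sign of each column (here PC 3 through PC 6 are flipped); an eigenvector and its negative describe the same direction, so both are valid. Each column is an eigenvector of the covariance matrix of the standardized data, and it gives the direction onto which the data are projected to obtain that principal component.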
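
The link between the eigenvectors and the covariance matrix of the standardized data can be illustrated in base R. This sketch assumes USArrests as stand-in data; it compares eigen() applied to the correlation matrix with the rotation matrix from prcomp(), using absolute values because the sign of each column is arbitrary.

```r
x <- scale(USArrests)

# Eigen-decomposition of the correlation matrix
# (the covariance matrix of the standardized data)
eig <- eigen(cor(USArrests))

# prcomp() arrives at the same directions through a singular value decomposition
rot <- prcomp(x)$rotation

# Compare absolute values, since each column may be flipped in sign
max_abs_diff <- max(abs(abs(eig$vectors) - abs(rot)))

# Each eigenvector has unit length
vector_lengths <- sqrt(colSums(eig$vectors^2))

print(max_abs_diff)
print(vector_lengths)
```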
\[\text{Loadings} = \text{eigenvector} \times \sqrt{\text{eigenvalue}}\]
The loadings can be obtained by multiplying each eigenvector by the square root of the eigenvalue of its respective principal component. This process “loads” the eigenvector with magnitude or variance, so that each loading represents the covariability of a variable with the principal component. This process is simplified with the argument value = "loadings".
pcvalues(data = pima_diabetes, scale = TRUE, value = "loadings")

| | PC 1 | PC 2 | PC 3 | PC 4 | PC 5 | PC 6 | PC 7 | PC 8 |
|---|---|---|---|---|---|---|---|---|
| Pregnancies | -0.1859 | 0.7813 | 0.0133 | -0.0755 | 0.4153 | -0.1600 | 0.3815 | 0.0749 |
| Glucose | -0.5689 | 0.2290 | -0.4748 | 0.3783 | -0.4072 | -0.0778 | 0.0390 | 0.2864 |
| BloodPressure | -0.5210 | 0.2420 | 0.5434 | -0.0524 | -0.2863 | 0.5239 | 0.1245 | -0.0072 |
| SkinThickness | -0.6365 | -0.4368 | 0.2412 | -0.0355 | 0.4260 | -0.0079 | -0.1829 | 0.3601 |
| Insulin | -0.6296 | -0.3300 | -0.3417 | 0.3274 | 0.3029 | 0.2236 | 0.0855 | -0.3489 |
| BMI | -0.6540 | -0.1328 | 0.3672 | -0.0502 | -0.2211 | -0.5663 | 0.0229 | -0.2172 |
| DiabetesPedigreeFunction | -0.3916 | -0.1606 | -0.4396 | -0.7801 | -0.1046 | 0.0709 | 0.0558 | -0.0053 |
| Age | -0.2866 | 0.8165 | -0.0764 | -0.0666 | 0.0954 | 0.0276 | -0.4614 | -0.1346 |
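
The loadings formula can be reproduced by hand. The sketch below uses base R with USArrests as an assumed stand-in dataset, and also checks the interpretation of a loading: for standardized data it equals the correlation between a variable and a principal component.

```r
pca <- prcomp(USArrests, scale. = TRUE)
eigenvalues <- pca$sdev^2

# Multiply each eigenvector (column) by the square root of its eigenvalue
loadings <- sweep(pca$rotation, 2, sqrt(eigenvalues), "*")

# Correlation of every standardized variable with every component score
var_pc_cor <- cor(scale(USArrests), pca$x)

# The two matrices agree up to floating point error
max_diff <- max(abs(loadings - var_pc_cor))

print(round(loadings, 4))
```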
Eigenvalues are obtained by summing the squared loadings of each component. These values represent the amount of variance that can be explained by each principal component. The eigenvalues can be produced with the argument value = "eigenvalues".
pcvalues(data = pima_diabetes, scale = TRUE, value = "eigenvalues")

| | Eigenvalue |
|---|---|
| PC 1 | 2.0944 |
| PC 2 | 1.7312 |
| PC 3 | 1.0296 |
| PC 4 | 0.8755 |
| PC 5 | 0.7623 |
| PC 6 | 0.6826 |
| PC 7 | 0.4198 |
| PC 8 | 0.4045 |
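
The relationship between loadings and eigenvalues can be verified in a few lines. This is a sketch in base R on the stand-in USArrests data (an assumption; it has four variables rather than eight).

```r
pca <- prcomp(USArrests, scale. = TRUE)
loadings <- sweep(pca$rotation, 2, pca$sdev, "*")

# Column sums of the squared loadings recover the eigenvalues
eigen_from_loadings <- colSums(loadings^2)

# With p standardized variables, the eigenvalues sum to p (here p = 4)
total_variance <- sum(eigen_from_loadings)

print(round(eigen_from_loadings, 4))
print(total_variance)
```

The same holds for the table above: its eight eigenvalues add up to approximately 8, the number of standardized variables.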
However, these eigenvalues become more meaningful when each is divided by the sum of all the eigenvalues. This gives the proportion of the total variance explained by each principal component. The argument value = "variance" outputs this proportion together with the cumulative proportion of variance explained.
pcvalues(data = pima_diabetes, scale = TRUE, value = "variance")

| | Var. Explained | Cumul. Var. Explained |
|---|---|---|
| PC 1 | 0.2618 | 0.2618 |
| PC 2 | 0.2164 | 0.4782 |
| PC 3 | 0.1287 | 0.6069 |
| PC 4 | 0.1094 | 0.7163 |
| PC 5 | 0.0953 | 0.8116 |
| PC 6 | 0.0853 | 0.8970 |
| PC 7 | 0.0525 | 0.9494 |
| PC 8 | 0.0506 | 1.0000 |
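
The two columns of the variance table follow directly from the eigenvalues. As a worked check in base R, the sketch below copies the rounded eigenvalues from the eigenvalue table and recomputes the proportions; small differences in the last digit are possible because the inputs are already rounded.

```r
# Rounded eigenvalues copied from the eigenvalue table
eigenvalues <- c(2.0944, 1.7312, 1.0296, 0.8755, 0.7623, 0.6826, 0.4198, 0.4045)

# Proportion of total variance explained by each component
proportion <- eigenvalues / sum(eigenvalues)

# Running total of the proportions
cumulative <- cumsum(proportion)

variance_table <- data.frame(
  component = paste("PC", seq_along(eigenvalues)),
  var_explained = round(proportion, 4),
  cumul_var_explained = round(cumulative, 4)
)
print(variance_table)
```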
Motivation
The function screeplot() was created to give users an easy way to visualize the individual and cumulative variance explained by the principal components. Since the goal of PCA is to reduce dimensionality by choosing the fewest principal components that can explain the most variance in our original data, this function will aid us in this decision.
Similar to plot2D() and plot3D(),
screeplot() can produce these visuals from a simple
function call, where the plot can be customized through arguments.
It is important to note that the data must be standardized, either beforehand or through the function's scale argument.
## screeplot()
screeplot() produces a plot, which can be customized via
its arguments.
The function screeplot() has four arguments:
- a data frame on which to perform principal component analysis
- type: character argument changing the type of information to be displayed on the scree plot
- varexplain: numeric argument that changes the threshold of total variance explained
- scale: logical argument indicating whether to standardize the data or not

If the requirements of these arguments are not met, then an error message will print to indicate what needs to be changed.
The function screeplot() requires that the data frame is
standardized. This function has a scale argument that allows users to
standardize their data by setting scale = TRUE, where the default
setting is set to scale = FALSE for users who have already standardized
their data.
The type argument allows users to choose whether the scree plot displays eigenvalues or the cumulative proportion of variance explained by the principal components. The user can display eigenvalues by setting type = "eigenvalue" or display the cumulative proportion of variance explained by setting type = "cumulative". The default is "cumulative", since that is more intuitively understood.
screeplot(pima_diabetes, type = "eigenvalue", scale = TRUE)
Each eigenvalue shown above represents the amount of variance that can be explained by its principal component. We can observe that the first three components have the highest eigenvalues, with the largest drop occurring between the second and third components. This “elbow” typically marks where there will be diminishing returns in selecting more principal components. Let’s take a look at the cumulative proportion of variance explained to see if there is anything different.
screeplot(pima_diabetes, type = "cumulative", scale = TRUE)
We can see that if we selected only the first three components, the cumulative proportion of variance explained would be only about 61%. There are many different opinions on the criteria for choosing the number of components, as it depends on the context of the research. However, this function lets users set a threshold that suits their own needs with the varexplain argument.
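
For readers curious about what the two kinds of scree plot contain, here is a base R sketch that draws both from the rounded eigenvalues in the eigenvalue table. It is an illustration only: the layout, labels, and the 0.80 reference line are assumptions, and this is not the package's plotting code.

```r
# Rounded eigenvalues copied from the eigenvalue table
eigenvalues <- c(2.0944, 1.7312, 1.0296, 0.8755, 0.7623, 0.6826, 0.4198, 0.4045)
cumulative <- cumsum(eigenvalues) / sum(eigenvalues)
threshold <- 0.80

# Size of each drop between consecutive eigenvalues;
# largest_drop = k means the biggest fall is from PC k to PC k + 1
drops <- -diff(eigenvalues)
largest_drop <- which.max(drops)

# Draw to a null device so the sketch runs non-interactively;
# remove pdf(NULL) and dev.off() to see the plots in an interactive session
pdf(NULL)
par(mfrow = c(1, 2))
plot(eigenvalues, type = "b",
     xlab = "Principal component", ylab = "Eigenvalue")
plot(cumulative, type = "b", ylim = c(0, 1),
     xlab = "Principal component", ylab = "Cumulative variance explained")
abline(h = threshold, lty = 2, col = "red")  # dashed threshold line
dev.off()

print(largest_drop)
```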
The varexplain argument supplements type = "cumulative": it lets users change the threshold of total variance explained, which is drawn as a dashed red line. This helps in visualizing how many principal components fall below and above the threshold. The default is 0.80, and the value must lie between 0 and 1.
screeplot(pima_diabetes, varexplain = 0.90, type = "cumulative", scale = TRUE)

After selecting components
Based on the 90 percent threshold, we would select the first six principal components, which together explain about 89.7% of the variance (a seventh would be needed to strictly exceed 90%), and remove the last two. This reduces dimensionality by discarding the principal components that explain the least variance. It is important to note that we are not eliminating variables as a whole; each retained principal component is a combination of all the original variables, so the most important information in the variables is carried into a space with fewer dimensions, as determined by the number of principal components selected.
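
The selection step can also be written as a rule. The sketch below, in base R, first applies a strict "reach or exceed the threshold" rule to the cumulative proportions from the variance table, then shows on the stand-in USArrests data (an assumption) what keeping the first k components looks like. Whether a count that falls just short of the threshold is acceptable remains a judgment call.

```r
# Cumulative proportions copied from the variance table
cumulative <- c(0.2618, 0.4782, 0.6069, 0.7163, 0.8116, 0.8970, 0.9494, 1.0000)
threshold <- 0.90

# Strict rule: smallest number of components that reaches the threshold
k_strict <- which(cumulative >= threshold)[1]

# The same rule applied to a fitted PCA on stand-in data
pca <- prcomp(USArrests, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= threshold)[1]

# Reduced data: one row per observation, one column per retained component.
# Every retained column is a weighted combination of all original variables.
reduced <- pca$x[, seq_len(k), drop = FALSE]

print(k_strict)
print(dim(reduced))
```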
Users can go further by using their principal components as predictors in linear regression models, an approach known as principal component regression. We will not cover how to perform principal component regression here, but this link will provide more information.
Principal Component Analysis is a very useful tool for reducing dimensionality, but the topic can get very complex. Our goal for this package was to streamline the process and help users gain a deeper understanding of the topic through simplified functions. This article is a “One-Stop Shop” for PCA that provides additional information on the topic.