MVA Class 6 — PCA

Introduction and Session Overview
- The session begins with Adam and Luofan Shu discussing the
attendance of other participants, including Peter and Microsoft
Azure, and confirming that Professor Jay Verkuilen is present (00:02:57).
- Jay Verkuilen starts the session, mentioning that he knows at least
one or two people won’t be attending, and decides to get started (00:04:31).
- The session is a continuation of previous discussions on preliminary
topics, including descriptive statistics, which Jay Verkuilen believes
are undervalued and should be done more often before proceeding with
other analyses (00:04:44).
- Jay Verkuilen emphasizes the importance of checking for improperly
coded data or other issues before proceeding with analysis (00:05:06).
Principal Component Analysis (PCA): Introduction and Historical
Context
- The main topic of the session is Principal Components Analysis (PCA),
which is arguably the first multivariate method and was proposed by
Karl Pearson in the early 20th century (00:05:25).
- PCA was developed in the late 19th and early 20th centuries, around
the same time as multiple linear regression, which was invented in 1897
(00:05:46).
- Jay Verkuilen notes that many statistical methods were initially
proposed as separate techniques, but were later found to be special
cases of other methods, such as analysis of variance (ANOVA) being a
special case of multiple regression (00:06:04).
- Karl Pearson proposed PCA in the early 1900s, around the same time
that the concept of eigenvalues was being developed (00:06:42).
- Jay Verkuilen mentions that the development of PCA was facilitated
by Pearson’s wealth and access to computational resources, as well as
the work of his contemporaries, including Francis Galton
(00:07:15).
- The era in which PCA was developed was marked by scientific racism,
and many of the key figures in the field held problematic views by
modern standards (00:07:55).
- The concept of principal components analysis (PCA) was initially
proposed by Karl Pearson but didn’t get its name until the 1930s, when
it was named by statistician and economist Harold Hotelling, who also
devised canonical correlation (00:08:52).
- Harold Hotelling is credited with devising principal components
analysis and canonical correlation, with the latter being related to
discriminant analysis (00:09:03).
- Canonical correlation is a complex topic that can be difficult for
many people to understand, leading some to avoid it (00:09:26).
- Many classical multivariate statistical methods, including PCA and
factor analysis, were devised in the first half of the 20th century (00:09:43).
- Although these methods were developed in the early 20th century, it
wasn’t until the advent of computers that they became practical (00:10:06).
Data Visualization and Dimensionality Reduction in Multivariate
Statistics
- The Swiss faces data set, which has six outcome variables, was
previously analyzed using PCA, revealing a size dimension and a shape
dimension (00:10:47).
- The data set has six variables (P = 6), making it impossible to
visualize in six-dimensional space, which is a common challenge in
multivariate statistics (00:11:47).
- One of the primary goals of multivariate statistics is to find an
optimal way to reduce higher-dimensional data sets to a smaller
dimensional space, making it easier to analyze and visualize (00:12:04).
- The idea is to find the best combination of variables to reduce the
data set from six dimensions to a lower number, such as two dimensions
(K* = 2), while preserving the most important information (00:12:38).
- When dealing with high-dimensional data, it can be challenging to
visualize and understand the relationships between variables, especially
when K equals 6 or more, but there are ways to approximate and trick the
system into making it more manageable (00:13:03).
Optimal Data Reduction and Variable Selection
- One approach to handling high-dimensional data is to use optimal
data reduction, which can provide information about the importance of
certain variables and potentially eliminate unnecessary ones (00:13:40).
- If certain variables do not appear in the optimal projection, it may
be possible to remove them, reducing the dimensionality of the data set,
as seen in the example where the last two variables are eliminated,
leaving a 4-dimensional data set (00:14:01).
- PCA can help identify combinations of variables that are important,
such as overall size and shape dimensions, which were found in a
previous analysis (00:14:19).
- The overall size dimension makes sense, as larger people tend to have
larger heads, while the shape dimension refers to the proportion of
size, such as longer or wider (00:14:45).
Data Pre-processing for PCA
- A scatter plot of two variables, Ltn (ear to bridge of nose) and Ltg
(ear to tip of the chin), shows a typical relationship between the two,
with no unusual patterns (00:15:29).
- The scatter plot represents the depth of the ear compared to the
front of the face, which is relevant for applications such as
mask-making (00:16:02).
- When performing PCA, it is common to center the variables by
subtracting the mean, and standardize them to have the same scale,
usually by dividing by the standard deviation (00:16:52).
- The data are initially in millimeters. Dividing each variable by its
standard deviation converts it to standard-deviation units, which is
only a change of units. The ranges of the variables are roughly the
same, so standardization might not be necessary, but it will be done
anyway (00:17:43).
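
The centering and standardizing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the measurements are made-up numbers (not the course data), the function name `standardize_columns` is hypothetical, and the sample standard deviation (n - 1 denominator) is an assumed choice, since some software divides by n instead.

```python
from statistics import mean, stdev

def standardize_columns(rows):
    """Column-center and scale an N x P table (list of rows) to z-scores."""
    columns = list(zip(*rows))                 # transpose: one tuple per variable
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]      # sample SD (n - 1 denominator), an assumption
    return [
        [(x - m) / s for x, m, s in zip(row, means, sds)]
        for row in rows
    ]

# Hypothetical measurements in millimeters (NOT the course data).
head_mm = [
    [110.0, 130.0],
    [115.0, 138.0],
    [120.0, 135.0],
    [125.0, 145.0],
]
z = standardize_columns(head_mm)
```

After this step every column of `z` has mean 0 and standard deviation 1, so the original millimeter units no longer matter.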
PCA: Geometric Interpretation and Singular Value Decomposition
(SVD)
- Principal Components Analysis (PCA) geometrically involves rotating
the axes to find the best axis, which goes through the mean and down the
middle of the data cloud, and then finding a second axis that is
perpendicular to the first one (00:18:36).
- The first axis is called PC1, and the second axis is called PC2, and
if there were more variables, the additional axes would be perpendicular
to each other (00:19:24).
- PCA is similar to singular value decomposition (SVD), which finds
vectors that are the strongest, next strongest, and so on, and are all
perpendicular to each other (00:20:09).
- The goal of PCA is to find a new coordinate system that is optimal,
meaning that if only one dimension could be chosen to represent the
data, it would be PC1, which captures the most variability (00:21:02).
- If a second dimension is chosen, it would be the one that is
uncorrelated with the first PC and captures as much of the remaining
information as possible (00:21:36).
- Principal Components Analysis (PCA) is a method used to find the
components that account for the most variability in a dataset, with the
first principal component explaining the most variability, the second
explaining the next most, and so on (00:21:52).
- In a two-dimensional dataset, the first principal component will
always look like the average of the two dimensions, and the second
principal component will look like the difference between the two
dimensions (00:22:42).
- In higher-dimensional datasets, more complicated patterns can
emerge, but in two dimensions, the principal components will always be
perpendicular to each other (00:22:55).
- The math of singular value decomposition (SVD) provides the
perpendicularity relationships between the principal components
(00:23:52).
- PCA is an analysis of interdependence, where all variables are peers
of each other, and there is no outcome variable being predicted
(00:25:00).
- PCA is an unsupervised learning method, meaning that it is not used
to predict a specific outcome variable, but rather to understand the
interrelationships between all variables (00:25:18).
- Other unsupervised learning methods that will be covered in class
include correspondence analysis, multidimensional scaling, and
clustering (00:25:55).
- PCA is also known by other names that credit Karl Pearson, Harold
Hotelling, and Loève, such as the Hotelling transform and the
Karhunen-Loève transform (00:24:40).
Supervised vs. Unsupervised Learning
- The terminology used in PCA can be inconsistent, with different areas
using different names to describe the same method (00:24:23).
- The supervised learning method to be used is discriminant analysis,
where there is an outcome variable, as opposed to unsupervised learning,
which has no outcome variable and involves analyzing multiple things at
once (00:26:06).
Pre-processing Techniques for PCA
- Principal Components Analysis (PCA) is the first multivariate
analysis method, invented by Karl Pearson around the turn of the 20th
century and later cleaned up by Harold Hotelling (00:26:54).
- PCA requires pre-processing, which involves thinking about the data
in the matrix, and this concept will be encountered again in
correspondence analysis (00:28:09).
- Pre-processing for PCA typically involves column centering and
standardizing the data, which means subtracting the means from every
column and dividing each column by its standard deviation (00:28:27).
- The effect of column centering and standardizing is to remove
overall level and units, and this is the default in most programs,
including FactoMineR (00:28:54).
- Another choice for pre-processing is row centering, which subtracts
the row mean out, and this can be legitimate or not, depending on
whether the means of the variables are comparable (00:29:55).
- Row centering removes individual differences, such as when the data
has a row-specific effect, and can make sense in certain situations (00:30:19).
- When analyzing data from a class, such as spelling, math, and social
studies percentages for 6th graders, it’s essential to consider that
scores may not be directly comparable due to individual differences,
like smarter kids performing better overall (00:30:46).
- To address this issue, subtracting the mean of each case can help
eliminate individual row differences, removing the “smarter kid effect”
and allowing for a focus on other patterns, such as subject-specific
performance (00:30:55).
- This technique is also applied in bio data analysis, where the first
component is often size, but researchers are more interested in shape
differences, such as changes in body shape across different age groups
(00:31:52).
- Row centering can be used to remove overall size differences and
focus on other patterns, such as developmental changes in body shape (00:32:23).
- Double centering or double standardization involves subtracting both
row and column means, which focuses on interaction and will be discussed
further in the context of correspondence analysis (00:32:43).
- Other normalization techniques, such as term frequency-inverse
document frequency (TF-IDF), can be used in text analysis to identify
words that are unique to a particular document but not common across
all documents (00:33:24).
- TF-IDF takes into account the frequency of terms within a document
and their rarity across all documents, making it a useful technique for
text analysis (00:34:01).
- Alternative normalization methods, such as taking the natural log of
all values, can also be employed, depending on the specific research
question and data characteristics (00:33:38).
- In text analysis, certain words known as “stop words” are often
discarded because they are too common and provide no useful information,
examples of which include words like “a” and “the” (00:34:56).
- The TF-IDF (Term Frequency-Inverse Document Frequency) method is an
example of a normalization technique that down-weights common words,
making it a useful pre-processing step (00:35:20).
- Pre-processing is a crucial step that should be considered in
advance, as different techniques can significantly impact the results of
an analysis (00:35:49).
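
The row-centering and double-centering options described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular package: the scores are invented, and the function names `row_center` and `double_center` are hypothetical.

```python
from statistics import mean

def row_center(rows):
    """Subtract each row's own mean from its entries."""
    return [[x - mean(row) for x in row] for row in rows]

def double_center(rows):
    """Subtract row and column means, then add back the grand mean."""
    row_means = [mean(row) for row in rows]
    col_means = [mean(col) for col in zip(*rows)]
    grand = mean(row_means)  # equals the overall mean because rows have equal length
    return [
        [x - row_means[i] - col_means[j] + grand for j, x in enumerate(row)]
        for i, row in enumerate(rows)
    ]

# Hypothetical percent scores: spelling, math, social studies (one row per student).
scores = [
    [90.0, 95.0, 85.0],
    [60.0, 70.0, 50.0],
    [75.0, 70.0, 80.0],
]
rc = row_center(scores)      # removes each student's overall level
dc = double_center(scores)   # removes student and subject levels, leaving interaction
```

In `rc`, the first two students show the same pattern (math above their own average, social studies below) even though their overall levels differ by 30 points, which is the row-specific effect that row centering removes.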
Missing Data and Descriptive Analysis
- Correspondence analysis uses a different pre-processing technique
than PCA, which can lead to more meaningful results in certain cases
(00:36:07).
- It is essential to check for missing data and potential problems
during the pre-processing stage, and there are packages available, such
as the missing data package for FactoMineR, that can help deal with
these issues (00:36:39).
- A descriptive analysis step can be performed before pre-processing
to identify potential problems with the data (00:37:00).
Singular Value Decomposition (SVD) and Matrix Reconstruction
- Centering and standardization are common pre-processing techniques,
and for the examples discussed, the data will be column-centered and
standardized (00:38:10).
- The preprocessed data will be represented as Y, which is an N x P
matrix, where N is the number of cases, and P is the number of variables
(00:38:50).
- The smaller of N and P is denoted as K, that is, K = min(N, P)
(00:39:04).
- The singular value decomposition (SVD) is applied to matrix Y,
decomposing it into three new matrices: U, V, and D (00:39:21).
- U represents the row singular vectors corresponding to cases, while
V represents the column singular vectors corresponding to variables (00:39:38).
- D is a diagonal matrix containing the singular values, which are
ordered and non-negative (00:40:44).
- U and V are orthogonal matrices, meaning their columns are
perpendicular to each other (00:40:02).
- The singular values in D are ordered, with d1 being greater than or
equal to d2, and so on, down to dK (00:40:47).
- The idea of reconstructing a matrix using rank-one layers is
discussed, where the singular value is multiplied by the corresponding
row and column vectors (00:41:28).
- The reconstruction can be done by adding layers, but it’s possible
to stop at a certain point, denoted as K*, which is less than the total
number of layers (00:42:02).
- The goal is to find a decent approximation of the original matrix,
denoted as Y*, without using all the layers (00:42:22).
- In PCA, it’s generally hoped that most of the singular values are
small, with only a few being large (00:43:01).
- The associated singular vectors or scaled versions of them are
studied, which can be obtained by multiplying the singular vectors by
the singular values (00:43:17).
- Reconstruction can be thought of as the product of the singular
value, the corresponding row vector, and the transpose of the
corresponding column vector (00:43:41).
- Normalization of vectors involves stretching or shrinking them, and
different normalizations can be used, such as multiplying by a constant
or taking the square root of the vectors (00:44:18).
- The choice of normalization depends on how one wants to focus on the
data, and it’s essential to understand the different normalizations when
reading discussions about PCA (00:44:43).
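
The decomposition and layer-by-layer reconstruction described above can be written out compactly. The symbols u_k and v_k for the k-th columns of U and V are notation assumed here for illustration; Y, U, D, V, K, K* and Y* follow the definitions in the notes.

```latex
% Full decomposition: K = min(N, P) rank-one layers
Y = U D V^{\top} = \sum_{k=1}^{K} d_k \, u_k v_k^{\top},
\qquad d_1 \ge d_2 \ge \dots \ge d_K \ge 0

% Truncated reconstruction: keep only the first K^* layers
Y^{*} = \sum_{k=1}^{K^{*}} d_k \, u_k v_k^{\top},
\qquad K^{*} < K
```

Each term is one rank-one layer: a singular value times a row vector times the transpose of a column vector. Stopping at K* is the same as setting the remaining singular values to zero.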
Interpreting PCA Results: Scree Plots and Variance Explained
- Standard plot tools used with PCA include scree plots, which help
determine how many singular values to keep (00:45:25).
- A scree plot is a graphical representation of the singular values,
which are plotted in descending order, and it helps identify the point
at which the values drop off significantly (00:45:48).
- The goal of a scree plot is to determine how many singular values to
keep, and there are various methods for doing so, including looking for
a clear drop-off point or using a threshold value (00:46:15).
- The percent variance explained is the proportion of the total
variance accounted for by the kept singular values, and it’s calculated
as the sum of the squared kept singular values divided by the sum of
all the squared singular values (00:47:34).
- The percent variance explained is similar to R-squared and provides
a way to evaluate the quality of the reconstruction (00:47:43).
- The reconstruction formula can be used to set the number of singular
values to keep, and the rest are set to zero (00:47:20).
- The concept of variance explained is discussed, where it’s mentioned
that if 90% of the variance is accounted for, the variance explained
would be 90%, similar to the idea of r-squared (00:48:29).
- When looking at singular value decomposition output, one typically
sees the variance accounted for by each dimension and the cumulative
total (00:48:46).
- The focus is narrowed down to the most useful outputs, including
scree plots, which show how many singular values to keep (00:49:00).
- It’s emphasized that very small singular values can still be useful,
especially when trying to figure out what variables are associated with
them (00:49:32).
- The goal of an analysis might be to understand what’s happening with
the small singular values, and it’s suggested to ask why they aren’t
associated with reconstructing the data well (00:50:01).
- Generally, the first few strongest singular values, usually the
first 2 or 3, are looked at, but sometimes it’s necessary to look at the
smaller ones to understand certain variables (00:50:24).
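
The percent-variance-explained calculation described above can be sketched as follows. The singular values are invented for illustration, and the function names `variance_explained` and `dims_needed` are hypothetical. The 90% target is just an example threshold, not a recommended rule.

```python
def variance_explained(singular_values):
    """Return the proportion of variance carried by each dimension."""
    squared = [d * d for d in singular_values]
    total = sum(squared)
    return [s / total for s in squared]

def dims_needed(singular_values, target=0.90):
    """Smallest K* whose cumulative proportion reaches the target."""
    cumulative = 0.0
    for k, share in enumerate(variance_explained(singular_values), start=1):
        cumulative += share
        if cumulative >= target - 1e-12:   # small tolerance for rounding error
            return k
    return len(singular_values)

# Hypothetical singular values, ordered largest to smallest.
d = [4.0, 2.0, 2.0, 1.0]
shares = variance_explained(d)            # squared values 16, 4, 4, 1 out of 25
k_star = dims_needed(d, target=0.90)      # dimensions needed to reach 90%
```

Plotting `shares` against the dimension number gives the scree plot. The cumulative sum plays the role that R-squared plays in regression.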
Radial Plots (Circle Plots) and Variable Interpretation
- Another output is the radial plot, also known as a circle plot, which
shows the axes as the dimensions, representing the coordinate system
found by principal components (00:50:59).
- The axes in the radial plot represent the optimal coordinate system,
and the X’s represent the variables (00:52:18).
- PCA provides a map that shows how associated each variable is with a
dimension, and the angles between variables or axes give an idea of
their association, with the length of the vector indicating how well
approximated it is by the coordinate system (00:52:45).
- In a 2-coordinate system with 6 variables, if some variables have
short vectors, it means they are not well represented by the coordinate
system, and the system is primarily driven by the variables with longer
vectors (00:53:37).
- The percent variance explained by each dimension is usually
provided, with the first dimension explaining more variance than the
second dimension (00:54:10).
- Angles between vectors indicate the association between variables,
with smaller angles indicating stronger associations and wider angles
indicating weaker associations (00:54:28).
- In a radial plot, vector lengths are always less than or equal to
one, and the magnitude of the vectors is represented by their proximity
to a circle with a radius of one (00:55:22).
- Radial plots are focused on variables and are useful for
understanding how the columns of a matrix relate to each other (00:56:28).
Case Plots and Clustering
- In addition to radial plots, PCA also provides scatterplots of row
scores, which are case plots that show the relationships between
individual cases (00:57:03).
- Principal Component Analysis (PCA) can be used to cluster groups in
a dataset, allowing for the identification of shared characteristics
among cases, such as demographic information or other distinguishing
features (00:57:20).
- By clustering cases after PCA, researchers can gain a better
understanding of how the cases relate to each other and identify
patterns that may not be immediately apparent (00:58:21).
- PCA is commonly used in genetic studies and epigenetics, where it
can help identify differences in gene expression between groups with
different characteristics, such as diet (00:58:42).
- In a study on chickens with different diets, PCA was used to reduce
a large number of variables into a smaller range of variability,
allowing for visualization of the data and identification of distinct
regions in the principal component space corresponding to different
diets (00:58:55).
Algebraic Interpretation of PCA
- Algebraically, PCA involves finding the optimal set of axes that
capture the most variance in the data. In the two-variable case, the
first principal component is proportional to the sum of the variables
and the second component is proportional to the difference between the
variables (00:59:55).
- The squared singular values of the PCA can be used to determine the
amount of variance explained by each component, with the first singular
value squared being equal to 1 plus the correlation between the
variables, and the second singular value squared being equal to 1 minus
the correlation (01:00:05).
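
The two-variable result above can be checked numerically. The sketch below uses invented paired measurements and assumes both variables are standardized with the sample standard deviation. The component scores are formed as the normalized sum and difference of the z-scores, and their variances come out to 1 + r and 1 - r. The sum is the first component only when r is positive, as it is here.

```python
from math import sqrt
from statistics import mean, stdev, variance

def zscores(values):
    """Center and scale one variable using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def correlation(x, y):
    """Pearson correlation computed from z-scores."""
    zx, zy = zscores(x), zscores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

# Hypothetical paired measurements (NOT the course data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

zx, zy = zscores(x), zscores(y)
r = correlation(x, y)

# Scores on the two components: normalized sum and normalized difference.
pc1 = [(a + b) / sqrt(2) for a, b in zip(zx, zy)]
pc2 = [(a - b) / sqrt(2) for a, b in zip(zx, zy)]

var_pc1 = variance(pc1)   # matches 1 + r
var_pc2 = variance(pc2)   # matches 1 - r
```

The two variances add up to 2, the total variance of two standardized variables, so `var_pc1 / 2` is the proportion of variance explained by the first component.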
Identifying Subgroups and Visualizing Relationships
- In a real-world example, PCA can be used to identify subgroups in a
dataset, such as men and women, and to visualize the relationships
between cases (00:57:41).
- There are statistical methods available to isolate subgroups in a
dataset, but the specific method used will depend on the research
question and the characteristics of the data (01:01:52).
- Cluster analysis is one method that can be used to differentiate
between groups, but the example being discussed uses an existing
variable to label males and females differently, allowing for the
identification of who’s who in a scatter plot (01:01:56).
- The scatter plot will show who’s male and who’s female, and this is
not directly connected to other methods, but rather one of many ways to
make these choices (01:02:43).
- The method being used is related to what was learned about distances
in the last lecture, specifically rotating to the Mahalanobis
distance space (01:03:21).
- The Mahalanobis space represents different distances, including
angular distance, which is a measure of how different variables are from
each other (01:03:39).
- Angular distance can be used to quantify the difference between
variables, but it’s often not necessary and can be eyeballed by looking
at how close or far apart the vectors are (01:03:48).
- The method being used represents all the various interpoint
distances that exist in the data, which will be explicitly worked on in
multidimensional scaling (MDS) (01:04:36).
- MDS will use all the interpoint distances to represent the data, and
cluster analysis will be revisited in more detail later (01:05:00).
- The radial plot is used to think about the columns and how different
they are from each other in a correlation coefficient kind of way,
looking at the variables rather than individual data points (01:05:51).
- Radial plots are used for plotting correlations, while case plots
are used for plotting individual data points, and the correlation
determines the axis in a scatter plot (01:06:18).
- Studying variables and cases can be done by looking across rows or
down columns, with radial plots looking down columns and case plots
looking across rows (01:06:55).
- Variables and cases are not directly represented on the same plot,
and some programs may misrepresent them together, which can be visually
overwhelming and misleading (01:07:15).
- FactoMineR does not represent variables and cases together unless
forced to, as it keeps them separate to avoid visual overwhelm and
misleading conclusions (01:07:24).
Heads Data Set: Analysis and Interpretation
- The heads data set is used as an example, which includes 200 male
and 59 female participants, with the number of females being too low for
good conclusions (01:08:12).
- Descriptive statistics, including means and standard deviations, are
calculated for the data set, showing clear group differences between
males and females (01:08:55).
- Correlation coefficients are calculated, showing positive
correlations for males and some negative correlations for females, which
could be driven by outliers (01:09:27).
- Scatter plot matrices are used to visualize the data, showing
marginal distributions and ellipses capturing 50% and 90% of the data
(01:10:11).
- The initial examination of the data involves looking for outliers in
the negative correlations, but in this case, no outliers are found (01:10:51).
- The standard deviations of the variables are fairly close, so
centering and standardizing everything is done to make the variables
comparable (01:11:33).
- Centering and standardizing are necessary because the variables need
to be comparable to each other in some basic sense to be analyzed (01:11:47).
- The PCA function from the FactoMineR package is used to run PCA
(01:12:14).
- The PCA function generates a case plot and a radial plot by default
(01:12:34).
- The radial plot has a circle of radius 1 because the vectors are
essentially correlations, which cannot exceed 1 or be less than -1 (01:12:49).
- The first two dimensions account for about 63% of the variance, but
it is unclear if this is enough variance (01:13:27).
- The first two principal components are examined by default, but it
may be necessary to look more deeply at additional components (01:13:39).
- The variables Bam, Mfb, Ltg, and Ltn are strongly correlated with
each other and with the first dimension, which may represent gross size
or shape (01:14:22).
- The second dimension is defined by the variables Elgan (length from
glabella to apex nasion) and Tfh (true facial height) (01:14:51).
- PCA has its downsides, particularly when compared to methods like
exploratory factor analysis, which focuses more on individual variables
rather than an optimal representation of all data (01:15:33).
- If the goal is to understand variables on a more individual basis,
running a factor analysis might be a better alternative (01:15:56).
- PCA is related to factor analysis, but they serve different purposes
(01:16:05).
- In the given example, the plot of dimension one against dimension two
is an unlabeled scatter plot of the 200 males in the sample, with no
clear pattern visible (01:16:22).
- The plot can be used to look for potential clusters or separation in
the data, but in this case, no clear separation is visible (01:17:05).
PCA Output and Interpretation
- PCA generates a lot of tabular output, including a summary method
that provides three basic groups of information (01:17:40).
- The summary method includes eigenvalues, which are squared singular
values used to calculate the proportion of variance explained (01:18:08).
- The squared singular values are used to calculate the percentage of
variance accounted for by each dimension, with the values dropping off
as the dimensions increase (01:18:37).
- The rate at which the values drop off can indicate the importance of
each dimension, with a faster drop-off indicating a more significant
reduction in variance explained (01:18:49).
- The importance of getting a good fit depends on the specific study
or application, and may require creating prototypes and testing them to
determine the optimal level of fit (01:18:58).
- The output also includes information on individuals, but the
specific abbreviation used is unclear and requires further investigation
(01:19:32).
- The return list of a function contains all the information returned
by that function, such as coefficients and standard errors, similar to
running a regression (01:19:58).
- The function provides coordinates, squared cosines, and
contributions, which can be used to analyze the data (01:20:44).
- The function can display information about individuals, but it is
limited to the first 10 people by default, and using the dollar sign can
retrieve information about all individuals (01:20:50).
- The summary of the function is designed to reduce the amount of
information provided, making it less overwhelming (01:21:23).
- The book “Exploratory Multivariate Analysis by Example Using R” by
Husson et al. describes the methods and measures used in the function
(01:21:32).
- The function provides a summary of the coordinates of each
individual and variable, as well as the distance, although the specific
meaning of the distance is unclear (01:22:09).
- Cosine squared is a measure of how well represented a point is by a
given dimension, with small values indicating poor representation
(01:23:09).
- Guidelines for interpreting the numbers include using small values
to indicate trivial representation, and larger values to indicate better
representation (01:23:29).
- Contribution plots can be used to visualize which points are
contributing to the different dimensions (01:24:07).
- The cases plot makes use of the information provided by the
function, including the coordinates and contributions (01:24:41).
- The Mahalanobis
distance is mentioned, but its exact meaning in this context is
unclear, with a note of caution about the interpretation (01:24:46).
- The cases plot is not frequently used, and instead, cluster analysis
is preferred for analyzing the data (01:25:17).
- The variables have a dimension score, which corresponds to the
coordinates in the radial plot (01:25:45).
- The contribution of a variable measures how much it contributes to
the overall dimension (01:25:58).
- The term “contribution” is clarified, noting that it might be
confused with “center” due to the abbreviation used (01:26:14).
- Cosine squared is an r-squared value that measures the relationship
between a variable and a dimension (01:26:17).
- The cosine squared values for variables Ltn and Ltg with respect to
Dimension 1 are 0.618 and 0.645, respectively (01:26:48).
- The contribution values are related to the cosine squared values,
indicating how much each variable defines a dimension (01:26:55).
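
The link between coordinates, cosine squared, and contributions for variables can be sketched as follows. The coordinates are invented, the variable names "A" through "D" are placeholders, and the function names are hypothetical. The sketch assumes a standardized PCA in which a variable's coordinate on a dimension is its correlation with that dimension, and it expresses each contribution as a percent share of the dimension's total cosine squared. The exact conventions of a given package should be checked against its documentation.

```python
def variable_cos2(coords):
    """Squared coordinate (correlation) of each variable on one dimension."""
    return {name: c * c for name, c in coords.items()}

def variable_contrib(coords):
    """Percent share of each variable in the dimension's total cos2."""
    cos2 = variable_cos2(coords)
    total = sum(cos2.values())   # assumed to equal the dimension's eigenvalue
    return {name: 100.0 * v / total for name, v in cos2.items()}

# Hypothetical correlations of four variables with Dimension 1.
dim1_coords = {"A": 0.8, "B": 0.6, "C": -0.5, "D": 0.1}
cos2 = variable_cos2(dim1_coords)
contrib = variable_contrib(dim1_coords)
```

Cosine squared answers how well the dimension represents a variable, while the contribution answers how much that variable defines the dimension. A variable with a negative coordinate still contributes, because only the squared value matters.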
Factoextra Package and Biplots
- Jay Verkuilen introduces the factoextra package, which can be used
to extract additional information from PCA results, but notes that he
hasn’t used it much due to limited online examples and the author’s
focus on selling a book (01:30:28).
- Adam suggests that using a different package to create and analyze
PCA results might cause compatibility issues (01:31:16).
- Jay Verkuilen attempts to use the fviz function from the factoextra
package to generate additional plots, including a biplot that overlays
the variables and cases plots (01:32:35).
- Jay Verkuilen expresses his personal preference for separate radial
and cases plots over the biplot, citing concerns that the biplot can be
misleading and overwhelming (01:33:04).
- Jay Verkuilen finds the biplot too complicated and difficult to
interpret, with too much information on one plot, and prefers to think
about the cases in a different way (01:34:08).
- Journal editors might prefer the biplot because it takes up less
space in a journal, but Jay Verkuilen personally finds it more
understandable to use separate plots (01:34:55).
- Jay Verkuilen thinks that the variability on the x-axis and y-axis
should be proportional to the amount of variability accounted for by
each dimension, and the biplot is constructed to respect this (01:35:42).
- The biplot tries to prevent inferential mistakes by ensuring that
the axes are respected and the variability is accurately represented (01:36:14).
- Jay Verkuilen prefers to use the radial plot and the cases plot
instead of the biplot because he finds them easier to understand and
less confusing (01:36:38).
- The biplot can be thought of as a regression of all variables on the
dimensions, but it’s not a traditional regression and can be confusing
(01:37:00).
- Jay Verkuilen didn’t use the biplot in his example because he didn’t
like it, but he doesn’t always remember the reasons behind his decisions
(01:37:09).
Comparing Male and Female Data
- The summary and dimdesc output for the women will be similar, showing
a configuration of variables based on fewer cases, which is already
known (01:37:49).
- Jay Verkuilen suggests comparing the configuration of variables for
men and women using the radial plot (01:38:12).
- When comparing males and females, it’s essential to look at the
graphics in the right order to see things that should be compared to
each other, such as the configuration of points (01:38:14).
- The signs of dimensions are arbitrary, meaning the plot might get
mirror-imaged or flipped upside down, and there’s no difference because
the signs are indeterminate (01:39:01).
- If the males and females have different orientations in space, it
may be necessary to flip the data to compare them properly, and there
are automated tools like Procrustes analysis to make this process less
tedious (01:39:57).
- When comparing the males, it’s possible to see which variables tend
to hang together, such as Elgan and Tfh, and Bam and Mfb (01:40:16).
- The data suggests that men’s faces are characterized more by overall
size and some shape differences, while women’s faces seem to be
characterized more equally by overall size and shape differences (01:41:13).
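The sign indeterminacy and the Procrustes alignment mentioned in this section can be sketched in a few lines. This is a minimal illustration in Python with NumPy, not the code used in the session: the loadings matrix is invented, and the helper name orthogonal_procrustes is hypothetical. It flips the sign of one dimension, then recovers the original orientation with an orthogonal Procrustes rotation.

```python
import numpy as np

def orthogonal_procrustes(source, target):
    """Return the orthogonal matrix R that minimizes ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

# Invented loadings for 4 variables on 2 dimensions (illustration only).
loadings_a = np.array([
    [0.8, 0.3],
    [0.7, 0.4],
    [0.6, -0.5],
    [0.5, -0.6],
])

# The same configuration with dimension 2 mirror-imaged. Both are equally
# valid PCA solutions because the signs of dimensions are arbitrary.
loadings_b = loadings_a @ np.diag([1.0, -1.0])

# Rotate/reflect configuration B so that it lines up with configuration A.
rotation = orthogonal_procrustes(loadings_b, loadings_a)
aligned_b = loadings_b @ rotation

print(np.round(rotation, 3))   # the recovered rotation is the sign flip
print(np.allclose(aligned_b, loadings_a))
```

Because the second configuration differs from the first only by a sign flip, the recovered rotation is exactly that flip. With two real groups the match after alignment would be approximate, and the remaining differences are what is worth interpreting.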
Data Understanding and Dimensionality
- It’s essential to have a deep understanding of the data, including
photographs of the subjects, to really understand what’s happening and
not just rely on the analysis (01:41:48).
- The summary of the data can provide a quick overview, but it’s
crucial to look beyond the summary to gain a deeper understanding of the
results (01:42:13).
- To account for 90% of the variability, 5 dimensions are required for the
general population, whereas 4 dimensions are sufficient for males (01:42:27).
- The complexity needed to understand the data depends on what needs
to be measured (01:42:40).
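One way to read the 90% figure is as the cumulative share of variability across dimensions. The sketch below assumes that reading. The eigenvalues are invented for illustration and are not the values from the session's data, and components_needed is a hypothetical helper.

```python
import numpy as np

def components_needed(eigenvalues, threshold=0.90):
    """Smallest number of leading dimensions whose cumulative share of
    variance reaches the threshold."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # searchsorted finds the first position where cumulative >= threshold.
    return int(np.searchsorted(cumulative, threshold) + 1)

# Invented eigenvalues, sorted from largest to smallest (illustration only).
example_eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.3, 0.2]

needed_90 = components_needed(example_eigenvalues, threshold=0.90)
needed_95 = components_needed(example_eigenvalues, threshold=0.95)
print(needed_90, needed_95)
```

Running the same helper on each group's eigenvalues gives the kind of comparison described above, where one group needs fewer dimensions than another to reach the same threshold.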
Factoshiny: Interactive Exploration and Reporting
- The Factoshiny function can be used to generate a web page with
analysis and graphical options, allowing users to modify graphs and
explore the data (01:43:15).
- The web page provides various graphical options, including the
ability to plot different axes and view scatter plots (01:43:52).
- Users can manipulate graph parameters and zoom in on specific areas
of the graph (01:44:41).
- The Factoshiny function also allows for clustering and automatic
report generation, which can be useful for generating reports and saving
code (01:44:59).
- The automatic report generates a Word document with output
and a summary of findings, which can be a good starting point for
writing a report (01:45:29).
- The automatic report is not as good as a handwritten report, but it
is a useful option for generating reports quickly (01:45:47).
- Users can save the code used to generate the report, making it easy
to run multiple analyses and generate automated reports (01:46:02).
- The process of merging files and editing them down to what is needed
is more efficient than manually running and copying and pasting
everything, especially when running multiple analyses (01:46:34).
- The output options for the process can be customized, although the
specific details of the options are not recalled (01:46:49).
- Downloading graphics from the process is also an option, making it a
fantastic choice, especially when running multiple analyses for a
dissertation (01:47:11).
- Running multiple analyses can be a lot of work, but this process
provides a nice option to make it easier (01:47:22).
- The ability to generate supplementary output, such as scree plots,
is a useful feature that is hoped to be implemented in other software
and R packages (01:47:39).
- The factoextra package has a feature that allows for the generation
of nice output, which is a useful tool for interpreting analyses (01:47:56).
- The simple R package, or ecology of packages, has tools that can
help with interpreting analyses and generating good output (01:48:18).
Scree Plots and Determining the Number of Singular Values
- Scree plots will be discussed in more detail later in the semester
when talking about inference (01:48:40).
- The scree plot is generated using the factoextra package and shows
the actual, observed eigenvalues, which can be used to determine the
number of singular values to keep (01:49:04).
- The factoextra package has multiple methods for determining the
number of singular values to keep, including parallel analysis, optimal
coordinates, and acceleration factor (01:50:04).
- The parallel analysis method generalizes the eigenvalue-greater-than-one
criterion, which states that if an eigenvalue is greater than one, it
accounts for more variability than a single standardized variable (01:50:15).
- Optimal coordinates in PCA draw a line through the points, similar to a
regression line, to identify outliers and keepers, which are points
above the line (01:50:49).
- Bootstrapping in PCA involves running simulations from the data set
to obtain eigenvalues or squared singular values, which can be used to
determine the number of factors to keep (01:51:14).
- The eigenvalues obtained from bootstrapping can be used to assess
the number of factors to keep, with values greater than one indicating
important factors (01:51:41).
- The paper “end factors” (likely a transcription of “nFactors”) describes
various methods for determining the number of factors to keep, including
different criteria and rules of thumb (01:52:50).
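Parallel analysis, as it is commonly described, compares the observed eigenvalues with eigenvalues obtained from random data of the same size. The following is a minimal sketch of that idea, assuming a PCA on the correlation matrix, simulated data, and a 95th-percentile reference. It is not the R implementation shown in the session, and the function names are hypothetical.

```python
import numpy as np

def correlation_eigenvalues(data):
    """Eigenvalues of the correlation matrix, largest first."""
    corr = np.corrcoef(data, rowvar=False)
    return np.linalg.eigvalsh(corr)[::-1]

def parallel_analysis(data, n_sims=200, quantile=0.95, seed=0):
    """Compare observed eigenvalues with those of uncorrelated random data."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    observed = correlation_eigenvalues(data)
    simulated = np.empty((n_sims, p))
    for i in range(n_sims):
        simulated[i] = correlation_eigenvalues(rng.standard_normal((n, p)))
    reference = np.quantile(simulated, quantile, axis=0)
    exceeds = observed > reference
    # Keep leading dimensions up to the first one that fails the comparison.
    n_keep = p if exceeds.all() else int(np.argmin(exceeds))
    return observed, reference, n_keep

# Simulated data: 6 variables driven by one common factor (illustration only).
rng = np.random.default_rng(1)
common = rng.standard_normal((300, 1))
data = 0.8 * common + 0.6 * rng.standard_normal((300, 6))

observed, reference, n_keep = parallel_analysis(data)
kaiser_count = int((observed > 1.0).sum())   # eigenvalue-greater-than-one rule
print(np.round(observed, 2))
print(np.round(reference, 2))
print(n_keep, kaiser_count)
```

The eigenvalue-greater-than-one rule compares every eigenvalue with the fixed value 1, while parallel analysis replaces that fixed value with a reference that depends on the sample size and the number of variables.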
Supplementary Data and Group Comparisons
- Supplementary data in PCA refers to points or variables that are not used
to compute the overall point cloud but are projected into the space,
allowing for the analysis of additional data (01:53:19).
- Supplementary points can be used to analyze the relationship between
different groups, such as males and females, by projecting one group
into the space defined by the other group (01:53:46).
- Variables can also be made supplementary, allowing for the analysis
of additional variables in the same space (01:54:26).
- Categories can be made supplementary as well, enabling the analysis
of categorical data in the same space (01:54:41).
- The plot of supplementary data can show the relationship between
different groups or variables, with the black dots representing the
original data points and the blue dots representing the supplementary
points (01:55:10).
- The representation of females in the space defined by the male
soldiers is being examined to see how well they are represented by the
existing system, with the ideal outcome being a fairly overlapping
distribution of female cases within the general vicinity of the male
cases (01:55:14).
- However, it is observed that there are many female cases that are
not appearing within the point cloud, indicating that the existing
system may not be suitable for females (01:55:55).
- A similar analysis can be applied to the Parkinson’s data by
comparing the controls and clinical cases to assess how well the
clinical cases are represented by the system defined by the controls (01:56:06).
- Visualizing the data can be a powerful tool to understand the
overlap between different groups, such as the black dots and blue dots,
and to determine how well a system that works for one group will work
for another (01:56:29).
- The lack of overlap between the black dots and blue dots suggests
that a system that works for the black dots may not be effective for the
blue dots (01:56:45).
- A report on this analysis might include a plot of the data using a
function such as hf.sub.PCA to visualize the probability ellipses around
the different groups (01:57:07).
- The plot ellipses function can be used to compute the overlap
between the different groups and to visualize how distinct they are from
each other (01:58:17).
- This type of analysis can be applied to supplementary data by
computing the PCA and using the formulas to project new cases into the
space (01:59:10).
- The space in a plot can be used to visualize supplementary
variables, which appear as dashed lines, allowing for an understanding
of how well a variable is represented by the existing structure (01:59:33).
- This method can be used to assess for outliers by identifying
individuals or data points that are not well-represented in the existing
space (02:00:15).
- Supplementary individuals can be visualized in the plot, showing
their distances from the center of the point cloud, which are likely
Mahalanobis distances (02:00:50).
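The projection of supplementary cases can be sketched as follows. This is an illustration under stated assumptions, not the routine used in the session: the data are simulated, the PCA is run on standardized active cases, and the function names are hypothetical. Supplementary cases are standardized with the active group's means and standard deviations, then multiplied by the active group's axes. The distance shown is a Mahalanobis-type distance within the retained dimensions (squared scores divided by eigenvalues), which is one plausible reading of the distances mentioned above.

```python
import numpy as np

def fit_pca(active, n_components):
    """PCA on standardized active cases; returns what is needed to project new cases."""
    mean = active.mean(axis=0)
    scale = active.std(axis=0)              # population SD (ddof=0)
    z = (active - mean) / scale
    _, singular_values, vt = np.linalg.svd(z, full_matrices=False)
    eigenvalues = singular_values**2 / active.shape[0]
    return mean, scale, vt[:n_components].T, eigenvalues[:n_components]

def project(cases, mean, scale, axes):
    """Scores of any cases in the space defined by the active group."""
    return ((cases - mean) / scale) @ axes

def squared_distance(scores, eigenvalues):
    """Squared Mahalanobis-type distance from the center, within the kept dimensions."""
    return (scores**2 / eigenvalues).sum(axis=1)

rng = np.random.default_rng(0)

def simulate(n, shift):
    # Simulated cases: 4 correlated variables, optionally shifted (illustration only).
    common = rng.standard_normal((n, 1))
    return 0.8 * common + 0.6 * rng.standard_normal((n, 4)) + shift

active = simulate(200, 0.0)          # defines the space
supplementary = simulate(50, 3.0)    # projected only, never used in the fit

mean, scale, axes, eigenvalues = fit_pca(active, n_components=2)
active_scores = project(active, mean, scale, axes)
supp_scores = project(supplementary, mean, scale, axes)

d_active = squared_distance(active_scores, eigenvalues)
d_supp = squared_distance(supp_scores, eigenvalues)
print(round(d_active.mean(), 2), round(d_supp.mean(), 2))
```

Supplementary cases whose distances are much larger than those of the active cases fall outside the active point cloud, which is the pattern described above for groups that the existing space does not represent well.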
Multi-Dimensional Scaling (MDS) and Future Topics