MVA Class 6 — PCA

Introduction and Session Overview

  • The session begins with Adam and Luofan Shu discussing which other participants will attend, including Peter and another participant, and confirming that Professor Jay Verkuilen is present (00:02:57).
  • Jay Verkuilen starts the session, mentioning that he knows at least one or two people won’t be attending, and decides to get started (00:04:31).
  • The session is a continuation of previous discussions on preliminary topics, including descriptive statistics, which Jay Verkuilen believes are undervalued and should be done more often before proceeding with other analyses (00:04:44).
  • Jay Verkuilen emphasizes the importance of checking for improperly coded data or other issues before proceeding with analysis (00:05:06).

Principal Component Analysis (PCA): Introduction and Historical Context

  • The main topic of the session is Principal Components Analysis (PCA), which is arguably the first multivariate method and was proposed by Karl Pearson in the early 20th century (00:05:25).
  • PCA was developed in the late 19th and early 20th centuries, around the same time as multiple linear regression, which was invented in 1897 (00:05:46).
  • Jay Verkuilen notes that many statistical methods were initially proposed as separate techniques, but were later found to be special cases of other methods, such as Analysis of variance being a special case of multiple regression (00:06:04).
  • Karl Pearson proposed PCA in the early 1900s, around the same time that the concept of eigenvalues was being developed (00:06:42).
  • Jay Verkuilen mentions that the development of PCA was facilitated by Pearson’s wealth and access to computational resources, as well as the work of his contemporaries, including Francis Galton (00:07:15).
  • The era in which PCA was developed was marked by scientific racism, and many of the key figures in the field held problematic views by modern standards (00:07:55).
  • The concept of principal components analysis was initially proposed by Karl Pearson but didn’t get its name until the 1930s, when it was named by the statistician and economist Harold Hotelling, who also devised canonical correlation (00:08:52).
  • Harold Hotelling is credited with devising principal components analysis and canonical correlation, with the latter being related to discriminant analysis (00:09:03).
  • Canonical correlation is a complex topic that can be difficult for many people to understand, leading some to avoid it (00:09:26).
  • Many classical multivariate statistical methods, including PCA and factor analysis, were devised in the first half of the 20th century (00:09:43).
  • Although these methods were developed in the early 20th century, it wasn’t until the advent of computers that they became practical (00:10:06).

Data Visualization and Dimensionality Reduction in Multivariate Statistics

  • The Swiss heads data set, which has six outcome variables, was previously analyzed using PCA, revealing a size dimension and a shape dimension (00:10:47).
  • The data set has six variables (P = 6), making it impossible to visualize in six-dimensional space, which is a common challenge in multivariate statistics (00:11:47).
  • One of the primary goals of multivariate statistics is to find an optimal way to reduce higher-dimensional data sets to a smaller dimensional space, making it easier to analyze and visualize (00:12:04).
  • The idea is to find the best combination of variables to reduce the data set from six dimensions to a lower number, such as two dimensions (K* = 2), while preserving the most important information (00:12:38).
  • When dealing with high-dimensional data, it can be challenging to visualize and understand the relationships between variables, especially with six or more dimensions, but there are ways to approximate the data that make the problem more manageable (00:13:03).

Optimal Data Reduction and Variable Selection

  • One approach to handling high-dimensional data is to use optimal data reduction, which can provide information about the importance of certain variables and potentially eliminate unnecessary ones (00:13:40).
  • If certain variables do not appear in the optimal projection, it may be possible to remove them, reducing the dimensionality of the data set, as seen in the example where the last two variables are eliminated, leaving a 4-dimensional data set (00:14:01).
  • Principal component analysis can help identify combinations of variables that are important, such as overall size and shape dimensions, which were found in a previous analysis (00:14:19).
  • The overall size dimension makes sense, as larger people tend to have larger heads, while the shape dimension refers to proportions, such as longer or wider faces (00:14:45).

Data Pre-processing for PCA

  • A scatter plot of two variables, Ltn (ear to bridge of nose) and Ltg (ear to tip of the chin), shows a typical relationship between the two, with no unusual patterns (00:15:29).
  • The scatter plot represents the depth of the ear compared to the front of the face, which is relevant for applications such as mask-making (00:16:02).
  • When performing Principal component analysis, it is common to center the variables by subtracting the mean, and standardize them to have the same scale, usually by dividing by the standard deviation (00:16:52).
  • The data are initially in millimeters; dividing by a constant such as the standard deviation simply shifts the units. Because the ranges of the variables are roughly the same, standardization might not be strictly necessary here, but it will be done anyway (see the sketch below) (00:17:43).
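
  A minimal sketch of the centering and standardizing step in R, assuming a data frame of the six head measurements called heads (the name is illustrative):

      # Column-center and standardize: subtract each column's mean, divide by its SD
      Y <- scale(heads, center = TRUE, scale = TRUE)
      round(colMeans(Y), 10)   # all approximately 0
      apply(Y, 2, sd)          # all exactly 1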

PCA: Geometric Interpretation and Singular Value Decomposition (SVD)

  • Principal Components Analysis (PCA) geometrically involves rotating the axes to find the best axis, which goes through the mean and down the middle of the data cloud, and then finding a second axis that is perpendicular to the first one (00:18:36).
  • The first axis is called PC1, and the second axis is called PC2, and if there were more variables, the additional axes would be perpendicular to each other (00:19:24).
  • Principal component analysis is closely related to the singular value decomposition (SVD), which finds vectors that are the strongest, next strongest, and so on, all perpendicular to each other (00:20:09).
  • The goal of PCA is to find a new coordinate system that is optimal, meaning that if only one dimension could be chosen to represent the data, it would be PC1, which captures the most variability (00:21:02).
  • If a second dimension is chosen, it would be the one that is maximally uncorrelated with the first PC, which would capture any remaining information (00:21:36).
  • Principal Components Analysis (PCA) is a method used to find the components that account for the most variability in a dataset, with the first principal component explaining the most variability, the second explaining the next most, and so on (00:21:52).
  • In a two-dimensional dataset, the first principal component will always look like the average of the two dimensions, and the second principal component will look like the difference between the two dimensions (00:22:42).
  • In higher-dimensional datasets, more complicated patterns can emerge, but in two dimensions, the principal components will always be perpendicular to each other (00:22:55).
  • The math of Singular value decomposition (SVD) provides the perpendicularity relationships between the principal components (00:23:52).
  • Principal component analysis is an analysis of interdependence, where all variables are peers of each other, and there is no outcome variable being predicted (00:25:00).
  • PCA is an unsupervised learning method, meaning that it is not used to predict a specific outcome variable, but rather to understand the interrelationships between all variables (00:25:18).
  • Other unsupervised learning methods that will be covered in class include correspondence analysis, Multidimensional scaling, and clustering (00:25:55).
  • PCA is also known by other names, such as the Hotelling transform or the Karhunen–Loève transform, after the people who developed it in different fields (00:24:40).

Supervised vs. Unsupervised Learning

  • The terminology used in Principal component analysis can be inconsistent, with different areas using different names to describe the same method (00:24:23).
  • A supervised learning method to be covered is discriminant analysis, where there is an outcome variable; unsupervised learning, by contrast, has no outcome variable and analyzes all the variables at once (00:26:06).

Pre-processing Techniques for PCA

  • Principal Components Analysis (PCA) is the first multivariate analysis method, invented by Karl Pearson in the early 1900s and later cleaned up by Harold Hotelling (00:26:54).
  • PCA requires pre-processing, which involves thinking about the data in the matrix, and this concept will be encountered again in correspondence analysis (00:28:09).
  • Pre-processing for Principal component analysis typically involves column centering and standardizing the data, which means subtracting the means from every column and dividing each column by its standard deviation (00:28:27).
  • The effect of column centering and standardizing is to remove overall level and units, and this is the default in most programs, including FactoMineR (00:28:54).
  • Another choice for pre-processing is row centering, which subtracts each row’s mean; this can be legitimate or not, depending on whether the means of the variables are comparable (see the sketch after this list) (00:29:55).
  • Row centering removes individual differences, such as when the data has a row-specific effect, and can make sense in certain situations (00:30:19).
  • When analyzing data from a class, such as spelling, math, and social studies percentages for 6th graders, it’s essential to consider that scores may not be directly comparable due to individual differences, like smarter kids performing better overall (00:30:46).
  • To address this issue, subtracting the mean of each case can help eliminate individual row differences, removing the “smarter kid effect” and allowing for a focus on other patterns, such as subject-specific performance (00:30:55).
  • This technique is also applied in bio data analysis, where the first component is often size, but researchers are more interested in shape differences, such as changes in body shape across different age groups (00:31:52).
  • Row centering can be used to remove overall size differences and focus on other patterns, such as developmental changes in body shape (00:32:23).
  • Double centering or double standardization involves subtracting both row and column means, which focuses on interaction and will be discussed further in the context of correspondence analysis (00:32:43).
  • Other normalization techniques, such as term frequency–inverse document frequency (TF-IDF), can be used in text analysis to identify words that are unique to a particular document but not common across all documents (00:33:24).
  • TF-IDF takes into account the frequency of terms within a document and their rarity across all documents, making it a useful technique for text analysis (00:34:01).
  • Alternative normalization methods, such as taking the natural log of all values, can also be employed, depending on the specific research question and data characteristics (00:33:38).
  • In text analysis, certain words known as “stop words” are often discarded because they are too common and provide no useful information, examples of which include words like “a” and “the” (00:34:56).
  • The TF-IDF (Term Frequency-Inverse Document Frequency) method is an example of a normalization technique that down-weights common words, making it a useful pre-processing step (00:35:20).
  • Pre-processing is a crucial step that should be considered in advance, as different techniques can significantly impact the results of an analysis (00:35:49).
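
  A minimal sketch of the pre-processing variants discussed above, assuming a numeric matrix X of cases (rows) by variables (columns); the names are illustrative:

      col_std <- scale(X, center = TRUE, scale = TRUE)   # column centering + standardizing (the usual default)
      row_ctr <- sweep(X, 1, rowMeans(X), "-")           # row centering: removes each case's own level
      # Double centering: X_ij - row mean_i - column mean_j + grand mean
      dbl_ctr <- sweep(sweep(X, 2, colMeans(X), "-"), 1, rowMeans(X) - mean(X), "-")
      log_X   <- log(X)                                  # an alternative normalization when all values are positive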

Missing Data and Descriptive Analysis

  • Correspondence analysis uses a different pre-processing technique than PCA, which can lead to more meaningful results in certain cases (00:36:07).
  • It is essential to check for missing data and other potential problems during the pre-processing stage, and there are packages available, such as missMDA (the missing-data companion to FactoMineR), that can help deal with these issues (see the sketch after this list) (00:36:39).
  • A descriptive analysis step can be performed before pre-processing to identify potential problems with the data (00:37:00).
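
  A minimal sketch of handling missing values before PCA, assuming the missMDA companion package to FactoMineR and a data frame heads containing some NAs (the names are illustrative):

      library(missMDA)
      library(FactoMineR)
      nb  <- estim_ncpPCA(heads)              # estimate how many components to use for imputation
      imp <- imputePCA(heads, ncp = nb$ncp)   # impute the missing cells with a PCA model
      res <- PCA(imp$completeObs)             # run the PCA on the completed data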

Singular Value Decomposition (SVD) and Matrix Reconstruction

  • Centering and standardization are common pre-processing techniques, and for the examples discussed, the data will be column-centered and standardized (00:38:10).
  • The preprocessed data will be represented as y, which is an N x P matrix, where N is the number of cases, and P is the number of variables (00:38:50).
  • The minimum of N and P is denoted K, the smaller of the two values (00:39:04).
  • The singular value decomposition (SVD) is applied to the matrix Y, decomposing it into the product of three new matrices, U, D, and V, with Y = U D Vᵀ (00:39:21).
  • U represents the row singular vectors corresponding to cases, while V represents the column singular vectors corresponding to variables (00:39:38).
  • D is a diagonal matrix containing the singular values, which are ordered and non-negative (00:40:44).
  • U and V are orthogonal matrices, meaning their columns are perpendicular to each other (00:40:02).
  • The singular values in D are ordered, with d1 ≥ d2 ≥ … ≥ dK (00:40:47).
  • The idea of reconstructing a matrix using rank-one layers is discussed, where the singular value is multiplied by the corresponding row and column vectors (00:41:28).
  • The reconstruction can be done by adding layers, but it’s possible to stop at a certain point, denoted as K*, which is less than the total number of layers (00:42:02).
  • The goal is to find a decent approximation of the original matrix, denoted as Y*, without using all the layers (00:42:22).
  • In principal component analysis, it’s generally hoped that most of the singular values are small, with only a few being large (00:43:01).
  • The associated singular vectors or scaled versions of them are studied, which can be obtained by multiplying the singular vectors by the singular values (00:43:17).
  • Reconstruction can be thought of as a sum of layers, each being the product of a singular value, the corresponding column of U, and the transpose of the corresponding column of V (see the sketch after this list) (00:43:41).
  • Normalization of vectors involves stretching or shrinking them, and different normalizations can be used, such as multiplying the singular vectors by the singular values or by their square roots (00:44:18).
  • The choice of normalization depends on how one wants to focus on the data, and it’s essential to understand the different normalizations when reading discussions about PCA (00:44:43).
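
  A minimal sketch of the decomposition and the rank-K* reconstruction in base R, assuming Y is the column-centered and standardized N x P matrix from the pre-processing step:

      s     <- svd(Y)                    # Y = U %*% diag(d) %*% t(V)
      Kstar <- 2                         # keep only the first two layers
      Ystar <- s$u[, 1:Kstar] %*% diag(s$d[1:Kstar]) %*% t(s$v[, 1:Kstar])   # rank-2 approximation Y*
      sum(s$d[1:Kstar]^2) / sum(s$d^2)   # proportion of variance explained by the kept layers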

Interpreting PCA Results: Scree Plots and Variance Explained

  • Standard plot tools used with principal component analysis include scree plots, which help determine how many singular values to keep (00:45:25).
  • A scree plot is a graphical representation of the singular values, which are plotted in descending order, and it helps identify the point at which the values drop off significantly (00:45:48).
  • The goal of a scree plot is to determine how many singular values to keep, and there are various methods for doing so, including looking for a clear drop-off point or using a threshold value (00:46:15).
  • The percent variance explained is a measure of the proportion of the total variance explained by the kept singular values, and it’s calculated as the sum of the squared singular values divided by the total sum of the squares (00:47:34).
  • The percent variance explained is similar to R-squared and provides a way to evaluate the quality of the reconstruction (00:47:43).
  • The reconstruction formula can be used to set the number of singular values to keep, and the rest are set to zero (00:47:20).
  • The concept of variance explained is discussed, where it’s mentioned that if 90% of the variance is accounted for, the variance explained would be 90%, similar to the idea of r-squared (00:48:29).
  • When looking at singular value decomposition output, one typically sees the variance accounted for by each dimension along with the cumulative total (00:48:46).
  • The focus is narrowed down to the most useful outputs, including scree plots, which show how many singular values to keep (see the sketch after this list) (00:49:00).
  • It’s emphasized that dimensions with very small singular values can still be useful, especially when trying to figure out which variables are associated with them (00:49:32).
  • The goal of an analysis might be to understand what’s happening with the small singular values, and it’s suggested to ask why they aren’t associated with reconstructing the data well (00:50:01).
  • Generally, the first few strongest singular values, usually the first 2 or 3, are looked at, but sometimes it’s necessary to look at the smaller ones to understand certain variables (00:50:24).
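
  A minimal sketch of a scree plot built directly from the squared singular values, again assuming Y is the pre-processed matrix:

      d2 <- svd(Y)$d^2
      plot(d2, type = "b", xlab = "Dimension", ylab = "Squared singular value",
           main = "Scree plot")          # look for the elbow / drop-off
      cumsum(d2) / sum(d2)               # cumulative proportion of variance explained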

Radial Plots (Circle Plots) and Variable Interpretation

  • Another output is the radial plot, also known as a circle plot, which shows the axes as the dimensions, representing the coordinate system found by principal components (00:50:59).
  • The axes in the radial plot represent the optimal coordinate system, and the X’s represent the variables (00:52:18).
  • Principal component analysis (PCA) provides a map showing how associated each variable is with each dimension; the angles between variables or axes indicate their association, and the length of a vector indicates how well it is approximated by the coordinate system (see the sketch after this list) (00:52:45).
  • In a 2-coordinate system with 6 variables, if some variables have short vectors, it means they are not well represented by the coordinate system, and the system is primarily driven by the variables with longer vectors (00:53:37).
  • The percent variance explained by each dimension is usually provided, with the first dimension explaining more variance than the second dimension (00:54:10).
  • Angles between vectors indicate the association between variables, with smaller angles indicating stronger associations and wider angles indicating weaker associations (00:54:28).
  • In a radial plot, vector lengths are always less than or equal to one, and the magnitude of the vectors is represented by their proximity to a circle with a radius of one (00:55:22).
  • Radial plots are focused on variables and are useful for understanding how the columns of a matrix relate to each other (00:56:28).
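
  A minimal sketch of the radial (circle) plot for the variables, assuming a fitted FactoMineR PCA object (the object name res is illustrative):

      library(FactoMineR)
      res <- PCA(Y, graph = FALSE)
      plot(res, choix = "var")    # arrows for the variables inside the unit circle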

Case Plots and Clustering

  • In addition to radial plots, Principal component analysis also provides scatterplots of row scores, which are case plots that show the relationships between individual cases (00:57:03).
  • Principal Component Analysis (PCA) can be used to cluster groups in a dataset, allowing for the identification of shared characteristics among cases, such as demographic information or other distinguishing features (00:57:20).
  • By clustering cases after PCA, researchers can gain a better understanding of how the cases relate to each other and identify patterns that may not be immediately apparent (see the sketch after this list) (00:58:21).
  • PCA is commonly used in genetic studies and epigenetics, where it can help identify differences in gene expression between groups with different characteristics, such as diet (00:58:42).
  • In a study on chickens with different diets, PCA was used to reduce a large number of variables into a smaller range of variability, allowing for visualization of the data and identification of distinct regions in the principal component space corresponding to different diets (00:58:55).
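
  A minimal sketch of clustering cases after PCA with FactoMineR’s HCPC (hierarchical clustering on principal components); res is the fitted PCA object as above:

      res   <- PCA(Y, ncp = 5, graph = FALSE)           # keep several components for the clustering step
      clust <- HCPC(res, nb.clust = -1, graph = TRUE)   # automatic choice of the number of clusters
      head(clust$data.clust)                            # original data with an added cluster label per case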

Algebraic Interpretation of PCA

  • Algebraically, principal component analysis finds the optimal set of axes that capture the most variance in the data; in the two-variable standardized case, the first principal component is proportional to the sum of the variables and the second is proportional to their difference (00:59:55).
  • The squared singular values can be used to determine the amount of variance explained by each component, with the first squared singular value being equal to 1 plus the correlation between the variables and the second being equal to 1 minus the correlation (see the sketch below) (01:00:05).
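
  A minimal sketch verifying the two-variable result: the eigenvalues of a 2 x 2 correlation matrix are 1 + r and 1 - r, with eigenvectors proportional to the sum and the difference of the variables (r = 0.6 is just an illustrative value):

      r <- 0.6
      R <- matrix(c(1, r, r, 1), nrow = 2)
      eigen(R)
      # $values : 1.6 and 0.4, i.e., 1 + r and 1 - r
      # $vectors: columns proportional to (1, 1) and (1, -1), up to sign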

Identifying Subgroups and Visualizing Relationships

  • In a real-world example, PCA can be used to identify subgroups in a dataset, such as men and women, and to visualize the relationships between cases (00:57:41).
  • There are statistical methods available to isolate subgroups in a dataset, but the specific method used will depend on the research question and the characteristics of the data (01:01:52).
  • Cluster analysis is one method that can be used to differentiate between groups, but the example being discussed uses an existing variable to label males and females differently, allowing identification of who’s who in a scatter plot (see the sketch after this list) (01:01:56).
  • The scatter plot will show who is male and who is female; this is not directly connected to other methods, but is one of many ways to make these choices (01:02:43).
  • The method being used is related to what was learned about distances in the last lecture, specifically rotating to the Mahalanobis distance space (01:03:21).
  • The Mahalanobis space represents different distances, including angular distance, which is a measure of how different variables are from each other (01:03:39).
  • Angular distance can be used to quantify the difference between variables, but it’s often not necessary and can be eyeballed by looking at how close or far apart the vectors are (01:03:48).
  • The method being used represents all the various interpoint distances that exist in the data, which will be worked with explicitly in multidimensional scaling (MDS) (01:04:36).
  • MDS will use all the interpoint distances to represent the data, and cluster analysis will be revisited in more detail later (01:05:00).
  • The radial plot is used to think about the columns and how different they are from each other in a correlation coefficient kind of way, looking at the variables rather than individual data points (01:05:51).
  • Radial plots are used for plotting correlations among variables, while case plots are used for plotting individual cases, and the correlations determine the axes in the scatter plot (01:06:18).
  • Studying variables and cases can be done by looking across rows or down columns, with radial plots looking down columns and case plots looking across rows (01:06:55).
  • Variables and cases are not directly represented on the same plot, and some programs may misrepresent them together, which can be visually overwhelming and misleading (01:07:15).
  • FactoMineR does not represent variables and cases together unless forced to, as it keeps them separate to avoid visual overwhelm and misleading conclusions (01:07:24).
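
  A minimal sketch of labeling cases by an existing grouping variable in the case plot, assuming a data frame dat with the six measurements plus a factor column sex (names are illustrative); the grouping column is declared supplementary so it does not influence the components:

      library(FactoMineR)
      sex_col <- which(names(dat) == "sex")
      res <- PCA(dat, quali.sup = sex_col, graph = FALSE)
      plot(res, choix = "ind", habillage = sex_col)   # color the cases by sex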

Heads Data Set: Analysis and Interpretation

  • The heads data set is used as an example, which includes 200 male and 59 female participants, with the number of females being too low for good conclusions (01:08:12).
  • Descriptive statistics, including means and standard deviations, are calculated for the data set, showing clear group differences between males and females (01:08:55).
  • Correlation coefficients are calculated, showing positive correlations for males and some negative correlations for females, which could be driven by outliers (01:09:27).
  • Scatter-plot matrices are used to visualize the data, showing marginal distributions and ellipses capturing 50% and 90% of the data (01:10:11).
  • The initial examination of the data involves looking for outliers in the negative correlations, but in this case, no outliers are found (01:10:51).
  • The standard deviations of the variables are fairly close, so centering and standardizing everything is done to make the variables comparable (01:11:33).
  • Centering and standardizing are necessary because the variables need to be comparable to each other in some basic sense to be analyzed (01:11:47).
  • The PCA function from the FactoMineR package is used to run principal component analysis (see the sketch after this list) (01:12:14).
  • The PCA function generates a case plot and a radial plot by default (01:12:34).
  • The radial plot has a circle of radius 1 because the vectors are essentially correlations, which cannot exceed 1 or be less than -1 (01:12:49).
  • The first two dimensions account for about 63% of the variance, but it is unclear if this is enough variance (01:13:27).
  • The first two principal components are examined by default, but it may be necessary to look more deeply at additional components (01:13:39).
  • The variables Bam, Mfb, Ltg, and Ltn are strongly correlated with each other and with the first dimension, which may represent gross size or shape (01:14:22).
  • The second dimension is defined by the variables Lgan (length from glabella to apex nasi) and Tfh (true facial height) (01:14:51).
  • Principal component analysis has its downsides, particularly when compared to methods like exploratory factor analysis, which focuses more on individual variables rather than an optimal representation of all data (01:15:33).
  • If the goal is to understand variables on a more individual basis, running a factor analysis might be a better alternative (01:15:56).
  • PCA is related to factor analysis, but they serve different purposes (01:16:05).
  • In the given example, dimension one and dimension two are unlabeled, and the plot is a scatter plot of the 200 males in the sample, with no clear pattern visible (01:16:22).
  • The plot can be used to look for potential clusters or separation in the data, but in this case, no clear separation is visible (01:17:05).
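
  A minimal sketch of the analysis as described, assuming a data frame heads_m with the six measurements for the 200 males (the name is illustrative):

      library(FactoMineR)
      res <- PCA(scale(heads_m), graph = TRUE)   # default output: a case plot and a radial (variable) plot
      res$eig                                    # eigenvalues and percent variance; the first two dimensions
                                                 # accounted for about 63% in the session's example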

PCA Output and Interpretation

  • Principal component analysis generates a lot of tabular output, including a summary method that provides three basic groups of information (01:17:40).
  • The summary method includes the eigenvalues, which are squared singular values used to calculate the proportion of variance explained (01:18:08).
  • The squared singular values are used to calculate the percentage of variance accounted for by each dimension, with the values dropping off as the dimensions increase (01:18:37).
  • The rate at which the values drop off can indicate the importance of each dimension, with a faster drop-off indicating a more significant reduction in variance explained (01:18:49).
  • The importance of getting a good fit depends on the specific study or application, and may require creating prototypes and testing them to determine the optimal level of fit (01:18:58).
  • The output also includes information on individuals, but the specific abbreviation used is unclear and requires further investigation (01:19:32).
  • The return list of a function contains all the information returned by that function, such as coefficients and standard errors, similar to running a regression (see the sketch after this list) (01:19:58).
  • The function provides coordinates, squared cosines, and contributions, which can be used to analyze the data (01:20:44).
  • The function can display information about individuals, but it is limited to the first 10 people by default, and using the dollar sign can retrieve information about all individuals (01:20:50).
  • The summary of the function is designed to reduce the amount of information provided, making it less overwhelming (01:21:23).
  • The book by Husson and colleagues, “Exploratory Multivariate Analysis by Example Using R,” describes the methods and measures used in the function (01:21:32).
  • The function provides a summary of the coordinates of each individual and variable, as well as the distance, although the specific meaning of the distance is unclear (01:22:09).
  • Cosine squared is a measure of how well represented a point is by a given dimension, with small values indicating poor representation (01:23:09).
  • Guidelines for interpreting the numbers include using small values to indicate trivial representation, and larger values to indicate better representation (01:23:29).
  • Contribution plots can be used to visualize which points are contributing to the different dimensions (01:24:07).
  • The cases plot makes use of the information provided by the function, including the coordinates and contributions (01:24:41).
  • The Mahalanobis distance is mentioned, but its exact meaning in this context is unclear, with a note of caution about the interpretation (01:24:46).
  • The cases plot is not frequently used, and instead, cluster analysis is preferred for analyzing the data (01:25:17).
  • Each variable has a score on each dimension, which corresponds to its coordinates in the radial plot (01:25:45).
  • The contribution of a variable measures how much it contributes to the overall dimension (01:25:58).
  • The term “contribution” is clarified, noting that its abbreviation in the output (ctr) might be confused with “center” (01:26:14).
  • Cosine squared is an r-squared value that measures the relationship between a variable and a dimension (01:26:17).
  • The cosine squared values for variables Ltn and Ltg with respect to Dimension 1 are 0.618 and 0.645, respectively (01:26:48).
  • The contribution values are related to the cosine squared values, indicating how much each variable defines a dimension (01:26:55).
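
  A minimal sketch of pulling the tabular output from the fitted object res (these components are part of FactoMineR’s PCA return list):

      summary(res)       # condensed view: eigenvalues, first individuals, variables
      res$eig            # eigenvalues and (cumulative) percent variance explained
      res$var$coord      # variable coordinates, i.e., the radial-plot positions
      res$var$cos2       # cos^2: how well each variable is represented by each dimension
      res$var$contrib    # contribution of each variable to each dimension
      res$ind$coord      # case (row) scores used in the case plot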

Automated Interpretation Tools and Hypothesis Testing

  • The automated interpretation tools in FactoMineR, such as the dimdesc function, provide r-squared values and correlations between variables and dimensions (01:28:10).
  • The dimdesc function gives an idea of how well-represented variables are by different dimensions and can be used for hypothesis testing (01:28:42).
  • The function can identify variables that are unimportant for a particular dimension, as seen in the example where two variables are deemed unimportant for Dimension 2 (see the sketch after this list) (01:29:01).
  • The importance of variables in Principal component analysis dimensionality reduction is discussed, with Jay Verkuilen mentioning that all six variables are important for Dimension 1, but the importance decreases for subsequent dimensions (01:29:13).
  • Jay Verkuilen expresses skepticism about trusting the hypotheses generated by PCA, citing concerns about the sampling scheme and data gathering methods (01:29:20).
  • The validity of hypothesis tests in PCA depends on the sampling scheme, but PCA is often applied to data where sampling doesn’t make sense (01:29:35).
  • Jay Verkuilen is unsure if the data is a representative sample, which affects the trustworthiness of the p-values (01:29:53).
  • The trustworthiness of p-values depends on the believability of the sampling scheme (01:30:04).
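
  A minimal sketch of the automated description of the dimensions for the fitted object res:

      dimdesc(res, axes = 1:2)   # correlation / R-squared of each variable with dimensions 1 and 2,
                                 # with p-values whose validity rests on the sampling scheme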

Factoextra Package and Biplots

  • Jay Verkuilen introduces the factoextra package, which can be used to extract additional information from Principal component analysis results, but notes that he hasn’t used it much due to limited online examples and the author’s focus on selling a book (01:30:28).
  • Adam suggests that using a different package to create and analyze PCA results might cause compatibility issues (01:31:16).
  • Jay Verkuilen attempts to use the fviz family of functions from the factoextra package to generate additional plots, including a biplot that overlays the variables and cases plots (see the sketch after this list) (01:32:35).
  • Jay Verkuilen expresses his personal preference for separate radial and cases plots over the biplot, citing concerns that the biplot can be misleading and overwhelming (01:33:04).
  • Jay Verkuilen finds the biplot too complicated and difficult to interpret, with too much information on one plot, and prefers to think about the cases in a different way (01:34:08).
  • Journal editors might prefer the biplot because it takes up less space in a journal, but Jay Verkuilen personally finds it more understandable to use separate plots (01:34:55).
  • Jay Verkuilen thinks that the variability on the x-axis and y-axis should be proportional to the amount of variability accounted for by each dimension, and the biplot is constructed to respect this (01:35:42).
  • The biplot tries to prevent inferential mistakes by ensuring that the axes are respected and the variability is accurately represented (01:36:14).
  • Jay Verkuilen prefers to use the radial plot and the cases plot instead of the biplot because he finds them easier to understand and less confusing (01:36:38).
  • The biplot can be thought of as a regression of all variables on the dimensions, but it’s not a traditional regression and can be confusing (01:37:00).
  • Jay Verkuilen didn’t use the biplot in his example because he didn’t like it, but he doesn’t always remember the reasons behind his decisions (01:37:09).
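
  A minimal sketch of the factoextra plots discussed, applied to the fitted object res:

      library(factoextra)
      fviz_pca_var(res)      # variables only (the radial / circle plot)
      fviz_pca_ind(res)      # cases only
      fviz_pca_biplot(res)   # variables and cases overlaid on a single biplot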

Comparing Male and Female Data

  • The summary and dimdesc output for the women will be similar, showing the configuration of variables for the smaller group of female cases, which is already known (01:37:49).
  • Jay Verkuilen suggests comparing the configuration of variables for men and women using the radial plot (01:38:12).
  • When comparing males and females, it’s essential to look at the graphics in the right order to see things that should be compared to each other, such as the configuration of points (01:38:14).
  • The signs of dimensions are arbitrary, meaning the plot might get mirror-imaged or flipped upside down, and there’s no difference because the signs are indeterminate (01:39:01).
  • If the males and females have different orientations in space, it may be necessary to flip one configuration to compare them properly; automated tools like Procrustes analysis make this process less tedious (see the sketch after this list) (01:39:57).
  • When comparing the males, it’s possible to see which variables tend to hang together, such as Lgan and Tfh, and Bam and Mfb (01:40:16).
  • The data suggests that men’s faces are characterized more by overall size and some shape differences, while women’s faces seem to be characterized more equally by overall size and shape differences (01:41:13).
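
  A minimal sketch of aligning two configurations before comparing them, assuming the vegan package’s procrustes function and two matrices of variable coordinates, coord_m and coord_f, from the male and female analyses (the names are illustrative):

      library(vegan)
      fit <- procrustes(coord_m, coord_f)   # rotate/reflect the female configuration onto the male one
      plot(fit)                             # arrows show how far each variable moves after alignment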

Data Understanding and Dimensionality

  • It’s essential to have a deep understanding of the data, including photographs of the subjects, to really understand what’s happening and not just rely on the analysis (01:41:48).
  • The summary of the data can provide a quick overview, but it’s crucial to look beyond the summary to gain a deeper understanding of the results (01:42:13).
  • To account for 90% of the variance, 5 dimensions are required for the general population, whereas 4 dimensions are sufficient for the males (01:42:27).
  • The complexity needed to understand the data depends on what needs to be measured (01:42:40).

Factoshiny: Interactive Exploration and Reporting

  • The Factoshiny package can be used to generate a web page with analysis and graphical options, allowing users to modify graphs and explore the data (see the sketch after this list) (01:43:15).
  • The web page provides various graphical options, including the ability to plot different axes and view scatter plots (01:43:52).
  • Users can manipulate graph parameters and zoom in on specific areas of the graph (01:44:41).
  • Factoshiny also allows for clustering and automatic report generation, which can be useful for generating reports and saving code (01:44:59).
  • The automatic report generates a Word document with output and a summary of findings, which can be a good starting point for writing a report (01:45:29).
  • The automatic report is not as good as a handwritten report, but it is a useful option for generating reports quickly (01:45:47).
  • Users can save the code used to generate the report, making it easy to run multiple analyses and generate automated reports (01:46:02).
  • The process of merging files and editing them down to what is needed is more efficient than manually running and copying and pasting everything, especially when running multiple analyses (01:46:34).
  • The output options for the process can be customized, although the specific details of the options are not recalled (01:46:49).
  • Downloading graphics from the process is also an option, making it a fantastic choice, especially when running multiple analyses for a dissertation (01:47:11).
  • Running multiple analyses can be a lot of work, but this process provides a nice option to make it easier (01:47:22).
  • The ability to generate supplementary output, such as scree plots, is a useful feature that is hoped to be implemented in other software and R packages (01:47:39).
  • The factoextra package has a feature that allows for the generation of nice output, which is a useful tool for interpreting analyses (01:47:56).
  • This ecology of R packages has tools that can help with interpreting analyses and generating good output (01:48:18).
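
  A minimal sketch of launching the interactive interface, assuming the Factoshiny package and the heads_m data frame from earlier:

      library(Factoshiny)
      PCAshiny(heads_m)   # opens a browser app with plot options, clustering, and report generation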

Scree Plots and Determining the Number of Singular Values

  • Scree plots will be discussed in more detail later in the semester when talking about inference (01:48:40).
  • The scree plot is generated using the factoextra package and shows the actual, observed eigenvalues, which can be used to determine the number of singular values to keep (01:49:04).
  • Multiple methods exist for determining the number of singular values to keep, including parallel analysis, optimal coordinates, and the acceleration factor, which are implemented in the nFactors package (see the sketch after this list) (01:50:04).
  • Parallel analysis refines the eigenvalue-greater-than-one criterion, which states that an eigenvalue greater than one accounts for more variability than a single variable (01:50:15).
  • Optimal coordinates in Principal component analysis draw a line through the points, similar to a regression line, to identify outliers and keepers, which are points above the line (01:50:49).
  • Bootstrapping in PCA involves running simulations from the data set to obtain Eigenvalues or squared singular values, which can be used to determine the number of factors to keep (01:51:14).
  • The Eigenvalues obtained from bootstrapping can be used to assess the number of factors to keep, with values greater than one indicating important factors (01:51:41).
  • The paper accompanying the nFactors package describes various methods for determining the number of factors to keep, including different criteria and rules of thumb (01:52:50).
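
  A minimal sketch, assuming the nFactors package is the one being described; it applies several retention rules to the observed eigenvalues:

      library(nFactors)
      ev <- eigen(cor(heads_m))$values   # eigenvalues of the correlation matrix
      ns <- nScree(x = ev)               # Kaiser, parallel analysis, optimal coordinates, acceleration factor
      ns$Components                      # number of components suggested by each rule
      plotnScree(ns)                     # scree plot annotated with the rules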

Supplementary Data and Group Comparisons

  • Supplementary data in Principal component analysis refers to points or variables that are not used to compute the overall point cloud but are projected into the space, allowing for the analysis of additional data (01:53:19).
  • Supplementary points can be used to analyze the relationship between different groups, such as males and females, by projecting one group into the space defined by the other group (01:53:46).
  • Variables can also be made supplementary, allowing for the analysis of additional variables in the same space (01:54:26).
  • Categories can be made supplementary as well, enabling the analysis of categorical data in the same space (see the sketch after this list) (01:54:41).
  • The plot of supplementary data can show the relationship between different groups or variables, with the black dots representing the original data points and the blue dots representing the supplementary points (01:55:10).
  • The representation of females in the space defined by the male soldiers is being examined to see how well they are represented by the existing system, with the ideal outcome being a fairly overlapping distribution of female cases within the general vicinity of the male cases (01:55:14).
  • However, it is observed that there are many female cases that are not appearing within the point cloud, indicating that the existing system may not be suitable for females (01:55:55).
  • A similar analysis can be applied to the Parkinson’s data by comparing the controls and clinical cases to assess how well the clinical cases are represented by the system defined by the controls (01:56:06).
  • Visualizing the data can be a powerful tool to understand the overlap between different groups, such as the black dots and blue dots, and to determine how well a system that works for one group will work for another (01:56:29).
  • The lack of overlap between the black dots and blue dots suggests that a system that works for the black dots may not be effective for the blue dots (01:56:45).
  • A report on this analysis might include a plot of the fitted PCA object (referred to in the session as hf.sub.PCA) to visualize the probability ellipses around the different groups (01:57:07).
  • The plotellipses function in FactoMineR can be used to compute confidence ellipses for the different groups and to visualize how distinct they are from each other (01:58:17).
  • This type of analysis can be applied to supplementary data by computing the PCA and using the formulas to project new cases into the space (01:59:10).
  • The space in a plot can be used to visualize supplementary variables, which appear as dashed lines, allowing for an understanding of how well a variable is represented by the existing structure (01:59:33).
  • This method can be used to assess for outliers by identifying individuals or data points that are not well-represented in the existing space (02:00:15).
  • Supplementary individuals can be visualized in the plot, showing their distances from the center of the point cloud, which are likely Mahalanobis distances (02:00:50).

Multi-Dimensional Scaling (MDS) and Future Topics

  • The next topic to be covered is multidimensional scaling (MDS), which focuses on analyzing the distances between data points (02:01:37).
  • MDS is closely related to Principal component analysis but offers more freedom and allows for interesting analyses that cannot be done with PCA (02:02:01).
  • The topics to be covered after MDS include correspondence analysis and clustering (02:02:21).