MVA Class 2 — Distance

Class Logistics and Introduction
- The class recordings will be available for students to watch again
if needed, and the instructor had to briefly pause the recording due to
an issue with their cat (00:00:28).
- The instructor discovered typos and mistakes in the syllabus on the
day of the class, but they are minor and the online version is correct
(00:02:08).
- The class is a multivariate analysis and machine learning class,
with a focus on multivariate statistics taught from a modern angle
(00:02:47).
- There will be a full machine learning class in the spring that will
cover more material, but this class will still cover some machine
learning topics (00:02:51).
Instructor and Communication
- The instructor’s name is Jaber, and students can contact him via
direct message on Teams, as email and phone may not be reliable (00:03:30).
Course Content and Overview
- The class covers multivariate data from a modern perspective, and
the instructor will add to the existing knowledge in the field (00:04:23).
- An example of multivariate data is text analytics, such as comparing
coded qualitative interviews, which will be covered later in the
semester (00:05:08).
- Image analysis is a topic that the instructor wants to incorporate
into the class but finds too difficult due to data handling requirements
(00:05:59).
- A grayscale image can be thought of as a matrix of numbers, which
allows for image analysis and means it can be treated as multivariate
data (00:06:18).
- Color images are more complicated than grayscale images and require
advanced math to work with, and will be touched upon in the context of
color vision and preference data (00:06:41).
- Preference data involves comparing different stimuli and deciding
which one is preferred, with preference relationships being potentially
complicated (00:07:03).
- Educational and psychological test data, such as the PHQ-9 or state
accountability tests, are examples of multivariate data that are usually
analyzed using methods like item response theory (IRT) or structural
equation modeling (00:07:29).
- Genetic studies, such as an epigenetic experiment with 49 subjects
and 7,000 genes, can be analyzed using multivariate analysis (00:08:02).
- Biosensor data, like electroencephalography (EEG) data with multiple
measurement spots on the head, can be analyzed using multivariate
analysis (00:08:42).
- Allometry, the study of shape, can be applied to understand how
things change over time, such as how children grow and develop
(00:09:01).
- Regression analysis often involves predicting one outcome with a
multivariate space of predictors, and understanding the relationships
between the predictors can help diagnose issues like collinearity
(00:09:54).
- Multivariate data often require data reduction to visualize and
analyze, as it is not possible to visualize a large number of variables,
such as 7,000 variables (00:10:23).
- Data reduction involves reducing the dimensionality of the data to
make it more manageable, and also involves removing noise and redundancy
(00:10:27).
- The course will cover multivariate analysis methods, focusing on
linear algebra, and will also touch on other mathematical concepts, with
the goal of familiarizing students with important math for multivariate
analysis (00:11:13).
Statistical Approach and Methods
- In statistics, having a grounding in the substance of what is being
studied is crucial for doing good analysis and answering useful and
important questions (00:11:34).
- There are different schools of thought in approaching statistics,
and the approach taken in this course comes from the French or Dutch
schools, which focus on theoretically motivated data description over
formal statistical inference (00:12:37).
- This approach differs from the Anglo-American school, which focuses
on hypothesis testing; hypothesis testing will not be the primary focus
of the course (00:12:56).
- The course will cover topics such as resampling statistics,
bootstrapping, and other methods, but will not focus on topics like
multivariate analysis of variance (MANOVA), which is considered more
relevant to other classes like HLM or SEM (00:13:05).
- The French School of statistics was known for working on smaller
problems and using smaller data sets, and the course will cover both
small and big data sets (00:14:11).
- The methods covered in the course have been developed for
high-dimensional environments, where the number of variables exceeds the
number of observations, and will address issues that arise in such
situations (00:15:08).
- The course will spend time thinking about how to deal with more
variables than cases, which is a common problem in high-dimensional data
analysis (00:15:42).
Multivariate Analysis and Machine Learning
- The border between multivariate analysis and machine learning is not
clearly defined, and the last third of the class will cover machine
learning topics that build upon previously studied concepts (00:15:47).
Class Recordings and Access
- The class is recorded, and a Zoom link will be provided after a
couple of hours of processing, usually posted on the day of class or the
day after (00:16:26).
Prerequisites and Software
- Course prerequisites include familiarity with algebra, and linear
algebra will be covered in the class, including matrices and other
related topics (00:17:01).
- The class will use R, and students are recommended to have prior
knowledge of R or take an introduction to R before the class, as the R
code used will be kept simple (00:17:32).
- Students should be familiar with R data types, such as vectors,
matrices, and data frames, as well as operations like coercion between
types (00:18:25).
In-Class Activities and Code
- The instructor will post a link to the code for in-class activities,
usually a day before or on the day of class, which will cover 90-95% of
the material (00:19:04).
- The instructor will occasionally update the code and post a fixed
version if mistakes are found during class (00:19:38).
- Students are encouraged to follow along with the code during class
(00:19:51).
- The instructor will be doing examples in the last hour of class and
encourages students to follow along and ask questions (00:19:55).
R Packages
- Several packages will be used in the class, including stats, MASS,
psych, FactoMineR, vegan, smacof, ellipse, and anacor (00:20:10).
- FactoMineR will be used the most, as it does 80% of what is needed
and has some really nice features, as well as extra packages that do
machine learning things that integrate nicely with FactoMineR (00:20:37).
Recording Policy and Availability
- The instructor’s policy is not to delete recordings; instead, they
stay around until everyone loses access to them, and then Zoom
auto-deletes them after a while (00:21:39).
- Students will have access to the recordings throughout the semester,
but are asked to be judicious about sharing them with others (00:21:56).
- If the recorder is not turned on or something happens during class,
the instructor will redo the class and make a new recording (00:22:39).
- There are also additional videos available that provide a more
leisurely tour through matrices, which will be posted for students to
access (00:23:01).
- The videos can be watched at 1.5 speed for review, and the
instructor assumes that students will use them as a resource (00:23:26).
- The videos will be available under the video section in Teams, and
the instructor no longer needs to upload them manually due to changes in
Zoom’s video retention policy (00:23:39).
- Links to Zoom recordings will be posted in the video section of the
team, and they will include a passcode for access, which will be
provided (00:24:32).
Textbook and Recommended Readings
- The textbook for the class is by Härdle and Simar, and the
instructor will show the PDFs later in the class (00:24:59).
- The instructor has a list of recommended books, including
“Understanding Biplots” by Gower, Lubbe, and le Roux, which is not
available as a PDF, and “Modern Psychometrics with R” by Patrick Mair,
which is highly recommended and has been uploaded (00:25:12).
- Another recommended book is “Multivariate Analysis” by Mardia, Kent,
and Bibby, which was written in 1979 and has a lot of math, but is a
rewarding book for those who want a deep knowledge of the topic (00:26:02).
- A second edition of “Multivariate Analysis” was recently published,
which was a surprise to the instructor, and it is also a good resource
(00:26:59).
Assessment and Grading
- The course assessment will consist of three problem sets, which will
be due in late September, early November, and during finals week, and
the first assignment will be uploaded soon (00:27:52).
- The problem sets are not cumulative, but the material is, so it is
essential to understand the earlier material to make sense of the later
material (00:28:11).
- The instructor often uses the same data sets multiple times, so
students will work with the same data set in different assignments to
gain experience and see how it is used (00:28:27).
- Homework assignments will be handled through Microsoft Teams, and
students will need to upload their responses in PDF format to receive
their grades back through the platform (00:29:59).
- If there are issues uploading files to Microsoft Teams, students can
email their files to the instructor, who will figure out a solution (00:30:29).
- Students are required to submit their code in a separate file, which
can be included at the end of the PDF if desired (00:30:46).
- It is essential to interpret results and not send uninterpreted
output or code, as the point of the assignment is to analyze and think
about the results, not just send code (00:31:01).
- If assignments are submitted without interpretation, they will be
returned and marked late (00:31:25).
- There is no need to submit assignments as R Markdown documents, as
some cases may cause R Markdown to fail and require complicated
workarounds (00:31:49).
- The instructor may use the “try” command in R Markdown to allow R to
fail, which can help students understand how to handle failures in their
analyses (00:32:03).
- Students may draw graphs using a stylus or pencil and paper (00:33:05).
- The instructor plans to move to Brightspace, a new learning
management system (LMS), which is considered better than Blackboard
(00:29:21).
- When submitting assignments, put problems in the specified order,
but it doesn’t matter what order they are completed in, just ensure they
are in the correct order in the document (00:33:12).
- Merge separate PDFs and optimize the file to avoid extremely large
documents, as files over 100 gigabytes are unnecessary for the class (00:33:30).
- The scoring rubric is straightforward, with guidelines for a good
answer including reasonable work shown, explanation of reasoning, and
not needing to be overly lengthy (00:34:02).
- Indicate what is written thought, computer code input, and computer
output, and do not mistake volume for quality, as shorter answers are
preferred (00:34:52).
- Late assignments will be penalized 10% if submitted more than a week
past the due date, but it’s essential to stay on top of homework to
avoid falling behind in the cumulative material (00:35:48).
- The grading scale used is the Grad Center standard, which may be
subject to change, but has remained consistent in the past (00:36:20).
- Incompletes are no longer loosely handled by the Grad Center, and
there is no discretion for the instructor to make exceptions, so it’s
essential to follow the rules (00:36:41).
- If an incomplete or unofficial withdrawal is needed, it’s crucial to
have a conversation about it early on and discuss a plan with the
department chair (00:37:17).
- If a student is having trouble keeping up with the coursework, they
should inform their advisor or someone above their level, so that their
instructors can be informed and necessary arrangements can be made (00:37:49).
- Incompletes and unofficial withdrawals (WUs) are governed by Grad
Center rules, and students should familiarize themselves with these
rules (00:38:37).
- International students or students with funding dependencies should
consult with their advisor and relevant offices before taking any
incompletes or withdrawals (00:39:00).
Course Structure and Topics
- The course is roughly divided into four parts, with the first five
classes covering concepts such as proximities, distances, matrices,
covariance matrices, and related quantities (00:39:47).
- The course will cover five different multivariate techniques:
principal component analysis (PCA), multidimensional scaling,
correspondence analysis, clustering, and discriminant analysis
(00:41:22).
- These techniques are mostly unsupervised learning methods, except
for discriminant analysis, which is a supervised learning method (00:41:39).
- The course reading materials will be uploaded to the General files
section, and students can request additional materials if needed (00:41:59).
- The five multivariate techniques covered in the course are
variations on a theme, with PCA being a generalized form that relates to
the other techniques (00:42:24).
- The topics to be covered are related to each other in various ways,
and the goal is to provide a comprehensive viewpoint on how they are
connected (00:42:33).
- The topics include regularization, dealing with situations where N
is less than P, and using an autoencoder, which is a machine learning
method closely related to PCA (00:42:51).
- An autoencoder is a generalization of principal component analysis.
The course will also explore aspects of statistical inference, including
jackknifing, bootstrapping, and permutation (00:43:09).
- Statistical inference will be covered, focusing on computationally
intensive methods, and measuring performance using cross-validation,
ROC, precision, and recall (00:43:41).
- Cross-validation is a method for measuring the performance of models
by taking a small chunk of the data out, fitting the model on the rest,
and then seeing how well it predicts the held-out data (00:44:15).
- The last example to be covered is spam filtering, a text analytics
application that involves various techniques, including
cross-validation and Mahalanobis distances (00:44:54).
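
The cross-validation idea described in this section (hold a chunk out,
fit on the rest, score the held-out chunk) can be sketched as follows.
The class works in R; this is only an illustrative Python sketch. The
function names, the toy values, and the mean-only "model" are
assumptions made for the example, not anything taken from the course
code.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_mse(y, k=5, seed=0):
    """Hold out each fold in turn, fit a mean-only model on the rest,
    and return the average squared prediction error on held-out cases."""
    folds = k_fold_indices(len(y), k, seed)
    squared_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [y[i] for i in range(len(y)) if i not in held]
        prediction = sum(train) / len(train)  # the fitted "model" is just the training mean
        squared_errors.extend((y[i] - prediction) ** 2 for i in held_out)
    return sum(squared_errors) / len(squared_errors)

# Hypothetical toy outcome values, not course data.
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
folds = k_fold_indices(len(y), k=5)
cv_mse = cross_validated_mse(y, k=5)
print(folds)
print(cv_mse)
```

Any model could stand in for the training mean; the point is only that
every case is predicted by a model that never saw it.
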
Syllabus and Learning Approach
- The syllabus may be subject to change, and there may be a guest
speaker for the spam filtering topic (00:45:28).
- When reading the syllabus, it’s recommended not to get bogged down
in the details, but rather to skim and revisit important points as
needed (00:46:02).
- It’s recommended to ask questions in class and approach the work
without feeling the need to know everything, as no one can know
everything (00:46:51).
Course Materials and Resources
- All necessary files for the class will be placed in the “General
Files” and “Class Materials” sections, making it easier to find the
required materials (00:47:47).
- If students need to message the instructor, they can use the chat
option in Teams or send an email, but Teams alerts are more reliable,
especially on mobile devices (00:48:22).
- Homework assignments will be posted in the “Assignments” section,
and students will be required to upload their work there (00:49:01).
- If hidden channels appear in Teams, students should notify the
instructor, who will make the channel visible (00:49:13).
- Teams is primarily designed for business use and is not a learning
management system (LMS), but it is still a useful tool for the class
(00:49:37).
Example Dataset and Analysis
- The instructor will be using R code and data from a book by Bernhard
Flury, a specialist in multivariate analysis, to work through an example
(00:50:14).
- The book by Bernhard Flury is highly recommended, but it is
challenging and has a lot of mathematical content (00:50:36).
- Although the book is not assigned, it is available for students to
use, and the instructor will be using some examples from it in the class
(00:51:41).
- A dataset from a book contains facial measurements designed to
create better masks, which were collected in the 1980s (00:52:08).
- The data were found to be interesting and relevant during the
COVID-19 pandemic, despite being old, as they could be used to
understand physical measurements and the challenges associated with them
(00:53:27).
- The dataset is simpler compared to more complex data used in facial
recognition or automated emotion capture, but it still provides insight
into physical measurements (00:54:16).
- The measurements in the dataset are referred to as “landmarks,”
which are used to track specific points on the face (00:56:20).
- The reason landmarks are used is that some facial measurements, such
as the distance from the nose to the tip, do not change significantly,
making them useful for analysis (00:56:36).
- A professor at New York University, Ian Reed, was mentioned as
someone who worked on facial expression modeling, and another professor,
Josh Aronson, was mentioned as someone known for his work in the field
(00:55:28).
- The conversation also mentioned a professor who worked on
micro-expression detection research, analyzing videos to detect subtle
facial expressions (00:55:05).
- Landmark measures are typically used to track facial features that
do not change from one expression to another, making them useful for
tasks like creating masks that fit different face shapes (00:56:50).
- These measures can be used to focus on the parts of the face that
change, rather than tracking extra features, and are essential for
creating accurate masks (00:57:08).
- The measurements include distances from the point of the chin to the
ear, the bridge of the nose to the ear, and other similar metrics (00:57:23).
- A study measured 25 variables in 900 members of the Swiss Army, but
only six of these variables are considered most important for the fit of
protective masks (00:58:01).
- The six key variables are defined in Figure 1.5, and data are
available for 200 male soldiers and 59 female soldiers (00:58:11).
- The data will be used for discriminant analysis to determine if it
can distinguish between males and females, and if there is enough
information to generate a good analysis for each gender (00:58:31).
- The data and textbooks are available in PDF format, including works
by Husson, Lê, and Pagès and by Härdle and Simar (00:59:12).
- The software FactoMineR will be used for analysis, and it provides a
rundown of its capabilities and examples (00:59:45).
- The necessary data and code have been uploaded, including the
FluryHData package, which contains the required datasets (01:00:21).
- To install the FluryHData package, users need to install it from a
local repository, selecting the package archive file and installing it
(01:00:52).
- The data sets used are not particularly interesting from a modern
perspective, but they serve as useful teaching examples (01:01:38).
- The data set is organized into two pieces: males and females, with a
binary indicator variable “female” where 0 represents men and 1
represents women (01:02:12).
- The data set has 259 rows and 7 columns, and it is a data frame
rather than a matrix because the “female” variable was made a factor (01:03:40).
- A factor in R is a coded categorical variable with a special data
type (01:04:11).
- The warnings in R can be turned off to avoid repetitive warnings,
especially when resizing plot windows, but it’s essential to turn them
back on to catch important warnings (01:04:50).
- The solution to avoid excessive warnings is to add a line of code at
the beginning of the file to turn off warnings, and then turn them back
on when needed (01:05:35).
- The screen is usually arranged to prioritize the code, with the data
sets section minimized since it’s already familiar (01:06:00).
- Multivariate analysis involves developing special methods to handle
larger data sets with higher dimensionality than those typically handled
in introductory statistics (01:06:39).
- In introductory statistics, smaller data sets are usually dealt
with, but as data sets grow, it becomes necessary to cope with larger
dimensionality and more variables (01:06:52).
- Descriptive statistics is crucial for identifying problems in data,
such as incorrect or impossible entry values, and should be run when
working with a new data set, regardless of the complexity of the
analysis (01:07:44).
- Descriptive statistics can be used to visualize and understand
higher dimensional data and lower dimensional spaces, making it easier
to identify patterns and issues (01:07:15).
- The summary function in R provides a lot of information, including
means and quantiles (minimum, quartiles, and maximum), but can be
overwhelming and mixes different types of data (01:09:27).
- The summary function treats factors, such as the variable “female”,
differently than other variables, providing counts instead of
descriptive statistics (01:10:24).
- The measurements in the data set are in millimeters and appear to be
plausible, with no negative or zero values (01:10:58).
- A general rule in statistics is that if the mean and median differ
significantly, the data is likely not symmetric, but in this case, the
mean and median are similar (01:11:26).
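
The mean-versus-median rule of thumb above can be checked numerically.
This is a small Python sketch (the class itself uses R); the function
name and both toy samples are hypothetical and are not the head
measurement data.

```python
import statistics

def mean_median_gap(values):
    """Return (mean, median, gap), where gap is (mean - median) in SD units."""
    mean = statistics.fmean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 denominator
    return mean, median, (mean - median) / sd

# Hypothetical measurements in millimeters.
symmetric = [118.0, 120.0, 122.0, 124.0, 126.0]
skewed = [118.0, 120.0, 122.0, 124.0, 180.0]  # one very large value

_, _, sym_gap = mean_median_gap(symmetric)
_, _, skew_gap = mean_median_gap(skewed)
print(sym_gap, skew_gap)
```

A gap near zero is consistent with symmetry; a large positive gap
suggests a long right tail. Expressing the gap in SD units is a design
choice for the sketch so that it does not depend on the measurement
scale.
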
- The term “marginal” in statistics refers to averaging over a
particular variable, such as females and males, and can be used to
understand the data from different perspectives (01:11:55).
- The means and standard deviations of different facial measurements
are being analyzed, with the means varying and the standard deviations
not differing strongly from each other, except for a few cases where
they are almost twice as large (01:12:36).
- Physical measurements tend to have larger standard deviations for
bigger measurements, which is a common phenomenon (01:13:25).
- The measurement with the higher standard deviation is the minimal
frontal breadth (MFB), with a value of 6.8, compared to LGAN with a
value of 4.4 (01:13:50).
- The correlations between different measurements are all positive,
indicating a notion called size, where larger measurements are
associated with larger measurements on average (01:15:28).
- The correlations are not extremely large, with the largest being
around 0.6 and the smallest being around 0.1 (01:15:45).
- When analyzing the means and standard deviations separately for
males and females, it is found that male means are larger than female
means, but female standard deviations are larger than male standard
deviations, which is an unusual finding (01:16:48).
- The differences in means and standard deviations between males and
females suggest that it may not be appropriate to analyze the data with
males and females merged together (01:17:29).
- The analysis of the data will be done separately and then merged to
see the results from different approaches (01:17:34).
- A question is raised about the possibility of the standard deviation
of the female sample going down if a larger sample size were obtained,
given that the current sample size is smaller (01:17:43).
- It is acknowledged that in smaller samples, outliers can
significantly impact the results, and a larger sample size may reduce
the effect of these outliers (01:18:22).
- The importance of using the “hand up” feature in the online class is
emphasized to facilitate communication and avoid confusion (01:18:36).
- The conclusion drawn from the analysis is that the male sample is
reliable, but the female sample is too small to draw conclusive results
(01:19:24).
- The data will be revisited throughout the semester to explore
different methods of analysis (01:19:36).
- The possibility of using an autoencoder for outlier identification
is mentioned, as autoencoders are effective in identifying outliers (01:20:12).
- The importance of having a sufficient sample size is emphasized, and
it is suggested that a sample size of 59 women may not be enough to draw
conclusive results (01:20:47).
- The limitations of the data, including the fact that it is from the
1980s and may not be representative of the original data set, are
acknowledged (01:21:22).
- The correlations between variables were analyzed separately for
males and females, revealing a negative correlation in the female data
set that was not present in the male or mixed data sets (01:22:24).
- The correlations in the female data set seemed lower compared to the
male data set, while the male correlations were slightly higher (01:23:07).
- When two disparate groups are mixed, the pooled correlation can be
inflated because differences between the group means get folded into it,
an effect referred to as “groupness” (01:23:35).
- The correlation can be affected by the way the groups are mixed, and
some of the correlation may be due to the differences between the groups
(01:24:00).
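
The effect of mixing groups on a correlation can be shown with a
constructed example. In this Python sketch (the class uses R), the two
groups are hypothetical: each has exactly zero correlation internally,
but the second group is shifted upward on both variables.

```python
import math

def pearson(x, y):
    """Pearson correlation computed from centered sums of products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def corr_of(points):
    """Correlation between the two coordinates of a list of (x, y) pairs."""
    xs, ys = zip(*points)
    return pearson(xs, ys)

# Hypothetical groups: a square of points has zero correlation.
group_a = [(0, 0), (0, 1), (1, 0), (1, 1)]
group_b = [(x + 10, y + 10) for x, y in group_a]  # same shape, larger means

r_a = corr_of(group_a)
r_b = corr_of(group_b)
r_pooled = corr_of(group_a + group_b)
print(r_a, r_b, r_pooled)
```

Within each group the correlation is zero, yet the pooled value is
close to 1, because all of it comes from the difference between the two
group means.
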
- Box plots are used to visualize the data, and they are oriented
around quantiles rather than means (01:25:42).
- The notch in the box plot represents an approximate confidence
interval for the median, indicating whether the medians are different
between groups (01:25:40).
- The box plot shows more variability in the female data set compared
to the male data set (01:26:08).
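
The notch mentioned above is commonly drawn at the median plus or minus
1.58 times the interquartile range divided by the square root of n,
which is the convention R's box plots follow. The Python sketch below
(the class uses R) computes that interval; the function name and toy
values are hypothetical, and it uses ordinary quartiles rather than R's
hinges, so results may differ slightly from R's output.

```python
import math
import statistics

def notch_interval(values):
    """Approximate confidence interval for the median: median +/- 1.58 * IQR / sqrt(n)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    half_width = 1.58 * (q3 - q1) / math.sqrt(len(values))
    return median - half_width, median + half_width

# Hypothetical toy sample.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
low, high = notch_interval(values)
print(low, high)
```

If the notches of two groups do not overlap, that is rough evidence
that their medians differ.
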
- Exploratory data analysis, such as the one being done, is essential
for understanding the data set and catching potential problems or issues
(01:26:21).
- Quantile plots can be used to further analyze the data, and in this
case, the plots will be broken down by males and females (01:26:46).
- QQ plots are used for assessing normality, and they can help
determine whether data appear to be normally distributed (01:26:54).
- The QQ plot for the whole data set shows that it does not appear to
be Gaussian, but within each group, the data appear to be Gaussian
(01:27:13).
- Simulation envelopes are used to show how different the sample
statistics could be and still be considered Gaussian (01:27:43).
- QQ plots can be created in R with the base function qqnorm() or with
qqPlot() from the car package, which also goes by the alias qqp()
(01:27:58).
- The data appear non-Gaussian marginally, but this is because the
data are a mix of two populations (women and men) that differ strongly
on the variable being tested (01:28:49).
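
One way to build a simulation envelope for a normal QQ plot is to draw
many Gaussian samples of the same size, sort each one, and record the
range that each order statistic typically covers. The Python sketch
below (the class uses R) does this pointwise. It is a generic
illustration under stated assumptions, not the method any particular R
function uses; the function names, the number of simulations, and the
toy sample are all hypothetical.

```python
import random
import statistics

def standardized_sorted(sample):
    """Convert a sample to z-scores and return them in increasing order."""
    m = statistics.fmean(sample)
    s = statistics.stdev(sample)
    return sorted((v - m) / s for v in sample)

def simulation_envelope(n, n_sim=999, level=0.95, seed=1):
    """Pointwise envelope for the sorted z-scores of a Gaussian sample of size n."""
    rng = random.Random(seed)
    by_rank = [[] for _ in range(n)]
    for _ in range(n_sim):
        sim = standardized_sorted([rng.gauss(0.0, 1.0) for _ in range(n)])
        for rank, value in enumerate(sim):
            by_rank[rank].append(value)
    tail = (1.0 - level) / 2.0
    lo_idx = int(tail * n_sim)
    hi_idx = n_sim - 1 - lo_idx
    lower, upper = [], []
    for values in by_rank:
        values.sort()
        lower.append(values[lo_idx])
        upper.append(values[hi_idx])
    return lower, upper

def ranks_outside(sample, lower, upper):
    """Ranks at which the sample's sorted z-scores leave the envelope."""
    z = standardized_sorted(sample)
    return [i for i, v in enumerate(z) if v < lower[i] or v > upper[i]]

# Hypothetical mixture of two groups with different means.
sample = [108.0 + 0.5 * i for i in range(15)] + [128.0 + 0.5 * i for i in range(15)]
lower, upper = simulation_envelope(len(sample))
outside = ranks_outside(sample, lower, upper)
print(outside)
```

Points that fall outside the envelope are more extreme than most
Gaussian samples would produce at that rank.
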
- The car package has a companion book, “An R Companion to Applied
Regression”, that contains useful information and examples (01:29:34).
- The car package also has an upgraded version of scatterplot matrices
that can show all pairs of variables, quantile plots on the diagonals,
and probability ellipses (01:29:57).
- The R Commander package can be helpful in creating graphs and code,
especially for complex functions, and can be used to interface with the
FactoMineR package (01:30:36).
- FactoMineR commands can be complicated, with many options, making
them hard to figure out without a GUI, which can lay out the options in
a helpful way (01:32:00).
- A plot version of the data shows negative correlation and potential
issues with the mfb variable, which may not be a problem but is worth
noting (01:32:42).
- The plot also shows that females tend to be smaller than males, with
some possible outliers (01:33:47).
- Ellipses in the plot are useful for diagnosing variability, but this
type of analysis is not feasible with a large number of variables (01:33:39).
- The car team is praised for their well-done regression analysis
plots and visualizations, but these are not suitable for large datasets
(01:34:24).
- A heatmap function in base R can perform cluster analysis on cases
and variables separately, trying to identify which variables and cases
are more similar to each other (01:35:13).
- The heatmap shows that LGAN has the smallest mean and LTG has the
largest mean, which is identified by the cluster analysis (01:36:34).
- Comparing LTG, LGAN, and the remaining variables, LTG has the largest
value, while the rest are somewhat similar to each other, but the
cluster analysis isn’t very useful in this case (01:36:48).
- When there’s a massive trend with totally different means,
subtracting the means out can help get rid of the trend and look at the
data again (01:37:15).
- The scale function in R is used to center and/or scale the columns
of a matrix by subtracting out the means and possibly dividing by the
standard deviations (01:37:35).
- The sweep function is a more advanced function that allows for
centering, scaling, multiplying, and other operations in a column-wise
or row-wise manner (01:38:18).
- The scale function is used to remove the overall mean from each
column, and then a heatmap is generated to show which variables seem to
be more or less similar to each other (01:38:49).
- The heatmap shows that there are pockets of variables that are very
similar to each other, such as TFH (true facial height) and MFB, which
are very low and very small, respectively (01:39:35).
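
The centering and scaling step described above amounts to subtracting
each column's mean and, optionally, dividing by its sample standard
deviation, which is what R's scale() does with its default settings.
The Python sketch below mirrors that arithmetic for illustration only;
the function name and the toy matrix are hypothetical.

```python
import math

def scale_columns(rows, center=True, scale=True):
    """Center and/or scale each column of a list-of-rows matrix."""
    n = len(rows)
    p = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(p)]
    scaled_columns = []
    for col in columns:
        mean = sum(col) / n if center else 0.0
        values = [v - mean for v in col]
        if scale:
            # Root mean square with an n - 1 denominator; after centering
            # this equals the sample standard deviation.
            spread = math.sqrt(sum(v * v for v in values) / (n - 1))
            values = [v / spread for v in values]
        scaled_columns.append(values)
    return [list(row) for row in zip(*scaled_columns)]

# Hypothetical measurements: two variables with very different means.
rows = [[100.0, 10.0], [110.0, 12.0], [120.0, 14.0]]
z = scale_columns(rows)
centered_only = scale_columns(rows, scale=False)
print(z)
print(centered_only)
```

After scaling, both columns are on the same footing, so a heatmap or
cluster analysis is no longer dominated by whichever variable has the
largest mean.
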
- The FactoMineR package is introduced; it will be used a lot
throughout the course and has an ecosystem of associated packages,
including missMDA for missing data analysis and multivariate analysis
(01:40:09).
- Factoshiny is a package that allows for interactive visualizations
and can be deployed to a web page and run on the web (01:41:03).
- FactoInvestigate is a package that generates automated
interpretation of the results (01:41:25).
- The automatic interpretation produces a Microsoft Word document that
can be edited and includes automated plots, available in both English
and French (01:41:35).
- The analysis method used is principal component analysis (PCA), a
simple multivariate analysis method that helps understand how variables
are related (01:42:27).
- In this analysis, an indicator variable is used to identify whether
a participant is male or female, without separating the data by gender
(01:42:49).
- PCA analyzes the rows of the data, which represent the participants,
and generates a plot showing the relationship between the participants
and the variables (01:43:21).
- The analysis shows that females are associated with Dimension One,
while males are associated with Dimension Two, with some variables
having a stronger association with one dimension than the other
(01:43:33).
- The number of dimensions is limited by the number of variables,
which is six in this case, and the analysis provides information on the
association between the variables and the dimensions (01:44:01).
- The variables used in the analysis include LGAN (length from
glabella to apex nasi), TFH (true facial height), and others, which are
correlated with each other (01:44:25).
- The analysis provides additional information, such as variable
percentage breakdowns and correlations, which can be accessed through a
summary report (01:45:22).
- FactoMineR integrates cluster analysis to help interpret the results
of principal component analysis and other analysis methods (01:45:51).
- The cluster analysis groups the individuals based on their
characteristics, with females and males forming separate clusters (01:46:17).
- The variables in the dataset are associated with face shape, with
Dimension 2 essentially distinguishing rounder faces from taller,
narrower ones (01:46:53).
- The clusters were made based on the person, using the full dataset
but projecting it down into the first two dimensions for visualization
(01:47:59).
- The two axes on the PCA graph represent the two PCA components,
which can be interpreted to understand how the variables are associated
with each other (01:48:58).
- The variables BAM and MFB are closely associated with each other,
while another measure is associated with Dimension 2, representing the
height of the face (01:50:07).
- The other measures are more associated with pure size, and in an
allometric study, the size dimension is often removed to focus on
subsequent dimensions (01:50:51).
- The importance of considering face shape and size is illustrated by
the example of wearing masks, where different shaped faces can affect
the fit of the mask (01:51:24).
- Different types of masks fit better or worse. The variables most
strongly associated with the first dimension, which is primarily
oriented around overall size, include BAM (breadth of the angulus
mandibulae) and MFB, while LGAN and TFH are more about shape (01:51:48).
- The association between variables is shown by the angles between
them, with wider angles indicating less association, because the
correlation between two variables corresponds to the cosine of the angle
(01:52:54).
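The link between angles and correlations can be checked numerically. The following is a minimal sketch in Python (not the R used in class), with made-up measurements rather than the class dataset and hypothetical helper names. It shows that the Pearson correlation of two variables equals the cosine of the angle between their mean-centered vectors.

```python
import math

def centered(values):
    """Subtract the mean so the vector is measured from its own average."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson(xs, ys):
    """Pearson correlation from the sample covariance and standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys) / (n - 1))
    return cov / (sx * sy)

def angle_degrees(xs, ys):
    """Angle between two variables, viewed as mean-centered vectors."""
    c = cosine(centered(xs), centered(ys))
    c = max(-1.0, min(1.0, c))  # guard against rounding just outside [-1, 1]
    return math.degrees(math.acos(c))

# Made-up illustrative measurements (assumption: not the class data).
width = [10.0, 12.0, 11.0, 14.0, 13.0]
height = [20.0, 23.0, 21.0, 27.0, 26.0]

r = pearson(width, height)
c = cosine(centered(width), centered(height))
print(f"correlation = {r:.4f}, cosine = {c:.4f}, "
      f"angle = {angle_degrees(width, height):.1f} degrees")
```

A narrow angle (cosine near 1) means a strong positive correlation, a right angle means no linear association, and an angle beyond 90 degrees means a negative correlation.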
- The number of correlations between variables grows rapidly: six
variables produce 15 pairwise correlations, and 50 variables produce
1,225, which is difficult to understand and manage (01:53:10).
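The count of distinct correlations among p variables is p(p - 1)/2, the number of ways to pick two variables out of p. A small Python sketch of that arithmetic, with a hypothetical function name:

```python
from math import comb

def n_pairwise_correlations(p):
    """Number of distinct correlations among p variables: p * (p - 1) / 2."""
    return comb(p, 2)

for p in (6, 50):
    print(p, "variables ->", n_pairwise_correlations(p), "correlations")
# 6 variables -> 15 correlations
# 50 variables -> 1225 correlations
```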
- To manage this, the correlations are “squished down” into a smaller
space that is easier to cope with. This is essentially what PCA
(principal component analysis) does: it reduces the variable space to a
more manageable level (01:53:51).
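To make the "squishing" concrete, here is a rough sketch in standard-library Python (the class itself works in R). It finds only the first principal component of a correlation matrix by power iteration, which is a simple substitute for the full eigendecomposition that PCA software performs. The data and function names are invented for illustration.

```python
import math

def correlation_matrix(rows):
    """Return the correlation matrix and the standardized data (z-scores)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    z = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]
    corr = [[sum(zi[a] * zi[b] for zi in z) / (n - 1) for b in range(p)]
            for a in range(p)]
    return corr, z

def mat_vec(matrix, vector):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def first_component(corr, iterations=500):
    """Power iteration: approximate the largest eigenvalue and its eigenvector."""
    p = len(corr)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = mat_vec(corr, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(corr, v)))
    return eigenvalue, v

# Made-up data: 6 people measured on 3 correlated variables (assumption).
rows = [
    [10.0, 20.0, 5.0],
    [12.0, 23.0, 6.0],
    [11.0, 21.0, 5.5],
    [14.0, 27.0, 7.5],
    [13.0, 26.0, 6.5],
    [15.0, 29.0, 8.0],
]

corr, z = correlation_matrix(rows)
eigenvalue, loadings = first_component(corr)
# Each person's score on the first component: one number instead of three.
scores = [sum(a * b for a, b in zip(zi, loadings)) for zi in z]
share = eigenvalue / len(corr)
print(f"first component explains {share:.1%} of the total variance")
```

With strongly correlated variables, one component carries most of the variance, so a single score per person summarizes what previously took several correlated measurements.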
- The concept of variable space is also referred to as “person space”
(01:54:07).
- Probability ellipses can be used to show the distribution of cases,
such as 95% of male and female cases falling within certain areas, and
can help identify overlap and non-overlap between groups (01:54:31).
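One common way to build such an ellipse, sketched here in Python under the assumption that a group's scores on the two dimensions are roughly bivariate normal, is to keep every point whose squared Mahalanobis distance from the group mean is below the 95% chi-square quantile with 2 degrees of freedom. The scores and function names below are made up for illustration and are not the class output.

```python
import math

def mean_and_cov(points):
    """Sample mean and 2x2 covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis_sq(point, mean, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det

def ellipse_threshold(level=0.95):
    """Chi-square quantile with 2 degrees of freedom: -2 * ln(1 - level)."""
    return -2.0 * math.log(1.0 - level)

def inside_ellipse(point, mean, cov, level=0.95):
    """True if the point lies inside the probability ellipse at this level."""
    return mahalanobis_sq(point, mean, cov) <= ellipse_threshold(level)

# Made-up scores on Dimension 1 and Dimension 2 for one group (assumption).
group_scores = [(-1.2, 0.3), (-0.8, -0.5), (-1.5, 0.9),
                (-0.4, 0.1), (-1.0, -0.7), (-0.6, 0.6)]

group_mean, group_cov = mean_and_cov(group_scores)
print("95% threshold:", round(ellipse_threshold(0.95), 3))
print("group mean inside:", inside_ellipse(group_mean, group_mean, group_cov))
print("far point inside:", inside_ellipse((10.0, 10.0), group_mean, group_cov))
```

Computing one ellipse per group and checking which cases fall inside both gives a simple picture of where the groups overlap.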
- A Shiny app can be used to generate reports and provides graphical
options, allowing users to select the language, graphs, and other
features they want. It can also perform clustering and generate
automated reports (01:56:01).
- The report generated by the software will contain a lot of
information, giving a good idea of what the analysis entails, and it’s
recommended to let the software write the report to get a better
understanding (01:57:23).
- It’s suggested not to trust the software completely for analysis,
but rather use it to get suggestions and help, as it’s very good at
providing useful insights and ideas (01:57:40).
- The software can take graphs and allow users to alter and change
them as needed, making it a really cool tool for analysis (01:58:07).
- The software may take some time to generate reports, and it’s not
uncommon for it to time out, but it’s usually fine and can be run again
(01:58:25).
- Multivariate statistics are used to manage large datasets that can’t
be handled manually, and the software is helpful in this regard (01:59:12).
- FactoMineR is a recommended package for the software, as its
developers have done excellent work and it has a lot of useful options
and material (01:59:37).
- FactoMineR has a plugin for R Commander and can be used with Shiny,
making it a versatile tool (01:59:49).
- The software is generally well-regarded and has a lot of useful
features, making it a good choice for analysis (02:00:05).
- It’s recommended to try the software out personally, as it may work
better with fewer tabs open in the browser (02:00:59).
- The instructor plans to upload the R file used in the class (02:02:28).
- The instructor usually provides the R file in advance, but it might
not be the final version, and it can be found under the “file” section,
often linked in the schedule (02:03:08).
- The class that will be meeting in person will also be available
online, allowing students to listen and participate remotely if they
cannot attend in person (02:03:45).
- Students who want to attend in person can still do so even when the
class is also being offered virtually, but the in-person session will
not be in a computer lab (02:04:18).
- The in-person class is currently held in the Ed Edge educational
psychology conference room, as the instructor found that using computer
labs was problematic due to issues with downloading packages (02:04:34).
- Students are encouraged to bring their laptops to the in-person
class, but if they do not have one, they can watch along with someone
who does or watch the class online later (02:05:03).
- The class recording will be available on the video channel after a
couple of hours of processing, and the link will be shared for those who
want to watch it again or review specific parts (02:06:47).
- The next class will be held in the Grad Center, room 3204, and
students who cannot physically attend are encouraged to participate
remotely (02:07:08).
- The next class will cover the topic of proximities and distances (02:07:35).
- There are two assigned readings for the next class: an article on
distances and another article that applies the concept of distances,
co-authored by the instructor (02:07:43).
- The instructor rarely assigns their own work but made an exception
for this article, which was written with a few graduate students a
couple of years ago during the COVID-19 pandemic (02:07:48).
- The instructor will contact Peter separately and bids farewell to
the rest of the class (02:08:24).