Amoud University

Abstract

This primer provides an overview of factor analysis in research, covering the meaning and assumptions of factor analysis, as well as the differences between exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). The procedure for conducting factor analysis is explained, with a focus on the role of the correlation matrix and a general model of the correlation matrix of individual variables. The paper covers methods for extracting factors, including principal component analysis (PCA), and criteria for determining the number of factors to extract, such as comprehensibility, the Kaiser criterion, variance-explained criteria, Cattell’s scree plot, and Horn’s parallel analysis (PA). The meaning and interpretation of communality and eigenvalues are discussed, as well as factor loadings and rotation methods such as varimax. The paper also covers the meaning and interpretation of factor scores and their use in subsequent analyses. The R software is used throughout the paper to provide reproducible examples and code for conducting factor analysis.

Introduction

Factor analysis is a statistical technique commonly used in research to identify underlying dimensions or constructs that explain the variability among a set of observed variables. It is often used to reduce the complexity of a dataset by summarizing a large number of variables into a smaller set of factors that are easier to understand and analyze. Factor analysis is widely used in fields such as psychology, education, marketing, and social sciences to explore the relationships between variables and to identify underlying latent constructs.

In this tutorial paper, we will provide an overview of factor analysis, including its meaning and assumptions, the differences between exploratory factor analysis (EFA) and confirmatory factor analysis (CFA), and the procedure for conducting factor analysis. We will also cover the role of the correlation matrix and a general model of the correlation matrix of individual variables.

The paper will discuss methods for extracting factors, including principal component analysis (PCA), and criteria for determining the number of factors to extract, such as comprehensibility, the Kaiser criterion, variance-explained criteria, Cattell’s scree plot, and Horn’s parallel analysis (PA). The meaning and interpretation of communality and eigenvalues will be discussed, as well as factor loadings and rotation methods such as varimax.

Finally, the paper will cover the meaning and interpretation of factor scores and their use in subsequent analyses. The R software will be used throughout the paper to provide reproducible examples and code for conducting factor analysis. By the end of this tutorial paper, readers will have a better understanding of the fundamentals of factor analysis and how to apply it in their research.

Module I: Factor Analysis in Research

Meaning of Factor Analysis

Factor analysis is a statistical method that is widely used in research to identify the underlying factors that explain the variations in a set of observed variables. The method is particularly useful in fields such as psychology, sociology, marketing, and education, where researchers often deal with complex datasets that contain many variables. The basic idea behind factor analysis is to identify the common factors that underlie a set of observed variables. By identifying these factors, researchers can reduce the number of variables they need to analyze, simplify the data, and gain insights into the underlying structure of the data.

Factor analysis can be used in two main ways: exploratory and confirmatory.

  1. Exploratory factor analysis is used when the researcher does not have a priori knowledge of the underlying factors and wants to identify them from the data.

  2. Confirmatory factor analysis, on the other hand, is used when the researcher has a specific hypothesis about the underlying factors and wants to test this hypothesis using the data.

Factor analysis has several advantages over other statistical methods. It can help researchers identify the most important variables in a dataset, reduce the number of variables they need to analyze, and provide insights into the relationships between variables. However, it also has some limitations and assumptions that must be taken into account when applying the method.

In this primer or tutorial paper, we will provide an overview of factor analysis, its applications in research, and the steps involved in performing factor analysis. We will also discuss the assumptions and limitations of the method, as well as methods for interpreting and visualizing the results. Finally, we will provide several examples of factor analysis in different fields of research, illustrating how the method can be used to extract meaningful information from complex datasets.

Assumptions of Factor Analysis

Factor analysis is a statistical technique that is used to identify the underlying factors that explain the correlations between a set of observed variables. In order to obtain valid results from factor analysis, certain assumptions must be met. Here are some of the key assumptions of factor analysis:

  1. Normality: Factor analysis assumes that the data is normally distributed. If the data is not normally distributed, then the results of the analysis may be biased or unreliable. Normality can be checked using statistical tests such as the Shapiro-Wilk test or the Kolmogorov-Smirnov test.

  2. Linearity: Factor analysis assumes that the relationships between the observed variables and the underlying factors are linear. If the relationships are non-linear, then the results of the analysis may be biased or unreliable.

  3. Sample size: Factor analysis assumes that the sample size is sufficient to obtain reliable estimates of the factor model. A rule of thumb is to have at least 10 observations per variable, although some researchers recommend a larger sample size.

  4. Absence of extreme multicollinearity: Factor analysis requires that the observed variables be correlated, but not so highly correlated that they are redundant. Extreme multicollinearity or singularity (near-perfect correlations among two or more variables) leads to unstable estimates of the factor model and can make the correlation matrix difficult or impossible to invert.

  5. Adequate factor loading: Factor analysis assumes that there are strong associations (i.e., factor loadings) between the observed variables and the underlying factors. Weak factor loadings may indicate that the observed variables are not good indicators of the underlying factors, or that there are too few factors in the model.

In summary, factor analysis is a powerful technique for identifying the underlying factors that explain the correlations between a set of observed variables. However, the assumptions of normality, linearity, sample size, absence of multicollinearity, and adequate factor loading must be met in order to obtain valid results.
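As a brief illustration of how some of these assumptions might be screened in R (a sketch using the built-in iris data purely for convenience, and assuming the psych package is installed), the following code checks normality, multicollinearity, sampling adequacy, and factorability:

# Load the psych package for the KMO measure and Bartlett's test
library(psych)

# Four numeric iris variables, used purely for illustration
vars <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# Normality: Shapiro-Wilk p-value for each variable (p < .05 suggests non-normality)
sapply(vars, function(x) shapiro.test(x)$p.value)

# Multicollinearity: look for extremely high pairwise correlations (e.g., |r| > 0.9)
round(cor(vars), 2)

# Sampling adequacy: Kaiser-Meyer-Olkin measure (values above about 0.6 are usually acceptable)
KMO(cor(vars))

# Factorability: Bartlett's test of sphericity (a significant result supports factorability)
cortest.bartlett(cor(vars), n = nrow(vars))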

EFA and CFA Factor Analysis Procedures

Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA) are two types of factor analysis procedures that are used to identify the underlying factors that explain the correlations between a set of observed variables.

  1. Exploratory Factor Analysis (EFA) is used when there is no prior theory about the underlying factors, and the goal is to identify the factors that explain the correlations between variables. In EFA, the researcher starts with a set of observed variables and uses statistical techniques to identify the most important factors that explain the correlations between them. The researcher does not have any preconceived notion about the number of factors or how they are related to each other. The aim is to identify the underlying structure of the data and to reduce the number of variables that need to be analyzed.

  2. Confirmatory Factor Analysis (CFA), on the other hand, is used when there is a specific theory about the underlying factors, and the goal is to test this theory using the data. In CFA, the researcher starts with a pre-specified model that specifies the number of factors and how they are related to each other. The aim is to test the theory and to determine whether the observed data fit the model. The researcher tests the model using a variety of statistical techniques and evaluates the goodness-of-fit of the model.

EFA is a data-driven, exploratory technique used to identify the underlying structure of the data and to reduce the number of variables that need to be analyzed, without any preconceived notion about the number of factors or how they are related. It involves several steps, such as selecting an appropriate method for factor extraction, determining the number of factors to retain, and choosing a method for factor rotation, with the overall goal of identifying the simplest factor structure that best explains the data.

CFA, by contrast, is a theory-driven, confirmatory technique used to confirm or refute a pre-specified theory. The researcher first specifies a model that states the proposed number of factors and their relationships, and then evaluates how well the observed data fit that model using statistics such as the chi-square test, the comparative fit index (CFI), the Tucker-Lewis index (TLI), and the root mean square error of approximation (RMSEA).
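As a minimal sketch of how these fit indices can be obtained in R (assuming the lavaan package and using its built-in HolzingerSwineford1939 example data; the two-factor model below is purely illustrative):

library(lavaan)

# Illustrative two-factor measurement model using lavaan's built-in example data
hs_model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
'

fit <- cfa(hs_model, data = HolzingerSwineford1939)

# Extract the chi-square test and approximate fit indices (CFI, TLI, RMSEA)
fitMeasures(fit, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea"))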

Both EFA and CFA require the researcher to consider several assumptions of factor analysis, such as normality, linearity, absence of multicollinearity, and adequate factor loading. Violations of these assumptions can result in biased or unreliable results. Therefore, it is important to conduct appropriate data screening, model testing, and model modification to ensure that the assumptions are met.

In summary, EFA is an exploratory technique used to identify the underlying factors that explain the correlations between variables, while CFA is a confirmatory technique used to test a pre-specified theory about the underlying factors and their relationships. Both procedures require careful consideration of the assumptions of factor analysis and appropriate statistical techniques for model evaluation.

Comparison between EFA and CFA

\[\begin{array}{|l|l|l|}
\hline
\text{Aspect} & \text{EFA} & \text{CFA} \\ \hline
\text{Purpose} &
\begin{array}{l}\text{To explore the underlying structure of a set}\\ \text{of observed variables and identify the number}\\ \text{of factors that explain the variability among them.}\end{array} &
\begin{array}{l}\text{To test a pre-specified model of the relationships}\\ \text{between observed variables and latent constructs,}\\ \text{and confirm whether the model fits the data.}\end{array} \\ \hline
\text{Number of factors} &
\begin{array}{l}\text{The number of factors is determined by the}\\ \text{data and may not be known a priori.}\end{array} &
\begin{array}{l}\text{The number of factors is determined a priori}\\ \text{and specified in the model.}\end{array} \\ \hline
\text{Factor loadings} &
\begin{array}{l}\text{The factor loadings are estimated during the}\\ \text{analysis and may vary depending on the sample}\\ \text{or the specific variables included in the analysis.}\end{array} &
\begin{array}{l}\text{The factor loadings are fixed in the model and}\\ \text{assumed to be constant across different samples}\\ \text{or variables.}\end{array} \\ \hline
\text{Model fit} &
\begin{array}{l}\text{The model fit is not assessed in EFA, as there}\\ \text{is no pre-specified model to test.}\end{array} &
\begin{array}{l}\text{The model fit is assessed by comparing the}\\ \text{observed covariance matrix with the model-implied}\\ \text{covariance matrix, using fit indices such as}\\ \text{chi-square, RMSEA, CFI, and TLI.}\end{array} \\ \hline
\text{Assumptions} &
\begin{array}{l}\text{EFA assumes that the observed variables have a}\\ \text{linear relationship with the latent factors, and}\\ \text{that the errors are uncorrelated and have equal}\\ \text{variances.}\end{array} &
\begin{array}{l}\text{CFA assumes that the observed variables have a}\\ \text{linear relationship with the latent factors, that}\\ \text{the errors are uncorrelated and have equal variances,}\\ \text{and that the factor loadings are fixed and known}\\ \text{a priori.}\end{array} \\ \hline
\text{Interpretation} &
\begin{array}{l}\text{EFA is exploratory and can be used to generate}\\ \text{hypotheses about the underlying structure of the}\\ \text{observed variables, but the interpretation of the}\\ \text{factors may be subjective and influenced by the}\\ \text{researcher's judgment.}\end{array} &
\begin{array}{l}\text{CFA is confirmatory and can be used to test}\\ \text{specific hypotheses about the relationships between}\\ \text{the observed variables and the latent factors, but}\\ \text{the interpretation of the results is limited to the}\\ \text{pre-specified model.}\end{array} \\ \hline
\text{Applications} &
\begin{array}{l}\text{EFA is commonly used in hypothesis generation,}\\ \text{scale development, and data reduction, and can be}\\ \text{used in conjunction with other techniques such as}\\ \text{cluster analysis and discriminant analysis.}\end{array} &
\begin{array}{l}\text{CFA is commonly used in model testing, construct}\\ \text{validation, and cross-cultural research, and can be}\\ \text{used to compare different groups or time periods.}\end{array} \\ \hline
\end{array}\]

Note that EFA and CFA are both types of factor analysis, but they differ in their goals, assumptions, and methods. EFA is used to explore the underlying structure of a set of observed variables, while CFA is used to test a specific model of the relationships between the observed variables and latent factors. EFA allows the number of factors to be determined by the data, while CFA requires the number of factors to be specified a priori. EFA allows the factor loadings to vary across different samples or variables, while CFA assumes that the factor loadings are fixed and known a priori. EFA is exploratory and can be used to generate hypotheses, while CFA is confirmatory and can be used to test specific hypotheses.

Real-life Examples

The following real-life examples and research titles illustrate the differences between exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). The choice between EFA and CFA depends on the research question, the goals of the analysis, and the availability of prior knowledge or theoretical frameworks: EFA is useful for exploratory analyses where the underlying structure of the observed variables is unknown or needs to be better understood, while CFA is useful for confirmatory analyses where a pre-specified model of the relationships between the observed variables and latent constructs is available or needs to be tested.

1. Education sector:

Real-life example: A researcher is interested in understanding the factors that influence student engagement in online learning. They collect data on various variables such as perceived usefulness, ease of use, and social presence.

Research title for EFA: “Exploring the underlying factors of student engagement in online learning: An exploratory factor analysis approach”

Research title for CFA: “Testing a model of student engagement in online learning: A confirmatory factor analysis approach”

Explanation: In this example, the researcher may use EFA to explore the underlying structure of the observed variables related to student engagement in online learning and generate hypotheses about the relationships between the observed variables and latent factors. Alternatively, the researcher may use CFA to test a pre-specified model of the relationships between the observed variables and latent constructs related to student engagement in online learning. In this case, the research title would reflect the approach used in the analysis.

2. Psychology sector:

Real-life example: A psychologist is interested in understanding the factors that contribute to anxiety in adolescents. They collect data on various variables such as stress, self-esteem, and social support.

Research title for EFA: “Identifying the underlying factors of anxiety in adolescents: An exploratory factor analysis approach”

Research title for CFA: “Testing a model of anxiety in adolescents: A confirmatory factor analysis approach”

Explanation: In this example, the psychologist may use EFA to identify the underlying structure of the observed variables related to anxiety in adolescents and generate hypotheses about the relationships between the observed variables and latent factors. Alternatively, the psychologist may use CFA to test a pre-specified model of the relationships between the observed variables and latent constructs related to anxiety in adolescents. In this case, the research title would reflect the approach used in the analysis.

3. Law sector:

Real-life example: A law firm is interested in understanding the factors that contribute to job satisfaction among their employees. They collect data on various variables such as work-life balance, compensation, and career advancement opportunities.

Research title for EFA: “Exploring the underlying factors of job satisfaction among law firm employees: An exploratory factor analysis approach”

Research title for CFA: “Testing a model of job satisfaction among law firm employees: A confirmatory factor analysis approach”

Explanation: In this example, the law firm may use EFA to explore the underlying structure of the observed variables related to job satisfaction among their employees and generate hypotheses about the relationships between the observed variables and latent factors. Alternatively, the law firm may use CFA to test a pre-specified model of the relationships between the observed variables and latent constructs related to job satisfaction among their employees. In this case, the research title would reflect the approach used in the analysis.

4. Medicine sector:

Real-life example: A physician is interested in understanding the factors that contribute to patient satisfaction with their healthcare experience. They collect data on various variables such as communication with healthcare providers, access to care, and quality of care.

Research title for EFA: “Identifying the underlying factors of patient satisfaction with healthcare: An exploratory factor analysis approach”

Research title for CFA: “Testing a model of patient satisfaction with healthcare: A confirmatory factor analysis approach”

Explanation: In this example, the physician may use EFA to identify the underlying structure of the observed variables related to patient satisfaction with healthcare and generate hypotheses about the relationships between the observed variables and latent factors. Alternatively, the physician may use CFA to test a pre-specified model of the relationships between the observed variables and latent constructs related to patient satisfaction with healthcare. In this case, the research title would reflect the approach used in the analysis.

5. Engineering sector:

Real-life example: A company is interested in understanding the factors that contribute to customer satisfaction with their products. They collect data on various variables such as product quality, design, and reliability.

Research title for EFA: “Exploring the underlying factors of customer satisfaction with engineering products: An exploratory factor analysis approach”

Research title for CFA: “Developing and validating a model of customer satisfaction with engineering products: A confirmatory factor analysis approach”

Explanation: In this example, the company may use EFA to explore the underlying structure of the observed variables related to customer satisfaction with their engineering products and generate hypotheses about the relationships between the observed variables and latent factors. They may then use the results of the EFA to develop a new customer satisfaction survey. Alternatively, the company may use CFA to test a pre-specified model of the relationships between the observed variables and latent constructs related to customer satisfaction with their engineering products, and validate the new customer satisfaction survey. In this case, the research title would reflect the approach used in the analysis.

6. Public health sector:

Real-life example: A public health researcher is interested in understanding the factors that contribute to health-related quality of life among older adults. They collect data on various variables such as physical functioning, mental health, and social support.

Research title for EFA: “Exploring the underlying factors of health-related quality of life among older adults: An exploratory factor analysis approach”

Research title for CFA: “Testing a model of health-related quality of life among older adults: A confirmatory factor analysis approach”

Explanation: In this example, the public health researcher may use EFA to explore the underlying structure of the observed variables related to health-related quality of life among older adults and generate hypotheses about the relationships between the observed variables and latent factors. Alternatively, the researcher may use CFA to test a pre-specified model of the relationships between the observed variables and latent constructs related to health-related quality of life among older adults. In this case, the research title would reflect the approach used in the analysis.

7. Finance sector:

Real-life example: A financial analyst is interested in understanding the factors that contribute to stock prices. They collect data on various variables such as earnings per share, market capitalization, and price-earnings ratio.

Research title for EFA: “Exploring the underlying factors of stock prices: An exploratory factor analysis approach”

Research title for CFA: “Testing a model of stock prices: A confirmatory factor analysis approach”

Explanation: In this example, the financial analyst may use EFA to explore the underlying structure of the observed variables related to stock prices and generate hypotheses about the relationships between the observed variables and latent factors. Alternatively, the analyst may use CFA to test a pre-specified model of the relationships between the observed variables and latent constructs related to stock prices. In this case, the research title would reflect the approach used in the analysis.

8. Project Management sector:

Real-life example: A project manager is interested in understanding the factors that contribute to project success. They collect data on various variables such as project scope, budget, and stakeholder engagement.

Research title for EFA: “Exploring the underlying factors of project success: An exploratory factor analysis approach”

Research title for CFA: “Testing a model of project success: A confirmatory factor analysis approach”

Explanation: In this example, the project manager may use EFA to explore the underlying structure of the observed variables related to project success and generate hypotheses about the relationships between the observed variables and latent factors. Alternatively, the manager may use CFA to test a pre-specified model of the relationships between the observed variables and latent constructs related to project success. In this case, the research title would reflect the approach used in the analysis.

9. Monitoring and Evaluation (M&E) sector:

Real-life example: An M&E specialist is interested in understanding the factors that contribute to program effectiveness. They collect data on various variables such as program inputs, activities, and outcomes.

Research title for EFA: “Identifying the underlying factors of program effectiveness: An exploratory factor analysis approach”

Research title for CFA: “Testing a model of program effectiveness: A confirmatory factor analysis approach”

Explanation: In this example, the M&E specialist may use EFA to identify the underlying structure of the observed variables related to program effectiveness and generate hypotheses about the relationships between the observed variables and latent factors. Alternatively, the specialist may use CFA to test a pre-specified model of the relationships between the observed variables and latent constructs related to program effectiveness. In this case, the research title would reflect the approach used in the analysis.

10. Data Science sector:

Real-life example: A data scientist is interested in understanding the factors that contribute to customer churn. They collect data on various variables such as customer demographics, usage patterns, and customer service interactions.

Research title for EFA: “Exploring the underlying factors of customer churn: An exploratory factor analysis approach”

Research title for CFA: “Testing a model of customer churn: A confirmatory factor analysis approach”

Explanation: In this example, the data scientist may use EFA to explore the underlying structure of the observed variables related to customer churn and generate hypotheses about the relationships between the observed variables and latent factors. Alternatively, the scientist may use CFA to test a pre-specified model of the relationships between the observed variables and latent constructs related to customer churn. In this case, the research title would reflect the approach used in the analysis.

Overall, the choice between EFA and CFA depends on the research question, the goals of the analysis, and the availability of prior knowledge or theoretical frameworks, regardless of the sector in which the research is being conducted. EFA is useful for exploratory analyses where the underlying structure of the observed variables is unknown or needs to be better understood, while CFA is useful for confirmatory analyses where a pre-specified model of the relationships between the observed variables and latent constructs is available or needs to be tested.

Module II: Correlation Matrix

Role of a Correlation Matrix in Factor Analysis

The correlation matrix is a critical component in factor analysis as it provides information about the relationships between the observed variables. In factor analysis, the goal is to identify the underlying factors that explain the correlations between the observed variables. The correlation matrix provides the information needed to identify the factors.

Factor analysis assumes that the observed variables are correlated because they share common underlying factors. The correlation matrix provides information about the strength and direction of these correlations. The strength of the correlation between two variables indicates how closely they are related, while the sign of the correlation (positive or negative) indicates the direction of the relationship. A positive correlation indicates that the variables tend to increase or decrease together, while a negative correlation indicates that the variables tend to move in opposite directions. Factor analysis uses the correlation matrix to estimate the factor loadings, which represent the degree to which each observed variable is associated with each underlying factor. The factor loadings are used to construct the factor structure, which represents the underlying factors and their relationships. The factor structure can be rotated to simplify and clarify the interpretation of the factors.

In summary, the correlation matrix is a key component in factor analysis as it provides the information needed to identify the underlying factors that explain the correlations between the observed variables. The factor loadings are estimated using the correlation matrix, and the factor structure is constructed based on the estimated loadings. The correlation matrix, therefore, plays a critical role in the factor analysis process.
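As a minimal sketch of this idea in R (using the numeric iris variables purely for illustration, and assuming the psych package), factor loadings can be estimated directly from a correlation matrix; when only a correlation matrix is supplied, the number of observations must be passed via the n.obs argument:

library(psych)

# Correlation matrix of the four numeric iris variables
R <- cor(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")])

# Estimate a one-factor solution directly from the correlation matrix
fa_from_R <- fa(r = R, nfactors = 1, n.obs = nrow(iris), fm = "minres", rotate = "none")
fa_from_R$loadings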

A general model of a correlation matrix of individual variables

A correlation matrix is a square matrix that shows the correlation coefficients between a set of individual variables. The general model of a correlation matrix can be expressed as follows:

\[\begin{equation} C = \begin{bmatrix} c_{1,1} & c_{1,2} & c_{1,3} & \cdots & c_{1,k} \\ c_{2,1} & c_{2,2} & c_{2,3} & \cdots & c_{2,k} \\ c_{3,1} & c_{3,2} & c_{3,3} & \cdots & c_{3,k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ c_{k,1} & c_{k,2} & c_{k,3} & \cdots & c_{k,k} \end{bmatrix} \end{equation}\]

where \(C\) is the correlation matrix, \(k\) is the number of individual variables, and \(c_{i,j}\) is the correlation coefficient between the \(i\)-th and \(j\)-th variables.

The diagonal elements of the correlation matrix represent the correlations between each variable and itself, which are always equal to 1. The off-diagonal elements represent the correlations between different pairs of variables. The correlation coefficient can range from -1 to 1, where a value of -1 indicates a perfect negative correlation, 0 indicates no correlation, and 1 indicates a perfect positive correlation.

The correlation matrix can be used for various purposes, such as identifying clusters of correlated variables, detecting multicollinearity, and exploring the underlying factor structure using factor analysis. It is important to note that the correlation matrix assumes that the variables are continuous, linearly related, and normally distributed. Violations of these assumptions can affect the validity and reliability of the correlation matrix and its interpretation.

Interpreting the correlation matrix

Interpreting a correlation matrix involves examining the strength and direction of the correlations between pairs of variables. The correlation matrix provides a summary of the relationships between the variables, and understanding these relationships is important for many statistical analyses, including regression, factor analysis, and structural equation modeling.

The strength of the correlation is indicated by the absolute value of the correlation coefficient. A correlation coefficient of 0 indicates no relationship between the variables, while a correlation coefficient of 1 (or -1) indicates a perfect positive (or negative) correlation. Correlation coefficients between 0 and 1 (or 0 and -1) indicate varying degrees of positive (or negative) correlation.

The direction of the correlation is indicated by the sign of the correlation coefficient. A positive correlation indicates that the variables tend to increase or decrease together, while a negative correlation indicates that the variables tend to move in opposite directions.

It is also important to consider the context of the variables being analyzed when interpreting the correlation matrix. For example, a correlation of 0.3 between two variables may be considered strong in one context and weak in another context.

Additionally, the correlation matrix does not imply causation, and caution should be exercised when interpreting correlations as evidence of causation. In some cases, it may be necessary to adjust the correlation matrix before interpreting it. For example, if the variables have different scales or units of measurement, it may be necessary to standardize the variables before calculating the correlation coefficients. Additionally, outliers or missing data may need to be addressed before interpreting the correlation matrix.
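For instance (a base-R sketch in which the missing value is introduced only for illustration), variables can be standardized with scale(), and missing data can be handled through the use argument of cor():

# Variables measured on very different scales (from the built-in mtcars data)
x <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Standardize to mean 0 and standard deviation 1
# (Pearson correlations are unchanged by standardization, but many other analyses are not)
z <- scale(x)

# Introduce a single missing value purely for illustration
x[1, "hp"] <- NA

# Compute correlations using all pairwise-complete observations
cor(x, use = "pairwise.complete.obs")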

In summary, interpreting the correlation matrix involves examining the strength and direction of the correlations between pairs of variables and considering the context of the variables being analyzed. It is important to remember that the correlation matrix does not imply causation and that adjustments may be necessary before interpreting the matrix.

Example using R Code

The following R code computes a correlation matrix and interprets the results using a built-in dataset in R:

# Load the built-in iris dataset in R
data(iris)

# Compute the correlation matrix between the variables Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width
cor_matrix <- cor(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")])

# Print the correlation matrix
print(cor_matrix)
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
# Interpret the correlation matrix
# The diagonal values are all 1, as they represent the correlation between each variable and itself
# The off-diagonal values show the pairwise correlations between the variables
# Sepal.Length and Petal.Length have a strong positive correlation of 0.87
# Sepal.Length and Petal.Width also have a strong positive correlation of 0.82
# Sepal.Length and Sepal.Width have a weak negative correlation of -0.12
# Sepal.Width and Petal.Length have a moderate negative correlation of -0.43
# Sepal.Width and Petal.Width have a moderate negative correlation of -0.37
# Petal.Length and Petal.Width have a strong positive correlation of 0.96

In this example, we loaded the built-in iris dataset in R and computed the correlation matrix between the variables Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width using the cor() function. We then printed the correlation matrix and interpreted the results by examining the pairwise correlations between the variables.
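As an optional follow-up (a sketch assuming the psych package is installed), the same correlation matrix can be visualized, which often makes clusters of related variables easier to spot:

library(psych)

# Heat-map style display of the correlation matrix, with coefficients printed in each cell
corPlot(cor_matrix, numbers = TRUE, main = "Correlations among the iris measurements")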

Module III: Factor Extraction

Introduction

Factor extraction methods are used in factor analysis to identify the underlying factors that explain the correlations between a set of observed variables. There are several methods for extracting factors, including:

Principal Component Analysis (PCA): PCA is a data reduction technique that extracts factors based on the variance in the observed variables. PCA identifies factors that account for the maximum amount of variance in the data and is useful when the goal is to reduce the number of variables in the analysis.

Common Factor Analysis: This method extracts factors based on the shared (common) variance among the observed variables. It assumes that the observed variables are influenced by a smaller number of common factors, which are responsible for the correlations among the variables. (Note that the abbreviation CFA is used elsewhere in this paper for confirmatory factor analysis.)

Maximum Likelihood (ML): ML is a statistical technique that estimates the parameters of a statistical model by maximizing the likelihood function. ML is commonly used in CFA and Structural Equation Modeling (SEM) to estimate the factor loadings and other model parameters.

Principal Axis Factoring (PAF): PAF is a method that extracts factors based on the common variance among the observed variables. PAF assumes that each variable contributes to the factor structure in proportion to its common variance with the other variables.

Unweighted Least Squares (ULS): ULS is a method that extracts factors based on the correlations among the observed variables. ULS is commonly used in CFA and SEM to estimate the factor loadings and other model parameters.

Maximum Variance (MV): MV is a method that extracts factors based on the maximum variance in the observed variables. MV is similar to PCA but is less commonly used in factor analysis.

The choice of factor extraction method depends on the research question, the nature and structure of the data, and the assumptions underlying the method. It is important to carefully consider the strengths and limitations of each method and to select a method that is appropriate for the research question and the data at hand.

Example 1

The following example demonstrates the different factor extraction methods using the “bfi” dataset in the “psych” package in R:

#First, we will load the "psych" package and the "bfi" dataset:
library(psych)
## Warning: package 'psych' was built under R version 4.1.3
data(bfi)
#Next, we will select a subset of variables from the "bfi" dataset to use for the factor analysis:
bfi_subset <- bfi[, c("A1", "A2", "A3", "A4", "A5")]

#Principal Component Analysis (PCA)
# To conduct PCA, we will use the "principal" function from the "psych" package:
pca_model <- principal(bfi_subset, nfactors = 2, rotate = "varimax")
#In this example, we are specifying two factors ("nfactors = 2") and using the "varimax" rotation method to simplify the factor structure.
#To interpret the PCA results, we can use the following code:
print(pca_model)
## Principal Components Analysis
## Call: principal(r = bfi_subset, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##      RC1   RC2   h2    u2 com
## A1 -0.06  0.95 0.90 0.098 1.0
## A2  0.60 -0.50 0.61 0.394 1.9
## A3  0.76 -0.28 0.65 0.351 1.3
## A4  0.72  0.03 0.51 0.487 1.0
## A5  0.76 -0.11 0.59 0.410 1.0
## 
##                        RC1  RC2
## SS loadings           2.02 1.24
## Proportion Var        0.40 0.25
## Cumulative Var        0.40 0.65
## Proportion Explained  0.62 0.38
## Cumulative Proportion 0.62 1.00
## 
## Mean item complexity =  1.3
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.13 
##  with the empirical chi square  975.4  with prob <  4e-214 
## 
## Fit based upon off diagonal values = 0.86
#This will print out the factor loadings for each variable and the communalities (the proportion of variance in each variable explained by the factors).

#In this example, the PCA results suggest that two components account for about 65% of the variance in the five variables. The first component (RC1) has high loadings on "A2", "A3", "A4", and "A5", while the second component (RC2) is dominated by "A1" (loading 0.95). The communalities (h2) range from 0.51 to 0.90, indicating that a moderate-to-high proportion of the variance in each variable is explained by the two components.

#Exploratory Factor Analysis (EFA)
#To conduct EFA, we will use the "fa" function from the "psych" package:
efa_model <- fa(bfi_subset, nfactors = 2, rotate = "varimax")
#In this example, we are specifying two factors ("nfactors = 2") and using the "varimax" rotation method to simplify the factor structure.

#To interpret the EFA results, we can use the following code:
print(efa_model)
## Factor Analysis using method =  minres
## Call: fa(r = bfi_subset, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##      MR1   MR2   h2   u2 com
## A1 -0.15 -0.41 0.19 0.81 1.3
## A2  0.37  0.69 0.61 0.39 1.5
## A3  0.67  0.36 0.57 0.43 1.5
## A4  0.40  0.25 0.23 0.77 1.7
## A5  0.64  0.22 0.45 0.55 1.2
## 
##                        MR1  MR2
## SS loadings           1.18 0.88
## Proportion Var        0.24 0.18
## Cumulative Var        0.24 0.41
## Proportion Explained  0.57 0.43
## Cumulative Proportion 0.57 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  10  and the objective function was  0.93 with Chi Square of  2604.19
## The degrees of freedom for the model are 1  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  2764 with the empirical chi square  4.96  with prob <  0.026 
## The total number of observations was  2800  with Likelihood Chi Square =  6.62  with prob <  0.01 
## 
## Tucker Lewis Index of factoring reliability =  0.978
## RMSEA index =  0.045  and the 90 % confidence intervals are  0.018 0.08
## BIC =  -1.32
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    MR1  MR2
## Correlation of (regression) scores with factors   0.76 0.72
## Multiple R square of scores with factors          0.58 0.51
## Minimum correlation of possible factor scores     0.17 0.03
#This will print out the factor loadings for each variable and the communalities (the proportion of variance in each variable explained by the factors).
#In this example, the EFA results suggest that two factors explain about 41% of the variance in the five variables. The first factor (MR1) has high loadings on "A3" and "A5" and moderate loadings on "A2" and "A4", while the second factor (MR2) has a high loading on "A2" and a negative loading on "A1". The communalities (h2) range from 0.19 to 0.61, indicating that only a modest-to-moderate proportion of the variance in each variable is explained by the two factors.

## Confirmatory Factor Analysis (CFA)

#To conduct CFA, we first need to specify a theoretical model that specifies how the variables are related to the factors. In this example, we will specify a model in which all five variables load onto two factors, with the first factor ("F1") defined by "A1", "A2", "A3", and "A4", and the second factor ("F2") defined by "A5". We will use the "cfa" function from the "lavaan" package to estimate the model:

library(lavaan)
## Warning: package 'lavaan' was built under R version 4.1.3
## This is lavaan 0.6-15
## lavaan is FREE software! Please report any bugs.
## 
## Attaching package: 'lavaan'
## The following object is masked from 'package:psych':
## 
##     cor2cov
cfa_model <- '
F1 =~ A1 + A2 + A3 + A4
F2 =~ A5
'

cfa_fit <- cfa(cfa_model, data = bfi_subset)


#To interpret the CFA results, we can use the following code:

summary(cfa_fit)
## lavaan 0.6.15 ended normally after 32 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        10
## 
##                                                   Used       Total
##   Number of observations                          2709        2800
## 
## Model Test User Model:
##                                                       
##   Test statistic                                86.696
##   Degrees of freedom                                 5
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   F1 =~                                               
##     A1                1.000                           
##     A2               -1.465    0.090  -16.310    0.000
##     A3               -1.880    0.113  -16.696    0.000
##     A4               -1.358    0.093  -14.626    0.000
##   F2 =~                                               
##     A5                1.000                           
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   F1 ~~                                               
##     F2               -0.418    0.028  -14.837    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .A1                1.693    0.048   34.915    0.000
##    .A2                0.784    0.029   27.443    0.000
##    .A3                0.714    0.035   20.314    0.000
##    .A4                1.694    0.051   33.277    0.000
##    .A5                0.000                           
##     F1                0.279    0.031    8.870    0.000
##     F2                1.591    0.043   36.804    0.000
#This will print out the unstandardized factor loadings, their standard errors and z-values, and the chi-square test of model fit.
#In this example, the chi-square test is significant (chi-square = 86.70, df = 5, p < .001), so the hypothesis of exact fit is rejected; with roughly 2,700 observations, however, even minor misspecifications produce a significant chi-square, so approximate fit indices (CFI, TLI, RMSEA), which can be requested with summary(cfa_fit, fit.measures = TRUE), should also be examined. The loading pattern is consistent with the EFA results, with "A5" serving as the sole indicator of the second factor (F2) and the other four variables loading on the first factor (F1).

Example 2

The following R code demonstrates the different factor extraction methods using the bfi data in the psych package:

library(psych)
data(bfi)

# Subset the data to use only five variables
bfi_subset <- bfi[, c("A1", "A2", "A3", "A4", "A5")]

# Principal Component Analysis (PCA)
pca_model <- principal(bfi_subset, nfactors = 2, rotate = "varimax")
print("Principal Component Analysis (PCA)")
## [1] "Principal Component Analysis (PCA)"
print(pca_model)
## Principal Components Analysis
## Call: principal(r = bfi_subset, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##      RC1   RC2   h2    u2 com
## A1 -0.06  0.95 0.90 0.098 1.0
## A2  0.60 -0.50 0.61 0.394 1.9
## A3  0.76 -0.28 0.65 0.351 1.3
## A4  0.72  0.03 0.51 0.487 1.0
## A5  0.76 -0.11 0.59 0.410 1.0
## 
##                        RC1  RC2
## SS loadings           2.02 1.24
## Proportion Var        0.40 0.25
## Cumulative Var        0.40 0.65
## Proportion Explained  0.62 0.38
## Cumulative Proportion 0.62 1.00
## 
## Mean item complexity =  1.3
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.13 
##  with the empirical chi square  975.4  with prob <  4e-214 
## 
## Fit based upon off diagonal values = 0.86
# Common Factor Analysis (CFA)
cfa_model <- fa(bfi_subset, nfactors = 2, rotate = "varimax", fm = "ml")
print("Common Factor Analysis (CFA)")
## [1] "Common Factor Analysis (CFA)"
print(cfa_model)
## Factor Analysis using method =  ml
## Call: fa(r = bfi_subset, nfactors = 2, rotate = "varimax", fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##      ML1   ML2   h2   u2 com
## A1 -0.15 -0.43 0.21 0.79 1.2
## A2  0.39  0.65 0.58 0.42 1.6
## A3  0.66  0.36 0.57 0.43 1.5
## A4  0.40  0.26 0.23 0.77 1.7
## A5  0.65  0.21 0.46 0.54 1.2
## 
##                        ML1  ML2
## SS loadings           1.19 0.85
## Proportion Var        0.24 0.17
## Cumulative Var        0.24 0.41
## Proportion Explained  0.58 0.42
## Cumulative Proportion 0.58 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  10  and the objective function was  0.93 with Chi Square of  2604.19
## The degrees of freedom for the model are 1  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  2764 with the empirical chi square  5.56  with prob <  0.018 
## The total number of observations was  2800  with Likelihood Chi Square =  6.06  with prob <  0.014 
## 
## Tucker Lewis Index of factoring reliability =  0.98
## RMSEA index =  0.043  and the 90 % confidence intervals are  0.015 0.078
## BIC =  -1.88
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    ML1   ML2
## Correlation of (regression) scores with factors   0.77  0.70
## Multiple R square of scores with factors          0.59  0.48
## Minimum correlation of possible factor scores     0.17 -0.03
# Maximum Likelihood Estimation (MLE)
mle_model <- fa(bfi_subset, nfactors = 2, rotate = "varimax", fm = "ml")
print("Maximum Likelihood Estimation (MLE)")
## [1] "Maximum Likelihood Estimation (MLE)"
print(mle_model)
## Factor Analysis using method =  ml
## Call: fa(r = bfi_subset, nfactors = 2, rotate = "varimax", fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
##      ML1   ML2   h2   u2 com
## A1 -0.15 -0.43 0.21 0.79 1.2
## A2  0.39  0.65 0.58 0.42 1.6
## A3  0.66  0.36 0.57 0.43 1.5
## A4  0.40  0.26 0.23 0.77 1.7
## A5  0.65  0.21 0.46 0.54 1.2
## 
##                        ML1  ML2
## SS loadings           1.19 0.85
## Proportion Var        0.24 0.17
## Cumulative Var        0.24 0.41
## Proportion Explained  0.58 0.42
## Cumulative Proportion 0.58 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  10  and the objective function was  0.93 with Chi Square of  2604.19
## The degrees of freedom for the model are 1  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  2764 with the empirical chi square  5.56  with prob <  0.018 
## The total number of observations was  2800  with Likelihood Chi Square =  6.06  with prob <  0.014 
## 
## Tucker Lewis Index of factoring reliability =  0.98
## RMSEA index =  0.043  and the 90 % confidence intervals are  0.015 0.078
## BIC =  -1.88
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    ML1   ML2
## Correlation of (regression) scores with factors   0.77  0.70
## Multiple R square of scores with factors          0.59  0.48
## Minimum correlation of possible factor scores     0.17 -0.03
# Principal Axis Factoring (PAF)
paf_model <- fa(bfi_subset, nfactors = 2, rotate = "varimax", fm = "paf")
## factor method not specified correctly, minimum residual (unweighted least squares  used
print("Principal Axis Factoring (PAF)")
## [1] "Principal Axis Factoring (PAF)"
print(paf_model)
## Factor Analysis using method =  minres
## Call: fa(r = bfi_subset, nfactors = 2, rotate = "varimax", fm = "paf")
## Standardized loadings (pattern matrix) based upon correlation matrix
##      MR1   MR2   h2   u2 com
## A1 -0.15 -0.41 0.19 0.81 1.3
## A2  0.37  0.69 0.61 0.39 1.5
## A3  0.67  0.36 0.57 0.43 1.5
## A4  0.40  0.25 0.23 0.77 1.7
## A5  0.64  0.22 0.45 0.55 1.2
## 
##                        MR1  MR2
## SS loadings           1.18 0.88
## Proportion Var        0.24 0.18
## Cumulative Var        0.24 0.41
## Proportion Explained  0.57 0.43
## Cumulative Proportion 0.57 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  10  and the objective function was  0.93 with Chi Square of  2604.19
## The degrees of freedom for the model are 1  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  2764 with the empirical chi square  4.96  with prob <  0.026 
## The total number of observations was  2800  with Likelihood Chi Square =  6.62  with prob <  0.01 
## 
## Tucker Lewis Index of factoring reliability =  0.978
## RMSEA index =  0.045  and the 90 % confidence intervals are  0.018 0.08
## BIC =  -1.32
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    MR1  MR2
## Correlation of (regression) scores with factors   0.76 0.72
## Multiple R square of scores with factors          0.58 0.51
## Minimum correlation of possible factor scores     0.17 0.03
# Unweighted Least Squares (ULS)
uls_model <- fa(bfi_subset, nfactors = 2, rotate = "varimax", fm = "uls")
print("Unweighted Least Squares (ULS)")
## [1] "Unweighted Least Squares (ULS)"
print(uls_model)
## Factor Analysis using method =  uls
## Call: fa(r = bfi_subset, nfactors = 2, rotate = "varimax", fm = "uls")
## Standardized loadings (pattern matrix) based upon correlation matrix
##     ULS1  ULS2   h2   u2 com
## A1 -0.15 -0.41 0.19 0.81 1.3
## A2  0.37  0.69 0.61 0.39 1.5
## A3  0.67  0.36 0.57 0.43 1.5
## A4  0.40  0.25 0.23 0.77 1.7
## A5  0.64  0.22 0.45 0.55 1.2
## 
##                       ULS1 ULS2
## SS loadings           1.18 0.88
## Proportion Var        0.24 0.18
## Cumulative Var        0.24 0.41
## Proportion Explained  0.57 0.43
## Cumulative Proportion 0.57 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  10  and the objective function was  0.93 with Chi Square of  2604.19
## The degrees of freedom for the model are 1  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  2764 with the empirical chi square  4.96  with prob <  0.026 
## The total number of observations was  2800  with Likelihood Chi Square =  6.62  with prob <  0.01 
## 
## Tucker Lewis Index of factoring reliability =  0.978
## RMSEA index =  0.045  and the 90 % confidence intervals are  0.018 0.08
## BIC =  -1.32
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                   ULS1 ULS2
## Correlation of (regression) scores with factors   0.76 0.72
## Multiple R square of scores with factors          0.58 0.51
## Minimum correlation of possible factor scores     0.17 0.03
# Maximum Variance Extraction (MVE)
mve_model <- fa(bfi_subset, nfactors = 2, rotate = "varimax", fm = "mve")
## factor method not specified correctly, minimum residual (unweighted least squares  used
print("Maximum Variance Extraction (MVE)")
## [1] "Maximum Variance Extraction (MVE)"
print(mve_model)
## Factor Analysis using method =  minres
## Call: fa(r = bfi_subset, nfactors = 2, rotate = "varimax", fm = "mve")
## Standardized loadings (pattern matrix) based upon correlation matrix
##      MR1   MR2   h2   u2 com
## A1 -0.15 -0.41 0.19 0.81 1.3
## A2  0.37  0.69 0.61 0.39 1.5
## A3  0.67  0.36 0.57 0.43 1.5
## A4  0.40  0.25 0.23 0.77 1.7
## A5  0.64  0.22 0.45 0.55 1.2
## 
##                        MR1  MR2
## SS loadings           1.18 0.88
## Proportion Var        0.24 0.18
## Cumulative Var        0.24 0.41
## Proportion Explained  0.57 0.43
## Cumulative Proportion 0.57 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  10  and the objective function was  0.93 with Chi Square of  2604.19
## The degrees of freedom for the model are 1  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  2764 with the empirical chi square  4.96  with prob <  0.026 
## The total number of observations was  2800  with Likelihood Chi Square =  6.62  with prob <  0.01 
## 
## Tucker Lewis Index of factoring reliability =  0.978
## RMSEA index =  0.045  and the 90 % confidence intervals are  0.018 0.08
## BIC =  -1.32
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    MR1  MR2
## Correlation of (regression) scores with factors   0.76 0.72
## Multiple R square of scores with factors          0.58 0.51
## Minimum correlation of possible factor scores     0.17 0.03

In this example, we are using the “bfi” dataset from the “psych” package and selecting a subset of variables to use for the factor analysis. We then request six different factor extraction approaches: Principal Component Analysis (PCA), Common Factor Analysis, Maximum Likelihood Estimation (MLE), Principal Axis Factoring (PAF), Unweighted Least Squares (ULS), and Maximum Variance Extraction (MVE). Note that the “Common Factor Analysis” and “MLE” calls both use fm = “ml”, so their output is identical.

For each method, we specify two factors and use the varimax rotation method to simplify the factor structure. We then print out the results for each method using the “print” function.

Note that the “fm” argument in the “fa” function specifies the factor extraction method to use. In this example, “ml” requests maximum likelihood and “uls” requests unweighted least squares. The codes “paf” and “mve” are not recognized by the “fa” function, which is why the warnings above indicate that it fell back to the minimum residual (minres) method; the correct code for principal axis factoring is “pa”. If no “fm” argument is specified, the default extraction method in “fa” is “minres” (minimum residual), not maximum likelihood.

The interpretation of the factor analysis results depends on the specific method used and the research question. In general, the summary output provides information about the factor loadings, communalities, eigenvalues, and other relevant statistics.

The factor diagram and biplot can help visualize the relationships between the variables and the factors.

It is important to carefully examine the results and to consider the assumptions of each method before interpreting the factor analysis results.
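
As a hedged sketch of the factor diagram and biplot mentioned above, the psych package provides fa.diagram() for visualizing a fitted fa object. The call below refits a two-factor model on the same bfi_subset used above (any of the fitted models could be passed instead); it assumes the psych package is loaded.

# Refit a two-factor model (minimum residual is the default method)
fa_model <- fa(bfi_subset, nfactors = 2, rotate = "varimax")

# Path-style factor diagram: arrows show which items load on which factor
fa.diagram(fa_model)

# Biplot of factor scores and loadings (scores are computed by default
# when fa() is given raw data rather than a correlation matrix)
biplot(fa_model)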

Determining the number of factors to be extracted

Determining the number of factors to be extracted in a factor analysis is an important step that involves evaluating the fit of the model and selecting the appropriate number of factors. There are several methods for determining the number of factors, including:

  1. Comprehensibility: This method involves selecting the number of factors that make the most sense conceptually or theoretically. For example, if the research question involves identifying the underlying dimensions of a personality test, the number of factors may be based on the number of personality traits that are hypothesized to exist.

  2. Kaiser Criterion: This method involves selecting the number of factors with eigenvalues greater than 1.0, which is based on the assumption that each factor should account for at least as much variance as one of the original variables. However, this method may overestimate the number of factors, particularly when there are many variables in the analysis.

  3. Variance Explained Criteria: This method involves selecting the number of factors that explain a certain percentage of the total variance in the data. For example, a researcher may decide to retain factors that collectively explain at least 60% or 70% of the variance in the data.

  4. Cattell’s Scree Plot: This method involves plotting the eigenvalues of the factors in descending order and selecting the number of factors at the “elbow” of the plot, which represents the point at which the eigenvalues start to level off. However, this method can be subjective and may be influenced by the researcher’s interpretation of the plot.

  5. Horn’s Parallel Analysis: This method involves comparing the eigenvalues of the factors in the actual data to the eigenvalues of factors in randomly generated data with the same sample size and number of variables. The number of factors to retain is based on the eigenvalues of the actual data that exceed the eigenvalues of the randomly generated data. This method is considered to be one of the most accurate methods for determining the number of factors.

In summary, determining the number of factors to be extracted involves evaluating the fit of the model and selecting the appropriate number of factors based on a combination of methods, including comprehensibility, Kaiser criterion, variance explained criteria, Cattell’s scree plot, and Horn’s parallel analysis. It is important to carefully consider the strengths and limitations of each method and to select a method that is appropriate for the research question and the data at hand.
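
To make these criteria concrete, here is a minimal sketch in R (using the psych package and the built-in USArrests data, which also appears in later examples) showing how the Kaiser criterion, Cattell's scree plot, and Horn's parallel analysis can be checked:

# Load the psych package and the built-in USArrests data
library(psych)
data(USArrests)

# Eigenvalues of the correlation matrix (Kaiser criterion: retain eigenvalues > 1)
eigenvalues <- eigen(cor(USArrests))$values
print(eigenvalues)

# Cattell's scree plot: look for the "elbow" where the eigenvalues level off
plot(eigenvalues, type = "b", xlab = "Factor number", ylab = "Eigenvalue",
     main = "Scree plot")
abline(h = 1, lty = 2)  # reference line for the Kaiser criterion

# Horn's parallel analysis: compare observed eigenvalues with eigenvalues
# obtained from random data of the same size
fa.parallel(USArrests, fa = "both")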

Real-life Example 1

Suppose a researcher is interested in identifying the underlying factors that explain the responses to a questionnaire about job satisfaction. The questionnaire includes 20 items that measure various aspects of job satisfaction, such as salary, work environment, and work-life balance.

Comprehensibility: The researcher may start by considering the theoretical or conceptual structure of job satisfaction. For example, if previous research has identified three dimensions of job satisfaction (i.e., intrinsic, extrinsic, and relational), the researcher may decide to extract three factors.

Kaiser Criterion: The researcher may perform a factor analysis and examine the eigenvalues of the factors. If the first three factors have eigenvalues greater than 1.0, the researcher may decide to extract three factors.

Variance Explained Criteria: The researcher may decide to extract the number of factors that explain a certain percentage of the total variance. For example, the researcher may decide to extract the number of factors that collectively explain at least 60% or 70% of the variance in the data.

Cattell’s Scree Plot: The researcher may plot the eigenvalues of the factors in descending order and select the number of factors at the “elbow” of the plot. For example, if the eigenvalues start to level off after the third factor, the researcher may decide to extract three factors.

Horn’s Parallel Analysis: The researcher may compare the eigenvalues of the factors in the actual data to the eigenvalues of factors in randomly generated data with the same sample size and number of variables. If the eigenvalues of the actual data exceed the eigenvalues of the randomly generated data for the first three factors, the researcher may decide to extract three factors.

In this example, the different methods for determining the number of factors may lead to different results. Comprehensibility and previous research suggest that there may be three factors, while the Kaiser criterion, variance explained criteria, and Cattell’s scree plot suggest that three factors may be appropriate. Horn’s parallel analysis may also support the extraction of three factors.

The choice of which method to use ultimately depends on the research question and the nature of the data. In some cases, a combination of methods may be used to determine the appropriate number of factors. For example, the researcher may consider both the theoretical structure of job satisfaction and the results of the factor analysis to decide on the appropriate number of factors to extract.

Real-life Example 2

Suppose a researcher is interested in identifying the underlying factors that explain the responses to a survey on customer satisfaction for a retail store. The survey includes 25 items that measure various aspects of customer satisfaction, such as product quality, store ambiance, customer service, and pricing.

Comprehensibility: The researcher may consider the theoretical or conceptual structure of customer satisfaction based on previous research. For example, if previous research has identified four dimensions of customer satisfaction (i.e., product quality, store ambiance, customer service, and pricing), the researcher may decide to extract four factors.

Kaiser Criterion: The researcher may perform a factor analysis and examine the eigenvalues of the factors. If the first four factors have eigenvalues greater than 1.0, the researcher may decide to extract four factors.

Variance Explained Criteria: The researcher may decide to extract the number of factors that explain a certain percentage of the total variance. For example, the researcher may decide to extract the number of factors that collectively explain at least 70% or 80% of the variance in the data.

Cattell’s Scree Plot: The researcher may plot the eigenvalues of the factors in descending order and select the number of factors at the “elbow” of the plot. For example, if the eigenvalues start to level off after the fourth factor, the researcher may decide to extract four factors.

Horn’s Parallel Analysis: The researcher may compare the eigenvalues of the factors in the actual data to the eigenvalues of factors in randomly generated data with the same sample size and number of variables. If the eigenvalues of the actual data exceed the eigenvalues of the randomly generated data for the first four factors, the researcher may decide to extract four factors.

In this example, the different methods for determining the number of factors may lead to different results. Comprehensibility and previous research suggest that there may be four factors, while the Kaiser criterion, variance explained criteria, and Cattell’s scree plot suggest that four factors may be appropriate. Horn’s parallel analysis may also support the extraction of four factors. The choice of which method to use ultimately depends on the research question and the nature of the data. In some cases, a combination of methods may be used to determine the appropriate number of factors. For example, the researcher may consider both the theoretical structure of customer satisfaction and the results of the factor analysis to decide on the appropriate number of factors to extract.

Overall, when the different criteria converge on the same number of factors, as in the two examples above, the researcher can retain that number with reasonable confidence and then interpret each factor in terms of the variables that load on it. When the criteria disagree, the decision should be guided by the research question, the interpretability of the competing solutions, and the nature of the data; different methods may lead to different conclusions about the appropriate number of factors to extract.

Module IV: Communality and Eigen Values

Introduction

Communality and eigenvalues are two important concepts in factor analysis. Here’s an explanation of what they are and how they are related:

Communalities: In factor analysis, communalities refer to the proportion of variance in each original variable that is accounted for by the extracted factors. Communalities range from 0 to 1, with higher values indicating that a larger proportion of the variance in the variable is explained by the factors. Communalities can be computed as the sum of the squared factor loadings for each variable.

Eigenvalues: Eigenvalues represent the amount of variance in the original variables that is explained by each factor. They are computed as the sum of the squared factor loadings for each factor. Eigenvalues are used to determine the number of factors to extract by examining the magnitude of each eigenvalue.

Communalities and eigenvalues are related in that they both represent the amount of variance in the original variables that is explained by the extracted factors. However, they differ in their interpretation and calculation.

Communalities are used to assess the overall adequacy of the factor solution. Higher communalities indicate that the extracted factors are accounting for a larger proportion of the variance in the original variables. If some variables have low communalities, it may indicate that they are not well represented by the factor solution and that additional factors may be needed to fully capture their variance.

Eigenvalues, on the other hand, are used to determine the number of factors to extract. Factors with eigenvalues greater than 1 are considered to be important and are typically retained. This is because factors with eigenvalues less than 1 explain less variance than a single original variable. Eigenvalues provide information about the relative importance of each factor in explaining the variance in the original variables.

In summary, communalities and eigenvalues are both important measures in factor analysis, but they serve different purposes. Communalities provide information about the overall adequacy of the factor solution, while eigenvalues are used to determine the number of factors to extract.

Meaning of communality

In factor analysis, communality represents the proportion of variance in each original variable that is accounted for by the extracted factors. In other words, it is the amount of shared variance between the original variable and the factors. Communality is computed as the sum of the squared factor loadings for each variable. The squared factor loadings represent the proportion of variance in the variable that is explained by each factor. By summing the squared factor loadings across all factors, we can obtain the proportion of total variance in the variable that is accounted for by the factors. This is the communality.
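
As a small illustration with hypothetical loadings (the numbers below are made up for clarity, not taken from any dataset), the communality of a variable that loads 0.70 on one factor and 0.40 on another is computed as follows:

# Hypothetical loadings of a single variable on two extracted factors
loading_f1 <- 0.70
loading_f2 <- 0.40

# Communality = sum of squared loadings across the factors
communality <- loading_f1^2 + loading_f2^2
print(communality)  # 0.49 + 0.16 = 0.65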

Communality ranges from 0 to 1, with higher values indicating that a larger proportion of the variance in the variable is explained by the factors. A communality of 1 indicates that all the variance in the variable is accounted for by the extracted factors, while a communality of 0 indicates that none of the variance in the variable is accounted for by the factors.

Communality is an important measure in factor analysis because it provides information about the overall adequacy of the factor solution. Higher communalities indicate that the extracted factors are accounting for a larger proportion of the variance in the original variables. If some variables have low communalities, it may indicate that they are not well represented by the factor solution and that additional factors may be needed to fully capture their variance.

In summary, communality is a measure of the amount of shared variance between the original variables and the extracted factors in factor analysis. It is an important measure for assessing the overall adequacy of the factor solution and identifying variables that may need additional factors to fully capture their variance.

Role of communality in Factor Analysis

Communality plays an important role in factor analysis in several ways:

  1. Adequacy of the factor solution: Communality provides information about the overall adequacy of the factor solution. Higher communalities indicate that the extracted factors are accounting for a larger proportion of the variance in the original variables. If some variables have low communalities, it may indicate that they are not well represented by the factor solution and that additional factors may be needed to fully capture their variance.

  2. Factor selection: Communality is used to determine which factors should be retained in the factor solution. Factors with low communalities are less important and may be dropped from the solution. Factors with high communalities are important and should be retained.

  3. Interpretation of factors: Communality provides information about the unique variance in each variable that is not accounted for by the extracted factors. This unique variance can be used to interpret the meaning of each factor. Variables with high communality values are more strongly related to the extracted factors and may be useful for interpreting the meaning of each factor.

  4. Identification of outliers: Communalities can be used to identify outliers in the data. Variables with extremely low communalities may be outliers and may need to be removed from the analysis.

Overall, communality is an important measure in factor analysis that provides information about the overall adequacy of the factor solution, helps in the selection of factors, aids in the interpretation of factors, and can be used to identify outliers in the data.

Computing communality

Computing communality involves calculating the proportion of variance in each original variable that is accounted for by the extracted factors in a factor analysis. Here’s how to compute communality:

  1. Perform a factor analysis on the dataset using a chosen method and number of factors.

  2. Obtain the factor loadings for each variable. These are the correlations between each variable and each factor.

  3. Square the factor loadings for each variable to obtain the proportion of variance in the variable that is accounted for by each factor.

  4. Sum the squared factor loadings across all factors to obtain the total proportion of variance in the variable that is accounted for by the extracted factors.

  5. The sum of the squared factor loadings is the communality for the variable.

Communality ranges from 0 to 1, with higher values indicating that a larger proportion of the variance in the variable is accounted for by the extracted factors.

Example

Here’s an example R code that computes communality for the built-in USArrests dataset:

# Load the built-in USArrests dataset in R
data(USArrests)

# Perform a factor analysis on the dataset with 1 factor
fa <- factanal(USArrests, factors = 1, rotation = "varimax")

# Compute the communality for each variable
communality <- apply(fa$loadings^2, 1, sum)

# Print the communality values for each variable
print(communality)
##     Murder    Assault   UrbanPop       Rape 
## 0.66846109 0.95846324 0.06857577 0.46634842

In this example, we performed a factor analysis with 1 factor on the USArrests dataset using the factanal() function in R. We then computed the communality for each variable by squaring the factor loadings and summing them across all factors using the apply() function in R. Finally, we printed the resulting communality values for each variable.

By examining the communality values, we can see which variables are most strongly related to the extracted factors and how much of their variance is accounted for by the factors.

Interpreting communality

Interpreting communality involves understanding the amount of variance in each original variable that is accounted for by the extracted factors in a factor analysis. Here are some key points to consider when interpreting communality:

  1. High communality values: Variables with high communality values indicate that a large proportion of their variance is accounted for by the extracted factors. These variables are more strongly related to the factors and may be useful for interpreting the meaning of each factor.

  2. Low communality values: Variables with low communality values indicate that a small proportion of their variance is accounted for by the extracted factors. These variables may not be well represented by the factor solution and may need additional factors to fully capture their variance.

  3. Total variance accounted for: The sum of the communality values across all variables indicates the total proportion of variance in the dataset that is explained by the extracted factors. This can be used to assess the overall adequacy of the factor solution.

  4. Outliers: Variables with extremely low communality values may indicate outliers in the data. These variables may not fit well with the overall pattern of the data and may need to be removed from the analysis.

  5. Overlapping variance: It is important to note that communality measures the shared variance between the original variables and the extracted factors. Variables may have unique variance that is not accounted for by the factors. Thus, low communality values do not necessarily mean that a variable is unimportant or should be removed from the analysis.

Overall, interpreting communality involves understanding the extent to which the extracted factors account for the variance in the original variables. High communality values indicate that the extracted factors are strongly related to the variables, while low communality values may indicate the need for additional factors or the presence of outliers in the data.

Eigen Value

In factor analysis, the eigenvalue of a factor represents the amount of variance in the original variables that is explained by that factor. Specifically, it is the sum of the squared factor loadings for the factor.

Eigenvalues provide information about the relative importance of each factor in explaining the variance in the original variables. Factors with higher eigenvalues explain a larger proportion of the variance in the data than factors with lower eigenvalues.

Eigenvalues are used to determine the number of factors to extract in a factor analysis. One common method for selecting the number of factors is to retain only those factors with eigenvalues greater than 1. This is because factors with eigenvalues less than 1 explain less variance than a single original variable. Another method for selecting the number of factors is to examine a scree plot, which shows the eigenvalues for each factor in descending order. The number of factors to extract is chosen at the “elbow” of the plot, where the eigenvalues start to level off.

It is important to note that eigenvalues are relative measures of importance and can be affected by the number of variables and the sample size.

Thus, it is recommended to use multiple methods for determining the number of factors to extract and to interpret the results in conjunction with other information, such as factor loadings and communalities.

Overall, eigenvalues are an important measure in factor analysis that provide information about the relative importance of each factor in explaining the variance in the original variables. They are used to determine the number of factors to extract and aid in the interpretation of the factor solution.

Role of eigen value

The eigenvalue plays an important role in factor analysis in several ways:

  1. Determining the number of factors: Eigenvalues are used to determine the number of factors to extract in factor analysis. Factors with eigenvalues greater than 1 are considered to be important and are typically retained. This is because factors with eigenvalues less than 1 explain less variance than a single original variable. The number of factors to extract can also be determined by examining a scree plot of the eigenvalues.

  2. Assessing factor importance: Eigenvalues provide information about the relative importance of each factor in explaining the variance in the original variables. Factors with higher eigenvalues explain a larger proportion of the variance in the data than factors with lower eigenvalues. This information can be used to assess the importance of each factor in the factor solution.

  3. Interpreting factor meaning: Eigenvalues can aid in the interpretation of the meaning of each factor. Factors with high eigenvalues explain a larger proportion of the variance in the original variables and are more important for interpreting the meaning of the factor.

  4. Identifying outliers: Eigenvalues can be used to identify outliers in the data. Outliers can be identified as variables that have low communalities and/or low eigenvalues.

Overall, eigenvalues are an important measure in factor analysis that provide information about the number of factors to extract, the importance of each factor, and can aid in the interpretation of the factor solution. It is important to note that eigenvalues are relative measures of importance and should be used in conjunction with other information, such as factor loadings and communalities, to fully interpret the results of the factor analysis.

Computing eigen value

Computing the eigenvalues in factor analysis involves extracting the factors and calculating the amount of variance in the original variables that is explained by each factor.

Here’s how to compute the eigenvalues:

  1. Perform a factor analysis on the dataset using a chosen method and number of factors.

  2. Obtain the factor loadings for each variable. These are the correlations between each variable and each factor.

  3. Calculate the correlation matrix for the original variables.

  4. Use a matrix decomposition technique, such as the eigenvalue decomposition, to obtain the eigenvalues and eigenvectors of the correlation matrix.

The eigenvalues represent the amount of variance in the correlation matrix that is accounted for by each eigenvector. The eigenvalues are equal to the sum of the squared loadings for each factor. The eigenvalues can be used to determine the number of factors to retain in the factor solution. Factors with eigenvalues greater than 1 are typically considered to be important and are retained.

Example

Here’s an example R code that computes the eigenvalues for the built-in USArrests dataset:

# Load the built-in USArrests dataset in R
data(USArrests)

# Compute the correlation matrix for the dataset
cor_matrix <- cor(USArrests)

# Perform an eigenvalue decomposition of the correlation matrix
eigen_decomp <- eigen(cor_matrix)

# Extract the eigenvalues
eigenvalues <- eigen_decomp$values

# Print the eigenvalues
print(eigenvalues)
## [1] 2.4802416 0.9897652 0.3565632 0.1734301

In this example, we first computed the correlation matrix for the USArrests dataset using the cor() function in R. We then performed an eigenvalue decomposition of the correlation matrix using the eigen() function in R. Finally, we extracted the eigenvalues from the resulting eigenvalue decomposition and printed them to the console using the print() function. By examining the eigenvalues, we can determine the number of factors to retain in the factor solution and assess the importance of each factor in explaining the variance in the original variables.
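
Building on the quantities just computed, the sketch below (a suggested continuation, not part of the original output) draws a scree plot of the eigenvalues and checks that each eigenvalue equals the sum of squared standardized loadings on the corresponding component:

# Scree plot of the eigenvalues computed above
plot(eigenvalues, type = "b",
     xlab = "Component number", ylab = "Eigenvalue",
     main = "Scree plot for USArrests")
abline(h = 1, lty = 2)  # Kaiser criterion reference line (eigenvalue = 1)

# Check: each eigenvalue equals the column sum of squared standardized loadings.
# For PCA, the standardized loadings are the eigenvectors scaled by the
# square roots of the eigenvalues.
std_loadings <- eigen_decomp$vectors %*% diag(sqrt(eigenvalues))
print(colSums(std_loadings^2))  # reproduces the eigenvalues above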

Interpreting eigen value

Interpreting eigenvalues in factor analysis involves understanding the amount of variance in the original variables that is explained by each factor. Here are some key points to consider when interpreting eigenvalues:

  1. Importance of each factor: Eigenvalues provide information about the importance of each factor in explaining the variance in the original variables. Factors with higher eigenvalues explain a larger proportion of the variance in the data than factors with lower eigenvalues. Factors with eigenvalues greater than 1 are typically considered to be important and are retained in the factor solution.

  2. Number of factors to extract: Eigenvalues are used to determine the number of factors to extract in factor analysis. The number of factors can be determined by retaining only those factors with eigenvalues greater than 1 or by examining a scree plot of the eigenvalues. The number of factors to extract should be based on a combination of the eigenvalues, the factor loadings, and the overall interpretability of the factor solution.

  3. Overlapping variance: It is important to note that eigenvalues measure the shared variance between the original variables and the extracted factors. Variables may have unique variance that is not accounted for by the factors. Thus, low eigenvalues do not necessarily mean that a factor is unimportant or should be removed from the analysis.

  4. Sample size: The eigenvalues can be affected by the sample size. In general, larger sample sizes tend to result in larger eigenvalues. Thus, it is important to interpret eigenvalues in conjunction with the factor loadings and communalities.

In summary, interpreting eigenvalues involves understanding the importance of each factor in explaining the variance in the original variables and determining the number of factors to extract in the factor solution. Eigenvalues should be used in conjunction with other information, such as factor loadings and communalities, to fully interpret the results of the factor analysis.

Using eigen values and communality in factor analysis

Eigenvalues and communality are both important measures in factor analysis that provide information about the relative importance of each factor and the amount of variance in the original variables that is accounted for by the factors. Here’s how to use eigenvalues and communality in factor analysis:

  1. Determine the number of factors to extract: Eigenvalues can be used to determine the number of factors to extract in factor analysis. Factors with eigenvalues greater than 1 are typically considered to be important and are retained in the factor solution. However, the number of factors to extract should also be based on other factors, such as the factor loadings and the overall interpretability of the factor solution.

  2. Assess factor importance: Eigenvalues provide information about the relative importance of each factor in explaining the variance in the original variables. Factors with higher eigenvalues explain a larger proportion of the variance in the data than factors with lower eigenvalues. This information can be used to assess the importance of each factor in the factor solution.

  3. Assess variable importance: Communality measures the amount of variance in the original variables that is accounted for by the extracted factors. Variables with high communality values have a larger proportion of their variance accounted for by the factors and are more important for interpreting the meaning of each factor. Variables with low communality values may not be well represented by the factor solution and may need additional factors to fully capture their variance.

  4. Interpret the factor solution: Eigenvalues and communality can be used in conjunction with factor loadings to interpret the factor solution. High eigenvalues and communality values indicate that the extracted factors are important and explain a large proportion of the variance in the original variables. Additionally, variables with high communality values and strong factor loadings on a particular factor provide insight into the meaning of each factor.

Overall, eigenvalues and communality are important measures in factor analysis that provide information about the relative importance of each factor and the amount of variance in the original variables that is accounted for by the factors. They are used to determine the number of factors to extract, assess the importance of each factor and variable, and aid in the interpretation of the factor solution.

Example 1

For this example, we will use the built-in USArrests dataset in R, which contains data on violent crime rates in US states. We will perform a principal component analysis (PCA) on the dataset and compute both eigenvalues and communality.

# Load the USArrests dataset
data("USArrests")

# Perform a PCA on the dataset
pca <- princomp(USArrests, cor = TRUE)

# Extract the eigenvalues and the communality of each variable
eigenvalues <- pca$sdev^2
communality <- apply(pca$loadings^2, 1, sum)  # sum squared loadings across components for each variable

# Print the eigenvalues and communality
print(eigenvalues)
##    Comp.1    Comp.2    Comp.3    Comp.4 
## 2.4802416 0.9897652 0.3565632 0.1734301
print(communality)
##   Murder  Assault UrbanPop     Rape 
##        1        1        1        1

In this example, we first loaded the USArrests dataset in R. We then performed a PCA on the dataset using the princomp() function, specifying that we want to use the correlation matrix (cor = TRUE). We then extracted the eigenvalues by squaring the standard deviations of the principal components (pca$sdev^2) and the communality of each variable by summing its squared loadings across all components (apply(pca$loadings^2, 1, sum)). Finally, we printed the eigenvalues and communality to the console using the print() function.

The output of the code shows the eigenvalue for each principal component and the communality for each variable. The eigenvalues represent the amount of variance explained by each principal component, while the communality represents the proportion of variance in each variable that is accounted for by the retained components.

Interpretation of the results:

Eigenvalues: The output shows four eigenvalues, one for each principal component, in decreasing order. Each eigenvalue represents the amount of standardized variance explained by the corresponding component. Here the first eigenvalue is about 2.48, which means that the first principal component explains 2.48 of the 4 units of standardized variance in the data, or roughly 62%.

Communality: The output shows four communality values, one for each variable. The communality is the proportion of variance in each variable that is accounted for by the retained components. Because all four components are retained here, every communality equals exactly 1; values below 1 (for example, a communality of 0.81 for Murder) arise only when a reduced number of components or factors is retained.

Overall, the eigenvalues and communality help us understand how much variance is explained by the principal components and how important each variable is in the analysis. We can use this information to decide how many principal components to retain and how to interpret the results of the PCA.

Example 2

For this example, we will use the built-in mtcars dataset in R, which contains data on various characteristics of 32 cars. We will perform a PCA on the dataset and compute both eigenvalues and communality.

# Load the mtcars dataset
data("mtcars")

# Perform a PCA on the dataset
pca <- princomp(mtcars, cor = TRUE)

# Extract the eigenvalues and the communality of each variable
eigenvalues <- pca$sdev^2
communality <- apply(pca$loadings^2, 1, sum)  # sum squared loadings across components for each variable

# Print the eigenvalues and communality
print(eigenvalues)
##     Comp.1     Comp.2     Comp.3     Comp.4     Comp.5     Comp.6     Comp.7 
## 6.60840025 2.65046789 0.62719727 0.26959744 0.22345110 0.21159612 0.13526199 
##     Comp.8     Comp.9    Comp.10    Comp.11 
## 0.12290143 0.07704665 0.05203544 0.02204441
print(communality)
##  mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
##    1    1    1    1    1    1    1    1    1    1    1

In this example, we first loaded the mtcars dataset in R. We then performed a PCA on the dataset using the princomp() function, specifying that we want to use the correlation matrix (cor = TRUE). We then extracted the eigenvalues by squaring the standard deviations of the principal components (pca$sdev^2) and the communality of each variable by summing its squared loadings across all components (apply(pca$loadings^2, 1, sum)). Finally, we printed the eigenvalues and communality to the console using the print() function. The eigenvalues represent the amount of variance explained by each principal component, while the communality represents the proportion of variance in each variable that is accounted for by the retained components.

Interpretation of the results:

Eigenvalues: The output shows 11 eigenvalues, one for each principal component, in decreasing order. Each eigenvalue represents the amount of standardized variance explained by the corresponding component. Here the first eigenvalue is about 6.61, which means that the first principal component explains 6.61 of the 11 units of standardized variance in the data, or roughly 60%.

Communality: The output shows 11 communality values, one for each variable. The communality is the proportion of variance in each variable that is accounted for by the retained components. Because all 11 components are retained here, every communality equals exactly 1; values below 1 (for example, a communality of 0.72 for mpg) arise only when a reduced number of components or factors is retained.

Overall, the eigenvalues and communality help us understand how much variance is explained by the principal components and how important each variable is in the analysis. We can use this information to decide how many principal components to retain and how to interpret the results of the PCA.

Example 3

For this example, we will use the built-in iris dataset in R, which contains data on the measurements of iris flowers. We will perform a PCA on the dataset and compute both eigenvalues and communality.

# Load the iris dataset
data("iris")

# Perform a PCA on the dataset
pca <- princomp(iris[, 1:4], cor = TRUE)

# Extract the eigenvalues and the communality of each variable
eigenvalues <- pca$sdev^2
communality <- apply(pca$loadings^2, 1, sum)  # sum squared loadings across components for each variable

# Print the eigenvalues and communality
print(eigenvalues)
##     Comp.1     Comp.2     Comp.3     Comp.4 
## 2.91849782 0.91403047 0.14675688 0.02071484
print(communality)
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##            1            1            1            1

In this example, we first loaded the iris dataset in R. We then performed a PCA on the first four columns of the dataset (which contain the measurements of the flowers) using the princomp() function, specifying that we want to use the correlation matrix (cor = TRUE). We then extracted the eigenvalues by squaring the standard deviations of the principal components (pca$sdev^2) and the communality of each variable by summing its squared loadings across all components (apply(pca$loadings^2, 1, sum)).

Finally, we printed the eigenvalues and communality to the console using the print() function. The output shows the eigenvalue for each principal component and the communality for each variable. The eigenvalues represent the amount of variance explained by each principal component, while the communality represents the proportion of variance in each variable that is accounted for by the retained components.

Interpretation of the results:

Eigenvalues: The output shows four eigenvalues, one for each principal component, in decreasing order. Each eigenvalue represents the amount of standardized variance explained by the corresponding component. Here the first eigenvalue is about 2.92, which means that the first principal component explains 2.92 of the 4 units of standardized variance in the data, or roughly 73%.

Communality: The output shows four communality values, one for each variable. The communality is the proportion of variance in each variable that is accounted for by the retained components. Because all four components are retained here, every communality equals exactly 1; values below 1 (for example, a communality of 0.76 for Sepal.Length) arise only when a reduced number of components is retained.

Overall, the eigenvalues and communality help us understand how much variance is explained by the principal components and how important each variable is in the analysis. We can use this information to decide how many principal components to retain and how to interpret the results of the PCA.

Module V: Factor Loading

Introduction

Factor loading is a key concept in factor analysis that refers to the correlation between each variable and each factor. Specifically, factor loading represents the extent to which each variable is associated with a particular factor, and is typically expressed as a coefficient that ranges from -1 to 1.

In factor analysis, the goal is to identify a small number of underlying factors that can explain the variance in a set of observed variables. Factor loading is an important measure because it provides information about which variables are most strongly associated with each factor. Variables with high factor loadings on a particular factor are considered to be most strongly related to that factor, while variables with low factor loadings are considered to be less related.

To calculate factor loadings in factor analysis, we first extract the factors from the data using a method such as principal component analysis (PCA) or maximum likelihood estimation. We then calculate the correlation between each variable and each factor, which gives us a set of factor loadings. The factor loadings for each variable can be visualized as a vector that indicates the direction and strength of the association between the variable and each factor.

Interpreting factor loadings involves examining both the magnitude and sign of the coefficients. A positive factor loading indicates a positive association between the variable and the factor, while a negative factor loading indicates a negative association. The magnitude of the factor loading gives us information about the strength of the association, with larger magnitudes indicating stronger associations.

Overall, factor loading is an important concept in factor analysis that helps us understand the relationship between variables and underlying factors. It can be used to identify which variables are most strongly associated with each factor, and to interpret the meaning of the factors in the factor solution.

Meaning and Definition of factor loading

Factor loading is a statistical measure that represents the correlation between a variable and a factor in factor analysis. In other words, it indicates how much of the variance in a particular variable can be explained by a particular factor. Factor loading is a key concept in factor analysis because it helps us understand the relationship between variables and factors, and can be used to identify the most important variables in a factor solution.

More specifically, factor loading is a coefficient that ranges from -1 to 1, with positive values indicating a positive relationship between the variable and the factor, negative values indicating a negative relationship, and values closer to zero indicating a weaker relationship. The magnitude of the factor loading indicates the strength of the relationship, with larger values indicating stronger relationships.

In factor analysis, the goal is to identify underlying factors that can explain the variance in a set of observed variables. Factor loading plays an important role in this process because it helps us determine which variables are most strongly associated with each factor. Variables with high factor loadings on a particular factor are considered to be most strongly related to that factor, while variables with low factor loadings are considered to be less related.

Factor loading can be calculated using various methods, including principal component analysis (PCA), maximum likelihood estimation, and other factor extraction methods. Once the factor loadings are calculated, they can be used to interpret the meaning of the factors in the factor solution and to identify which variables are most important in explaining the underlying factors.

Overall, factor loading is a crucial concept in factor analysis that helps us understand the relationship between variables and factors, and can be used to identify the most important variables in a factor solution.

Role of factor loading

The role of factor loading in factor analysis is to measure the strength and direction of the relationship between a variable and a factor. Factor loading is a key concept in factor analysis because it helps us understand which variables are most strongly associated with each factor, and can be used to interpret the meaning of the factors in the factor solution.

The factor loading for a variable and factor is a correlation coefficient that ranges from -1 to 1, where positive values indicate a positive relationship between the variable and the factor, and negative values indicate a negative relationship. The magnitude of the factor loading indicates the strength of the relationship, with larger values indicating stronger relationships.

The importance of factor loading lies in its ability to help us identify the variables that are most important in explaining the underlying factors. Variables with high factor loadings on a particular factor are considered to be most strongly related to that factor, while variables with low factor loadings are considered to be less related. By examining the factor loadings, we can determine which variables contribute the most to each factor, and use this information to interpret the meaning of the factors.

Factor loading is also used to determine the number of factors that should be retained in the factor solution. In general, factors with high total variance explained and high average factor loadings are retained, while factors with low variance explained and low average factor loadings are discarded.

Overall, the role of factor loading in factor analysis is to measure the strength and direction of the relationship between variables and factors, and to help us identify the variables that are most important in explaining the underlying factors. By examining the factor loadings, we can interpret the meaning of the factors, determine the number of factors to retain, and gain insights into the underlying structure of the data.

Computing factor loading

To compute factor loadings in factor analysis, we need to first extract the factors from the data using a method such as principal component analysis (PCA) or maximum likelihood estimation. Once the factors have been extracted, we can calculate the correlation between each variable and each factor, which gives us a set of factor loadings. Here’s an example of how to compute factor loadings using PCA in R:

# Load the iris dataset
data(iris)

# Perform a PCA on the dataset
pca <- princomp(iris[, 1:4], cor = TRUE)

# Extract the factor loadings
loadings <- pca$loadings

# Print the factor loadings
print(loadings)
## 
## Loadings:
##              Comp.1 Comp.2 Comp.3 Comp.4
## Sepal.Length  0.521  0.377  0.720  0.261
## Sepal.Width  -0.269  0.923 -0.244 -0.124
## Petal.Length  0.580        -0.142 -0.801
## Petal.Width   0.565        -0.634  0.524
## 
##                Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings      1.00   1.00   1.00   1.00
## Proportion Var   0.25   0.25   0.25   0.25
## Cumulative Var   0.25   0.50   0.75   1.00

In this example, we first loaded the iris dataset in R. We then performed a PCA on the first four columns of the dataset (which contain the measurements of the flowers) using the princomp() function, specifying that we want to use the correlation matrix (cor = TRUE). We then extracted the factor loadings from the PCA using the $loadings attribute of the pca object. Finally, we printed the factor loadings to the console using the print() function.

The output of the code will show us the factor loadings for each variable and each factor. The factor loadings for each variable can be visualized as a vector that indicates the direction and strength of the association between the variable and each factor.

Interpreting the factor loadings involves examining both the magnitude and sign of the coefficients. A positive factor loading indicates a positive association between the variable and the factor, while a negative factor loading indicates a negative association. The magnitude of the factor loading gives us information about the strength of the association, with larger magnitudes indicating stronger associations.

Overall, computing factor loadings using PCA in R (or any other statistical software) is a key step in factor analysis that helps us understand the relationship between variables and underlying factors, and can be used to interpret the meaning of the factors in the factor solution.

Interpreting factor loading

Interpreting factor loading is an important step in factor analysis, as it helps us understand the relationship between variables and underlying factors. Factor loading is a coefficient that ranges from -1 to 1, with positive values indicating a positive relationship between the variable and the factor, negative values indicating a negative relationship, and values closer to zero indicating a weaker relationship.

To interpret factor loading, we need to examine both the magnitude and sign of the coefficient. A positive factor loading indicates that the variable increases as the factor increases, while a negative factor loading indicates that the variable decreases as the factor increases. The magnitude of the factor loading indicates the strength of the relationship between the variable and the factor, with larger values indicating stronger relationships. Generally, factor loadings with a magnitude of 0.3 or greater are considered to be meaningful and potentially useful for interpretation. However, the cutoff for “meaningful” factor loadings may vary depending on the specific research question or context. Additionally, it’s important to consider the overall pattern of factor loadings across variables and factors to gain a holistic understanding of the underlying structure of the data.

Factor loadings can be visualized using a scatter plot or a biplot. A scatter plot shows the relationship between two variables in a two-dimensional space, with each variable represented as a point. Factor loadings can be added to the scatter plot as vectors that indicate the direction and strength of the association between the variables and the factors. A biplot is a type of scatter plot that shows both the variables and the factors on the same plot, allowing us to see the relationships between variables and factors in a single visual.
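
As a hedged illustration of these two points, base R's print method for loadings objects accepts a cutoff argument that suppresses small loadings, and the biplot() function plots observations and loading vectors together. The pca object below is assumed to be a princomp fit such as the ones in the examples that follow.

# Suppress loadings with absolute value below 0.3 when printing, so only
# the more meaningful loadings remain visible
print(pca$loadings, cutoff = 0.3)

# Biplot: observations as points, variables as loading vectors on the
# first two principal components
biplot(pca)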

Overall, interpreting factor loading in factor analysis is a crucial step in understanding the underlying structure of the data. By examining the magnitude and sign of the coefficients, we can identify which variables are most strongly associated with each factor, and use this information to interpret the meaning of the factors and gain insights into the underlying structure of the data.

Example 1

R code for computing factor loadings and interpreting the results using the mtcars dataset in R.

# Load the mtcars dataset
data(mtcars)

# Perform a principal component analysis on the dataset
pca <- princomp(mtcars, cor = TRUE)

# Extract the factor loadings
loadings <- pca$loadings

# Print the factor loadings
print(loadings)
## 
## Loadings:
##      Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## mpg   0.363         0.226         0.103  0.109  0.368  0.754  0.236  0.139 
## cyl  -0.374         0.175               -0.169         0.231        -0.846 
## disp -0.368               -0.257  0.394  0.336  0.214         0.198        
## hp   -0.330  0.249 -0.140         0.540                0.222 -0.576  0.248 
## drat  0.294  0.275 -0.161 -0.855        -0.244                      -0.101 
## wt   -0.346 -0.143 -0.342 -0.246         0.465                0.359        
## qsec  0.200 -0.463 -0.403        -0.165  0.330         0.232 -0.528 -0.271 
## vs    0.307 -0.232 -0.429  0.215  0.600 -0.194 -0.266         0.359 -0.159 
## am    0.235  0.429  0.206                0.571 -0.587               -0.178 
## gear  0.207  0.462 -0.290  0.265         0.244  0.605 -0.336        -0.214 
## carb -0.214  0.414 -0.529  0.127 -0.361 -0.184 -0.175  0.396  0.171        
##      Comp.11
## mpg   0.125 
## cyl   0.141 
## disp -0.661 
## hp    0.256 
## drat        
## wt    0.567 
## qsec -0.181 
## vs          
## am          
## gear        
## carb -0.320 
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.091  0.091  0.091  0.091  0.091  0.091  0.091  0.091  0.091
## Cumulative Var  0.091  0.182  0.273  0.364  0.455  0.545  0.636  0.727  0.818
##                Comp.10 Comp.11
## SS loadings      1.000   1.000
## Proportion Var   0.091   0.091
## Cumulative Var   0.909   1.000

In this code, we first load the mtcars dataset in R. We then perform a principal component analysis (PCA) on the dataset using the princomp() function, specifying that we want to use the correlation matrix (cor = TRUE). We then extract the factor loadings from the PCA using the $loadings attribute of the pca object. Finally, we print the factor loadings to the console using the print() function. The output of the code will show us the factor loadings for each variable and each factor. The factor loadings for each variable can be visualized as a vector that indicates the direction and strength of the association between the variable and each factor.

To interpret the factor loadings, we need to examine both the magnitude and sign of the coefficients. A positive factor loading indicates a positive association between the variable and the factor, while a negative factor loading indicates a negative association. The magnitude of the factor loading gives us information about the strength of the association, with larger magnitudes indicating stronger associations.

For example, the output shows that mpg has a positive loading on the first component (about 0.36), while cyl, disp, hp, and wt have negative loadings of similar magnitude. This suggests that the first component contrasts fuel efficiency with engine size, power, and weight. We can use this information to interpret the meaning of the first component and gain insights into the underlying structure of the data.

Overall, computing factor loadings using PCA in R and interpreting the results is a crucial step in factor analysis that helps us understand the relationship between variables and underlying factors, and can be used to interpret the meaning of the factors in the factor solution.

Example 2

R code for computing factor loadings and interpreting the results using the USArrests dataset in R.

# Load the USArrests dataset
data(USArrests)

# Perform a principal component analysis on the dataset
pca <- princomp(USArrests, cor = TRUE)

# Extract the factor loadings
loadings <- pca$loadings

# Print the factor loadings
print(loadings)
## 
## Loadings:
##          Comp.1 Comp.2 Comp.3 Comp.4
## Murder    0.536  0.418  0.341  0.649
## Assault   0.583  0.188  0.268 -0.743
## UrbanPop  0.278 -0.873  0.378  0.134
## Rape      0.543 -0.167 -0.818       
## 
##                Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings      1.00   1.00   1.00   1.00
## Proportion Var   0.25   0.25   0.25   0.25
## Cumulative Var   0.25   0.50   0.75   1.00

In this code, we first load the USArrests dataset in R. This dataset contains the number of arrests per 100,000 residents for three violent crimes (murder, assault, and rape) in each of the 50 US states in 1973, together with the percentage of the population living in urban areas (UrbanPop). We then perform a principal component analysis (PCA) on the dataset using the princomp() function, specifying that we want to use the correlation matrix (cor = TRUE). We then extract the factor loadings from the PCA using the $loadings attribute of the pca object.

Finally, we print the factor loadings to the console using the print() function. The output of the code will show us the factor loadings for each variable and each factor. The factor loadings for each variable can be visualized as a vector that indicates the direction and strength of the association between the variable and each factor.

To interpret the factor loadings, we need to examine both the magnitude and sign of the coefficients. In the output above, Murder, Assault, and Rape all have strong positive loadings on the first component (0.54 to 0.58), which suggests that the first component reflects violent crime in general. The second component is dominated by a large negative loading for UrbanPop (-0.87), which suggests that it mainly distinguishes states by how urbanized they are rather than by crime.

We can use this information to interpret the meaning of the factors and gain insights into the underlying structure of the data. Overall, computing factor loadings using PCA in R and interpreting the results is a crucial step in factor analysis that helps us understand the relationship between variables and underlying factors, and can be used to interpret the meaning of the factors in the factor solution.

Factor rotation

Factor rotation is a technique used in factor analysis to improve the interpretability of the factor solution. The goal of factor rotation is to find a new set of factors that are easier to interpret than the original unrotated factors.

In the initial, unrotated solution, the factors are extracted so as to explain as much variance as possible, and as a result individual variables often load appreciably on several factors at once. This makes it difficult to attach a clear meaning to each factor.

Factor rotation redistributes this explained variance to produce a simpler, more interpretable pattern of loadings. There are several methods of factor rotation, including orthogonal rotation methods such as Varimax and Quartimax, which keep the factors uncorrelated, and oblique rotation methods such as Promax and Oblimin, which allow the factors to correlate.

Orthogonal rotation methods, such as Varimax and Quartimax, rotate the factors in a way that maximizes the variance of the squared factor loadings within each factor. This results in a factor solution where each variable loads heavily on only one factor, making it easier to interpret the meaning of each factor. Oblique rotation methods, such as Promax and Oblimin, allow the factors to be correlated with each other, which may be more realistic in some cases where the factors are expected to be correlated. These methods rotate the factors to minimize the complexity of the factor solution, while still allowing the factors to be correlated.

To perform factor rotation in R, rotation can be requested directly through the rotate argument of functions such as fa() in the psych package, or applied to an existing loadings matrix using the psych package, as in the following example of a Varimax rotation:

# Load the iris dataset
data(iris)

# Perform a principal component analysis on the dataset
pca <- princomp(iris[, 1:4], cor = TRUE)

# Perform Varimax rotation on the factor solution
rotated <- psych::faRotate(pca$loadings,"varimax")
# Print the rotated factor loadings
print(rotated)
## 
## Call: NULL
## Standardized loadings (pattern matrix) based upon correlation matrix
##              Comp.1 Comp.2 Comp.3 Comp.4 h2       u2
## Sepal.Length      0      0      1      0  1 -2.2e-16
## Sepal.Width       0      1      0      0  1 -6.7e-16
## Petal.Length      0      0      0     -1  1  3.3e-16
## Petal.Width       1      0      0      0  1  5.6e-16
## 
##                       Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings             1.00   1.00   1.00   1.00
## Proportion Var          0.25   0.25   0.25   0.25
## Cumulative Var          0.25   0.50   0.75   1.00
## Proportion Explained    0.25   0.25   0.25   0.25
## Cumulative Proportion   0.25   0.50   0.75   1.00

In this example, we first load the iris dataset in R and perform a principal component analysis using the princomp() function. We then apply Varimax rotation to the component loadings using the faRotate() function from the psych package, specifying “varimax” as the rotation method. Finally, we print the rotated factor loadings to the console using the print() function.

Overall, factor rotation is a useful technique in factor analysis that can help improve the interpretability of the factor solution by simplifying the relationships between variables and factors.

Why Factor rotation?

Factor rotation is used in factor analysis to improve the interpretability of the factor solution. The goal of factor analysis is to identify underlying factors that explain the patterns of correlations among a set of observed variables. However, the initial factor solution may not be easy to interpret because each variable may load on multiple factors, and the factors may not be clearly defined or easily distinguishable from each other.

Factor rotation helps to simplify the factor solution by rotating the original factors into a new set of orthogonal or oblique factors that are easier to interpret. By doing so, factor rotation can improve the clarity and meaningfulness of the factor solution, and help researchers identify the underlying constructs that are driving the patterns of correlations among the observed variables. Orthogonal rotation methods, such as Varimax and Quartimax, rotate the factors to be orthogonal to each other, meaning that the factors are uncorrelated. This simplifies the interpretation of the factor solution by ensuring that each variable loads heavily on only one factor, and that each factor represents a distinct underlying construct. Orthogonal rotation is particularly useful when the factors are expected to be unrelated to each other.

Oblique rotation methods, such as Promax and Oblimin, allow the factors to be correlated with each other, which may be more realistic in some cases where the factors are expected to be correlated. These methods rotate the factors to minimize the complexity of the factor solution, while still allowing the factors to be correlated. Oblique rotation is particularly useful when the factors are expected to be related to each other, such as in the case of personality traits or cognitive abilities.

In summary, factor rotation is an important step in factor analysis that helps to simplify and clarify the factor solution, making it easier to interpret and understand the underlying constructs that are driving the patterns of correlations among the observed variables.

Methods of Factor rotation

There are two main methods of factor rotation in factor analysis: orthogonal rotation and oblique rotation.

Orthogonal Rotation: Orthogonal rotation methods, such as Varimax and Quartimax, rotate the original factors so that they remain orthogonal to each other. This means that the factors are uncorrelated, and each variable ideally loads heavily on only one factor. The main goal of orthogonal rotation is to simplify the factor solution and make it easier to interpret. The most commonly used orthogonal rotation method is Varimax, which maximizes the variance of the squared factor loadings within each factor. Quartimax is another orthogonal rotation method; it simplifies the loading pattern within each variable, minimizing the number of factors needed to explain each variable.

Oblique Rotation: Oblique rotation methods, such as Promax and Oblimin, allow the original factors to be correlated with each other. This means that the factors are not orthogonal, and each variable may load on multiple factors. The main goal of oblique rotation is to find a simpler factor structure while still allowing the factors themselves to correlate.

Promax is the most commonly used oblique rotation method; it starts from an orthogonal (typically Varimax) solution and then relaxes the orthogonality constraint to obtain a simpler loading pattern with correlated factors. Oblimin is another oblique rotation method that minimizes the complexity of the loading pattern while allowing a controlled degree of correlation among the factors.

Both orthogonal and oblique rotation methods have their advantages and disadvantages, and the choice of rotation method depends on the specific research question and the nature of the data. Orthogonal rotation is more straightforward and easier to interpret, but may not be appropriate in cases where the factors are expected to be correlated. Oblique rotation is more flexible and can better account for the correlation among the variables, but may produce a more complex factor solution that is harder to interpret. It’s important to carefully consider the benefits and drawbacks of each rotation method before selecting the appropriate one for a given analysis.

Computing and Interpreting VARIMAX AND QUARTIMAX Rotations in R

R code for computing and interpreting Varimax and Quartimax rotations using the mtcars dataset in R:

# Load the mtcars dataset
data(mtcars)

# Perform a principal component analysis on the dataset
pca <- princomp(mtcars, cor = TRUE)

# Extract the factor loadings
loadings <- pca$loadings

# Perform Varimax rotation on the factor solution
varimax_rotated <- psych::faRotate(loadings, "varimax")

# Perform Quartimax rotation on the factor solution
quartimax_rotated <- psych::faRotate(loadings, "quartimax")
## Loading required namespace: GPArotation
# Print the rotated factor loadings
print(varimax_rotated)
## 
## Call: NULL
## Standardized loadings (pattern matrix) based upon correlation matrix
##      Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## mpg       0      0      0      0      0      0      0      1      0       0
## cyl       0      0      0      0      0      0      0      0      0      -1
## disp      0      0      0      0      0      0      0      0      0       0
## hp        0      0      0      0      0      0      0      0     -1       0
## drat      0      0      0     -1      0      0      0      0      0       0
## wt       -1      0      0      0      0      0      0      0      0       0
## qsec      0     -1      0      0      0      0      0      0      0       0
## vs        0      0      0      0      1      0      0      0      0       0
## am        0      0      0      0      0      1      0      0      0       0
## gear      0      0      0      0      0      0      1      0      0       0
## carb      0      0     -1      0      0      0      0      0      0       0
##      Comp.11 h2       u2
## mpg        0  1 -2.2e-15
## cyl        0  1 -2.0e-15
## disp      -1  1 -6.7e-16
## hp         0  1 -8.9e-16
## drat       0  1 -1.3e-15
## wt         0  1  5.6e-16
## qsec       0  1 -1.1e-15
## vs         0  1 -1.1e-15
## am         0  1 -1.3e-15
## gear       0  1 -2.0e-15
## carb       0  1 -1.6e-15
## 
##                       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings             1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
## Proportion Var          0.09   0.09   0.09   0.09   0.09   0.09   0.09   0.09
## Cumulative Var          0.09   0.18   0.27   0.36   0.45   0.55   0.64   0.73
## Proportion Explained    0.09   0.09   0.09   0.09   0.09   0.09   0.09   0.09
## Cumulative Proportion   0.09   0.18   0.27   0.36   0.45   0.55   0.64   0.73
##                       Comp.9 Comp.10 Comp.11
## SS loadings             1.00    1.00    1.00
## Proportion Var          0.09    0.09    0.09
## Cumulative Var          0.82    0.91    1.00
## Proportion Explained    0.09    0.09    0.09
## Cumulative Proportion   0.82    0.91    1.00
print(quartimax_rotated)
## 
## Call: NULL
## Standardized loadings (pattern matrix) based upon correlation matrix
##      Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## mpg       0      0      0      0      0      0      0      1      0       0
## cyl       0      0      0      0      0      0      0      0      0      -1
## disp      0      0      0      0      0      0      0      0      0       0
## hp        0      0      0      0      0      0      0      0     -1       0
## drat      0      0      0     -1      0      0      0      0      0       0
## wt       -1      0      0      0      0      0      0      0      0       0
## qsec      0     -1      0      0      0      0      0      0      0       0
## vs        0      0      0      0      1      0      0      0      0       0
## am        0      0      0      0      0      1      0      0      0       0
## gear      0      0      0      0      0      0      1      0      0       0
## carb      0      0     -1      0      0      0      0      0      0       0
##      Comp.11 h2       u2
## mpg        0  1  6.7e-16
## cyl        0  1 -4.4e-16
## disp      -1  1  2.2e-16
## hp         0  1 -8.9e-16
## drat       0  1 -4.4e-16
## wt         0  1  1.2e-15
## qsec       0  1 -1.1e-15
## vs         0  1  1.1e-16
## am         0  1 -8.9e-16
## gear       0  1  1.1e-16
## carb       0  1 -2.2e-16
## 
##                       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings             1.00   1.00   1.00   1.00   1.00   1.00   1.00   1.00
## Proportion Var          0.09   0.09   0.09   0.09   0.09   0.09   0.09   0.09
## Cumulative Var          0.09   0.18   0.27   0.36   0.45   0.55   0.64   0.73
## Proportion Explained    0.09   0.09   0.09   0.09   0.09   0.09   0.09   0.09
## Cumulative Proportion   0.09   0.18   0.27   0.36   0.45   0.55   0.64   0.73
##                       Comp.9 Comp.10 Comp.11
## SS loadings             1.00    1.00    1.00
## Proportion Var          0.09    0.09    0.09
## Cumulative Var          0.82    0.91    1.00
## Proportion Explained    0.09    0.09    0.09
## Cumulative Proportion   0.82    0.91    1.00

In this code, we first load the mtcars dataset in R. We then perform a principal component analysis (PCA) on the dataset using the princomp() function, specifying that we want to use the correlation matrix (cor = TRUE). We then extract the factor loadings from the PCA using the $loadings component of the pca object.

Next, we perform Varimax and Quartimax rotations on those loadings using the faRotate() function from the psych package, specifying “varimax” and “quartimax” as the rotation methods, respectively.

Finally, we print the rotated factor loadings to the console using the print() function. To interpret the results, we look at the loadings of each variable on each factor in the rotated factor solution. A high loading indicates a strong relationship between the variable and the factor, while a low loading indicates a weak relationship.

For example, in a typical analysis the Varimax output might show that the mpg variable has a high loading on the first factor, while the disp variable has a high loading on the second factor. This would suggest that the first factor is primarily associated with fuel efficiency, while the second factor is primarily associated with engine size.

Similarly, the Quartimax output might show that the mpg variable has a high loading on the first factor, while the cyl variable has a high loading on the second factor, suggesting that the first factor is primarily associated with fuel efficiency and the second with the number of cylinders.

Note, however, that in the example above all eleven components were retained before rotation, so each variable ends up loading on exactly one component and the two rotated solutions are essentially identical and uninformative; the differences between Varimax and Quartimax only become visible when fewer factors are retained than there are variables. Overall, Varimax and Quartimax rotations can yield different factor solutions, depending on the nature of the data and the research question. It’s important to carefully interpret the rotated factor loadings and choose the rotation method that best fits the research question and provides the most interpretable factor solution.
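A more informative comparison requires retaining fewer components before rotating. A minimal sketch, assuming the psych package (whose principal() function is also used in the next example) and GPArotation are installed:

# Retain only two components and then rotate, so the rotated loadings are no longer trivial
library(psych)
data(mtcars)
pc_varimax   <- principal(mtcars, nfactors = 2, rotate = "varimax")    # orthogonal rotation
pc_quartimax <- principal(mtcars, nfactors = 2, rotate = "quartimax")  # orthogonal rotation
print(pc_varimax$loadings, cutoff = 0.4)    # suppress small loadings for readability
print(pc_quartimax$loadings, cutoff = 0.4)  # compare with the Varimax pattern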

Example

The following example demonstrates the different factor extraction methods using the “bfi” dataset in the “psych” package in R:

#First, we will load the "psych" package and the "bfi" dataset:

library(psych)
data(bfi)
#Next, we will select a subset of variables from the "bfi" dataset to use for the factor analysis:
bfi_subset <- bfi[, c("A1", "A2", "A3", "A4", "A5")]
#Principal Component Analysis (PCA)
#To conduct PCA, we will use the "principal" function from the "psych" package:
pca_model <- principal(bfi_subset, nfactors = 2, rotate = "varimax")
#In this example, we are specifying two factors ("nfactors = 2") and using the "varimax" rotation method to simplify the factor structure.
##To interpret the PCA results, we can use the following code:
print(pca_model)
## Principal Components Analysis
## Call: principal(r = bfi_subset, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##      RC1   RC2   h2    u2 com
## A1 -0.06  0.95 0.90 0.098 1.0
## A2  0.60 -0.50 0.61 0.394 1.9
## A3  0.76 -0.28 0.65 0.351 1.3
## A4  0.72  0.03 0.51 0.487 1.0
## A5  0.76 -0.11 0.59 0.410 1.0
## 
##                        RC1  RC2
## SS loadings           2.02 1.24
## Proportion Var        0.40 0.25
## Cumulative Var        0.40 0.65
## Proportion Explained  0.62 0.38
## Cumulative Proportion 0.62 1.00
## 
## Mean item complexity =  1.3
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.13 
##  with the empirical chi square  975.4  with prob <  4e-214 
## 
## Fit based upon off diagonal values = 0.86
#This will print out the factor loadings for each variable and the communalities (the proportion of variance in each variable explained by the factors). 

#In this example, the PCA results suggest two components. The first component (RC1) has high loadings on "A2" through "A5", while the second component (RC2) is dominated by "A1" with a moderate negative loading for "A2". The communalities (h2) range from 0.51 to 0.90, indicating that a substantial proportion of the variance in each variable is explained by the two components.
#Exploratory Factor Analysis (EFA)
#To conduct EFA, we will use the "fa" function from the "psych" package:
efa_model <- fa(bfi_subset, nfactors = 2, rotate = "varimax")
#In this example, we are specifying two factors ("nfactors = 2") and using the "varimax" rotation method to simplify the factor structure.
#To interpret the EFA results, we can use the following code:
print(efa_model)
## Factor Analysis using method =  minres
## Call: fa(r = bfi_subset, nfactors = 2, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##      MR1   MR2   h2   u2 com
## A1 -0.15 -0.41 0.19 0.81 1.3
## A2  0.37  0.69 0.61 0.39 1.5
## A3  0.67  0.36 0.57 0.43 1.5
## A4  0.40  0.25 0.23 0.77 1.7
## A5  0.64  0.22 0.45 0.55 1.2
## 
##                        MR1  MR2
## SS loadings           1.18 0.88
## Proportion Var        0.24 0.18
## Cumulative Var        0.24 0.41
## Proportion Explained  0.57 0.43
## Cumulative Proportion 0.57 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  10  and the objective function was  0.93 with Chi Square of  2604.19
## The degrees of freedom for the model are 1  and the objective function was  0 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  2764 with the empirical chi square  4.96  with prob <  0.026 
## The total number of observations was  2800  with Likelihood Chi Square =  6.62  with prob <  0.01 
## 
## Tucker Lewis Index of factoring reliability =  0.978
## RMSEA index =  0.045  and the 90 % confidence intervals are  0.018 0.08
## BIC =  -1.32
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    MR1  MR2
## Correlation of (regression) scores with factors   0.76 0.72
## Multiple R square of scores with factors          0.58 0.51
## Minimum correlation of possible factor scores     0.17 0.03
#This will print out the factor loadings for each variable and the communalities (the proportion of variance in each variable explained by the factors).

In this example, the EFA results suggest that two factors account for the correlations among the five variables. The first factor (MR1) has its highest loadings on “A3” and “A5”, with a moderate loading on “A2”, while the second factor (MR2) is marked by “A2” and a negative loading on “A1”. The communalities (h2) range from 0.19 to 0.61, so the two factors explain a modest to moderate proportion of the variance in each item, noticeably less than the corresponding PCA solution.
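For a side-by-side view of how the component and factor solutions differ on the same items, the two loading matrices fitted above can be bound into one table (a small sketch; unclass() strips the loadings class so that cbind() returns an ordinary matrix):

# Compare the PCA loadings (RC1, RC2) with the EFA loadings (MR1, MR2)
round(cbind(unclass(pca_model$loadings), unclass(efa_model$loadings)), 2)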

Confirmatory Factor Analysis (CFA)

To conduct CFA, we first need to specify a theoretical model that specifies how the variables are related to the factors. In this example, we will specify a model in which all five variables load onto two factors, with the first factor (“F1”) defined by “A1”, “A2”, “A3”, and “A4”, and the second factor (“F2”) defined by “A5”. We will use the “cfa” function from the “lavaan” package to estimate the model:

library(lavaan)
cfa_model <- '
F1 =~ A1 + A2 + A3 + A4
F2 =~ A5
'
cfa_fit <- cfa(cfa_model, data = bfi_subset)

#To interpret the CFA results, we can use the following code:

summary(cfa_fit)
## lavaan 0.6.15 ended normally after 32 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        10
## 
##                                                   Used       Total
##   Number of observations                          2709        2800
## 
## Model Test User Model:
##                                                       
##   Test statistic                                86.696
##   Degrees of freedom                                 5
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   F1 =~                                               
##     A1                1.000                           
##     A2               -1.465    0.090  -16.310    0.000
##     A3               -1.880    0.113  -16.696    0.000
##     A4               -1.358    0.093  -14.626    0.000
##   F2 =~                                               
##     A5                1.000                           
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   F1 ~~                                               
##     F2               -0.418    0.028  -14.837    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .A1                1.693    0.048   34.915    0.000
##    .A2                0.784    0.029   27.443    0.000
##    .A3                0.714    0.035   20.314    0.000
##    .A4                1.694    0.051   33.277    0.000
##    .A5                0.000                           
##     F1                0.279    0.031    8.870    0.000
##     F2                1.591    0.043   36.804    0.000

This will print out the unstandardized factor loadings, the factor variances and covariance, and the chi-square test of model fit.

In this example, the chi-square test is significant (test statistic = 86.70 on 5 degrees of freedom, p < .001), which indicates some misfit; with roughly 2,700 observations, however, the chi-square test is very sensitive, so it should be supplemented with additional fit indices before judging the model. The loading pattern is consistent with the EFA in that the four remaining items load on the first factor (F1), while “A5” defines the second factor (F2) by construction.
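The default summary() call above reports only the chi-square test. Standard lavaan helpers can be used to request common fit indices and standardized loadings; a brief sketch:

# Common fit indices for the fitted CFA model
fitMeasures(cfa_fit, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))

# Standardized loadings, variances, and covariances
standardizedSolution(cfa_fit)

# Or request both in a single summary
summary(cfa_fit, fit.measures = TRUE, standardized = TRUE)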

Varimax, Oblimin and Promax Factor Rotation Techniques

Varimax, oblimin, and promax are all methods of factor rotation in exploratory factor analysis, and they differ in their approach to the rotation of the factor matrix. Here is a brief explanation of each method and how to implement it in R using the psych package:

Varimax Rotation

Varimax is an orthogonal rotation method, which means that it produces uncorrelated factors that are easier to interpret. It rotates the factor matrix to maximize the variance of the squared loadings for each factor.

# Perform varimax rotation on the factor analysis object "my_EFA"
library(psych)
data(bfi)
my_data<- bfi[, c("A1", "A2", "A3", "A4", "A5")]
my_EFA_varimax <- fa(my_data, nfactors = 3, rotate = "varimax")

Oblimin Rotation

Oblimin is a non-orthogonal rotation method, which means that it allows for correlated factors. It rotates the factor matrix to minimize the number of factors that have high loadings on a given variable.

# Perform oblimin rotation on the factor analysis object "my_EFA"
library(psych)
data(bfi)
my_data<- bfi[, c("A1", "A2", "A3", "A4", "A5")]
my_EFA_oblimin <- fa(my_data, nfactors = 3, rotate = "oblimin")

Promax Rotation

Promax is also a non-orthogonal (oblique) rotation method. It starts from an orthogonal Varimax solution and then raises the loadings to a power to form a simpler target pattern, allowing the factors to become correlated in the process. It rotates the factor matrix to minimize the complexity of the factor structure.

# Perform promax rotation on the factor analysis object "my_EFA"
library(psych)
data(bfi)
my_data<- bfi[, c("A1", "A2", "A3", "A4", "A5")]
my_EFA_promax <- fa(my_data, nfactors = 3, rotate = "promax")

When choosing a rotation method, it is important to consider the underlying structure of the data and the research question. Orthogonal rotation methods like varimax are often used when the factors are expected to be uncorrelated, while non-orthogonal methods like oblimin and promax are better suited for situations where correlations between factors are expected.
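One practical check is to fit an oblique rotation first and inspect the estimated factor correlations; if they are close to zero, an orthogonal rotation is a defensible simplification. A small sketch using the objects fitted above:

# Factor correlation matrices from the oblique solutions
round(my_EFA_oblimin$Phi, 2)  # oblimin factor correlations
round(my_EFA_promax$Phi, 2)   # promax factor correlations
# Correlations near zero suggest an orthogonal rotation such as varimax would suffice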

Module VI: Factor Scores

Introduction

Factor scores are values that represent the degree to which each observation (e.g., individual, subject, or item) in a dataset is associated with each underlying factor identified in a factor analysis. The factor scores are estimated based on the factor loadings and the observed values of the variables in the dataset.

In factor analysis, the goal is to identify the underlying factors that explain the patterns of correlations among a set of observed variables. Once the factors have been identified, the factor scores can be computed as weighted sums of the observed variables, where the weights are the factor loadings. The factor scores provide a way to summarize the information in the original dataset in terms of the underlying factors, and can be used in subsequent analyses to examine the relationships between the factors and other variables of interest.

There are several methods for computing factor scores, including regression-based (Thurstone) methods, Bartlett’s method, and the Anderson-Rubin method.

The most commonly used approach is the regression method, in which the score weights are obtained by multiplying the inverse of the correlation matrix of the observed variables by the factor loadings (for oblique solutions, by the structure matrix); the factor scores are then the standardized observed variables multiplied by these weights. This method assumes that the relationships between the variables and the factors are linear and that the variables are approximately normally distributed.
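As a rough sketch of the regression approach under an orthogonal rotation (the object names below are illustrative, and psych’s own estimates may differ slightly because of how it prepares the correlation matrix):

# Regression-method factor scores computed "by hand" for an orthogonal solution
library(psych)
data(mtcars)
Z   <- scale(mtcars[, 1:7])                                  # standardized observed variables
fit <- fa(mtcars[, 1:7], nfactors = 2, rotate = "varimax")   # orthogonal two-factor solution
R   <- cor(mtcars[, 1:7])                                    # correlation matrix of the variables
W   <- solve(R) %*% fit$loadings                             # score weights: R^-1 %*% loadings
scores_by_hand <- Z %*% W                                    # scores as weighted sums of the variables
head(round(scores_by_hand, 3))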

To compute factor scores in R, we can use the fa() function from the psych package, which returns the scores in its $scores component, or the factanal() function with its scores argument. Here is an example of computing factor scores using the mtcars dataset in R:

# Load the mtcars dataset
data(mtcars)

# Perform a factor analysis on the dataset
# Note: fa() takes 'nfactors', not 'factors', so the default single factor
# is extracted here, which is why only MR1 appears in the output below
fa <- fa(mtcars, factors = 3)
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.
# Compute the factor scores
scores <- fa$scores

# Print the factor scores
print(scores)
##                             MR1
## Mazda RX4           -0.28763940
## Mazda RX4 Wag       -0.18983052
## Datsun 710          -0.88595192
## Hornet 4 Drive      -0.01490428
## Hornet Sportabout    0.82885800
## Valiant              0.28457061
## Duster 360           0.62279129
## Merc 240D           -1.17915024
## Merc 230            -0.43238828
## Merc 280            -0.19456626
## Merc 280C            0.06297357
## Merc 450SE           0.65675443
## Merc 450SL           0.66981727
## Merc 450SLC          0.93386568
## Cadillac Fleetwood   1.66959039
## Lincoln Continental  1.55628504
## Chrysler Imperial    1.03305644
## Fiat 128            -1.50548276
## Honda Civic         -1.46279990
## Toyota Corolla      -1.49740731
## Toyota Corona       -1.33156582
## Dodge Challenger     0.92913927
## AMC Javelin          1.05338433
## Camaro Z28           0.58733257
## Pontiac Firebird     0.87048462
## Fiat X1-9           -1.17212575
## Porsche 914-2       -1.23789281
## Lotus Europa        -1.58359440
## Ford Pantera L       1.54594493
## Ferrari Dino        -0.37155924
## Maserati Bora        0.91052831
## Volvo 142E          -0.86851785

In this code, we first load the mtcars dataset in R. We then run a factor analysis on the dataset with the fa() function from the psych package. Although the call passes factors = 3, fa() names this argument nfactors, so the default single factor is extracted, which is why the output contains only one column of scores (MR1). The fa() function also computes the factor loadings and the communalities by default.

Next, we take the factor scores from the $scores component of the fitted fa object, which holds the estimated score of each observation on the extracted factor.

Finally, we print the factor scores to the console using the print() function.

To interpret the results, we can examine the values of the factor scores for each observation and each factor. Higher values indicate a stronger association with the corresponding factor, while lower values indicate a weaker association. We can use these factor scores in subsequent analyses to examine the relationships between the factors and other variables of interest, or to group observations based on their similarity in terms of the underlying factors.

Meaning and Definition of Factor Scores

Factor scores are values that represent the degree to which each observation in a dataset is associated with each underlying factor identified in a factor analysis. They are computed as weighted sums of the observed variables, where the weights are the factor loadings.

The factor scores provide a way to summarize the information in the original dataset in terms of the underlying factors. They can be used in subsequent analyses to examine the relationships between the factors and other variables of interest, or to group observations based on their similarity in terms of the underlying factors.

In other words, factor scores are a way to quantify the degree to which each observation in a dataset exhibits the characteristics or traits that are captured by the underlying factors. For example, in a factor analysis of personality traits, the factor scores might represent the degree to which each individual in a sample exhibits the traits of extraversion, agreeableness, conscientiousness, neuroticism, and openness. The interpretation of factor scores depends on the specific research question and the nature of the factors being analyzed. In general, higher factor scores indicate a stronger association with the corresponding factor, while lower factor scores indicate a weaker association. Factor scores can be used to identify individuals or items that are high or low on a particular factor, or to group individuals or items based on their similarity in terms of the underlying factors.

It’s important to note that factor scores are estimates, and may not be perfectly accurate. The accuracy of the factor scores depends on the quality and reliability of the factor analysis and the observed variables.

Additionally, the interpretation of factor scores is subject to the same limitations and assumptions as factor analysis itself, such as the assumptions of linearity and normality. Therefore, it’s important to carefully interpret and use factor scores in conjunction with other measures and analyses to fully understand the underlying factors and their relationships with other variables of interest.

Computing Factor Scores

To compute factor scores in R, you can take them from the $scores component of a fitted fa() or principal() object, request them from factanal() through its scores argument, or, for a princomp() object, obtain the component scores with predict(). Here’s an example using the iris dataset in R:

# Load the iris dataset
data(iris)

# Perform a factor analysis on the dataset
fa <- fa(iris[,1:4], factors = 1)
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully
# Compute the factor scores
scores <- fa$scores

# Print the factor scores
print(scores)
##                 MR1
##   [1,] -1.365000579
##   [2,] -1.618374215
##   [3,] -1.410625580
##   [4,] -1.158479267
##   [5,] -1.197895665
##   [6,] -1.024869744
##   [7,] -1.095879710
##   [8,] -1.239315460
##   [9,] -1.286492612
##  [10,] -1.345211791
##  [11,] -1.320151654
##  [12,] -0.946525427
##  [13,] -1.472449061
##  [14,] -1.436010546
##  [15,] -1.780371430
##  [16,] -1.107582916
##  [17,] -1.530714521
##  [18,] -1.427761313
##  [19,] -1.295542763
##  [20,] -1.049478634
##  [21,] -1.319050750
##  [22,] -1.196179863
##  [23,] -1.371082763
##  [24,] -1.341780189
##  [25,] -0.567141844
##  [26,] -1.448616246
##  [27,] -1.238375734
##  [28,] -1.321703804
##  [29,] -1.532105494
##  [30,] -1.031241997
##  [31,] -1.198346912
##  [32,] -1.697494607
##  [33,] -0.755300101
##  [34,] -1.110074793
##  [35,] -1.407972525
##  [36,] -1.786580033
##  [37,] -1.824119451
##  [38,] -1.051970511
##  [39,] -1.329013311
##  [40,] -1.322479880
##  [41,] -1.471058088
##  [42,] -2.062521929
##  [43,] -1.161132321
##  [44,] -1.279956708
##  [45,] -0.606394592
##  [46,] -1.597970529
##  [47,] -0.860256706
##  [48,] -1.200999966
##  [49,] -1.236987234
##  [50,] -1.449717149
##  [51,]  0.223144562
##  [52,]  0.406447957
##  [53,]  0.412530141
##  [54,] -0.107321224
##  [55,]  0.113982752
##  [56,]  0.778358383
##  [57,]  0.763714526
##  [58,] -0.221340368
##  [59,]  0.240280296
##  [60,]  0.288712086
##  [61,] -0.387344379
##  [62,]  0.275005483
##  [63,] -0.418801614
##  [64,]  0.719802854
##  [65,] -0.192687450
##  [66,]  0.009313744
##  [67,]  0.903882324
##  [68,]  0.293690894
##  [69,] -0.266628153
##  [70,] -0.023544379
##  [71,]  1.013371435
##  [72,] -0.186605266
##  [73,]  0.407873689
##  [74,]  0.761383827
##  [75,]  0.027225552
##  [76,]  0.008537668
##  [77,]  0.180172616
##  [78,]  0.495858211
##  [79,]  0.487284151
##  [80,] -0.465852346
##  [81,] -0.150781649
##  [82,] -0.214482109
##  [83,] -0.084752963
##  [84,]  1.015409592
##  [85,]  1.070211163
##  [86,]  0.844225891
##  [87,]  0.325936592
##  [88,] -0.266791803
##  [89,]  0.523559016
##  [90,]  0.060559766
##  [91,]  0.713105772
##  [92,]  0.677282155
##  [93,] -0.042232263
##  [94,] -0.388445283
##  [95,]  0.398198725
##  [96,]  0.629616524
##  [97,]  0.482915295
##  [98,]  0.193554392
##  [99,] -0.745873029
## [100,]  0.272513606
## [101,]  1.842863443
## [102,]  0.993456228
## [103,]  1.050308344
## [104,]  1.440581826
## [105,]  1.360072933
## [106,]  1.519714606
## [107,]  0.940809318
## [108,]  1.494165989
## [109,]  1.025084556
## [110,]  1.472666346
## [111,]  0.768247032
## [112,]  0.747392099
## [113,]  0.793956826
## [114,]  0.719517730
## [115,]  0.763593052
## [116,]  0.916051637
## [117,]  1.231732287
## [118,]  2.171774606
## [119,]  1.354650320
## [120,]  0.532006657
## [121,]  1.006074316
## [122,]  0.928042440
## [123,]  1.457891124
## [124,]  0.387472476
## [125,]  1.381865119
## [126,]  1.449768311
## [127,]  0.428116197
## [128,]  0.805622800
## [129,]  1.085194709
## [130,]  1.154486401
## [131,]  1.011377952
## [132,]  1.751583652
## [133,]  1.022433974
## [134,]  0.912617562
## [135,]  1.606132117
## [136,]  0.678722747
## [137,]  1.483719895
## [138,]  1.398837202
## [139,]  0.762326026
## [140,]  0.668271707
## [141,]  0.899240732
## [142,]  0.163366656
## [143,]  0.993456228
## [144,]  1.342161124
## [145,]  1.130822182
## [146,]  0.372216195
## [147,]  0.283291946
## [148,]  0.726827236
## [149,]  1.376722661
## [150,]  1.224874028

In this code, we first load the iris dataset in R. We then perform a factor analysis on the first four columns of the dataset (the measurements of sepal length, sepal width, petal length, and petal width) using the fa() function, extracting a single factor (note that fa() names this argument nfactors).

Next, we take the factor scores from the $scores component of the fitted fa object, which holds the estimated factor score for each observation in the dataset.

Finally, we print the factor scores to the console using the print() function. The output of the code will be a matrix with the same number of rows as the original dataset, and with a number of columns equal to the number of extracted factors. Each row represents an observation in the dataset, and each column represents the estimated factor score for that observation on the corresponding factor.

To interpret the results, we can examine the values of the factor scores for each observation and each factor. Higher values indicate a stronger association with the corresponding factor, while lower values indicate a weaker association. We can use these factor scores in subsequent analyses to examine the relationships between the factors and other variables of interest, or to group observations based on their similarity in terms of the underlying factors.

Interpretation of Factor Scores

Interpreting factor scores involves examining the values of the factor scores for each observation and each factor, and determining the degree to which each observation exhibits the characteristics or traits that are captured by the underlying factors. In general, higher factor scores indicate a stronger association with the corresponding factor, while lower factor scores indicate a weaker association. The magnitude of the factor score indicates the degree to which the observation exhibits the characteristics of the factor, while the sign of the factor score indicates the direction of the association (positive or negative).
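One simple way to act on the magnitude and sign of the scores is to rank the observations on a factor. A minimal sketch using the iris factor scores computed above:

# Rows of the iris data most and least strongly associated with the factor (MR1)
head(order(scores[, 1], decreasing = TRUE))  # row numbers of the highest-scoring observations
head(order(scores[, 1]))                     # row numbers of the lowest-scoring observations
range(scores[, 1])                           # overall spread of the scores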

To interpret the factor scores, it’s important to consider the nature of the factors being analyzed and the research question. For example, if the factors represent personality traits, then higher scores on a factor such as extraversion might indicate a more outgoing and sociable personality, while lower scores might indicate a more introverted and reserved personality. Similarly, if the factors represent cognitive abilities, then higher scores on a factor such as verbal ability might indicate greater proficiency in language and communication, while lower scores might indicate lesser proficiency.

It’s important to note that factor scores are estimates and may not be perfectly accurate; their accuracy depends on the quality and reliability of the factor analysis and the observed variables. Additionally, the interpretation of factor scores is subject to the same limitations and assumptions as factor analysis itself, such as the assumptions of linearity and normality. Therefore, it’s important to carefully interpret and use factor scores in conjunction with other measures and analyses to fully understand the underlying factors and their relationships with other variables of interest.

Application of Factor Scores

Factor scores can be used in a variety of ways to explore and understand the relationships between the underlying factors and other variables of interest. Here are some common uses of factor scores:

Group comparisons: Factor scores can be used to compare groups of individuals or items based on their similarity in terms of the underlying factors. For example, if the factors represent personality traits, factor scores can be used to compare groups of individuals with different personality profiles, such as introverts vs. extroverts (a short sketch follows this list).

Correlation analysis: Factor scores can be used in correlation analysis to examine the relationships between the underlying factors and other variables of interest. For example, if the factors represent cognitive abilities, factor scores can be used to examine the relationship between verbal ability and academic achievement.

Regression analysis: Factor scores can be used as predictor variables in regression analysis to predict outcomes of interest. For example, if the factors represent job-related skills, factor scores can be used to predict job performance.

Data reduction: Factor scores can be used as a way to summarize the information in a dataset in terms of the underlying factors. This can be useful for data visualization and exploratory data analysis.
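As a small sketch of the group-comparison use listed above (the grouping variable am, transmission type, is a real column of mtcars; the one-factor model here is refitted purely for illustration):

# Compare mean factor scores across transmission type (am: 0 = automatic, 1 = manual)
library(psych)
data(mtcars)
fit_mt   <- fa(mtcars[, 1:7], nfactors = 1)   # one-factor solution, for illustration only
mt_score <- fit_mt$scores[, 1]                # factor score for each car
tapply(mt_score, mtcars$am, mean)             # mean score within each transmission group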

It’s important to note that the interpretation and use of factor scores depend on the specific research question and the nature of the factors being analyzed. It’s also important to carefully consider the limitations and assumptions of factor analysis and factor scores, and to use them in conjunction with other measures and analyses to fully understand the underlying factors and their relationships with other variables of interest.

Example 1

Computing and interpreting factor scores using the mtcars dataset in R:

# Load the mtcars dataset
data(mtcars)

# Perform a factor analysis on the dataset
# Note: fa() takes 'nfactors', not 'factors', so a single factor is extracted here
fa <- fa(mtcars[,1:7], factors = 2)
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.
# Compute the factor scores
scores <- fa$scores

# Print the factor scores
print(scores)
##                              MR1
## Mazda RX4           -0.400859654
## Mazda RX4 Wag       -0.399070655
## Datsun 710          -1.019259268
## Hornet 4 Drive       0.002044856
## Hornet Sportabout    0.952910339
## Valiant              0.015387669
## Duster 360           1.138559845
## Merc 240D           -1.131530215
## Merc 230            -0.460085700
## Merc 280            -0.210248104
## Merc 280C           -0.021516659
## Merc 450SE           0.575917579
## Merc 450SL           0.661763837
## Merc 450SLC          0.849300879
## Cadillac Fleetwood   1.504335356
## Lincoln Continental  1.405116279
## Chrysler Imperial    1.086993045
## Fiat 128            -1.582301269
## Honda Civic         -1.165129765
## Toyota Corolla      -1.482321118
## Toyota Corona       -0.750033610
## Dodge Challenger     0.780041801
## AMC Javelin          0.959036920
## Camaro Z28           1.167260303
## Pontiac Firebird     0.936245322
## Fiat X1-9           -1.237245141
## Porsche 914-2       -1.242666700
## Lotus Europa        -1.579207720
## Ford Pantera L       1.267445889
## Ferrari Dino        -0.574357986
## Maserati Bora        0.875537593
## Volvo 142E          -0.922063949

In this code, we first load the mtcars dataset in R. We then run a factor analysis on the first seven columns of the dataset (mpg, cyl, disp, hp, drat, wt, and qsec) using the fa() function. Although the call passes factors = 2, fa() names this argument nfactors, so a single factor is extracted and the output contains one column of scores (MR1).

Next, we take the factor scores from the $scores component of the fitted fa object, which holds the estimated score for each car, and we print them to the console using the print() function.

To interpret the results, we can examine the values of the factor scores for each observation and each factor. Higher values indicate a stronger association with the corresponding factor, while lower values indicate a weaker association. We can use these factor scores in subsequent analyses to examine the relationships between the factors and other variables of interest, or to group observations based on their similarity in terms of the underlying factors.

For example, suppose that two factors had been extracted and that the first factor represents vehicle size and power, while the second factor represents fuel efficiency. We could then interpret the factor scores as follows:

Higher scores on the first factor indicate larger and more powerful vehicles, while lower scores indicate smaller and less powerful vehicles.

Higher scores on the second factor indicate more fuel-efficient vehicles, while lower scores indicate less fuel-efficient vehicles.

We can use these interpretations to further explore the relationships between the factors and other variables of interest. For example, we could examine the relationship between the factor scores and the cost of the vehicles, or the satisfaction ratings of the drivers. It’s important to note that the interpretation and use of factor scores depend on the specific research question and the nature of the factors being analyzed. It’s also important to carefully consider the limitations and assumptions of factor analysis and factor scores, and to use them in conjunction with other measures and analyses to fully understand the underlying factors and their relationships with other variables of interest.

Example 2

Computing and interpreting factor scores using the USArrests dataset in R:

# Load the USArrests dataset
data(USArrests)

# Perform a principal component analysis on the dataset
pca <- princomp(USArrests, cor = TRUE)

# Compute the factor scores
scores <- predict(pca, newdata = USArrests)

# Print the factor scores
print(scores)
##                     Comp.1      Comp.2      Comp.3       Comp.4
## Alabama         0.98556588  1.13339238  0.44426879  0.156267145
## Alaska          1.95013775  1.07321326 -2.04000333 -0.438583440
## Arizona         1.76316354 -0.74595678 -0.05478082 -0.834652924
## Arkansas       -0.14142029  1.11979678 -0.11457369 -0.182810896
## California      2.52398013 -1.54293399 -0.59855680 -0.341996478
## Colorado        1.51456286 -0.98755509 -1.09500699  0.001464887
## Connecticut    -1.35864746 -1.08892789  0.64325757 -0.118469414
## Delaware        0.04770931 -0.32535892  0.71863294 -0.881977637
## Florida         3.01304227  0.03922851  0.57682949 -0.096284752
## Georgia         1.63928304  1.27894240  0.34246008  1.076796812
## Hawaii         -0.91265715 -1.57046001 -0.05078189  0.902806864
## Idaho          -1.63979985  0.21097292 -0.25980134 -0.499104101
## Illinois        1.37891072 -0.68184119  0.67749564 -0.122021292
## Indiana        -0.50546136 -0.15156254 -0.22805484  0.424665700
## Iowa           -2.25364607 -0.10405407 -0.16456432  0.017555916
## Kansas         -0.79688112 -0.27016470 -0.02555331  0.206496428
## Kentucky       -0.75085907  0.95844029  0.02836942  0.670556671
## Louisiana       1.56481798  0.87105466  0.78348036  0.454728038
## Maine          -2.39682949  0.37639158  0.06568239 -0.330459817
## Maryland        1.76336939  0.42765519  0.15725013 -0.559069521
## Massachusetts  -0.48616629 -1.47449650  0.60949748 -0.179598963
## Michigan        2.10844115 -0.15539682 -0.38486858  0.102372019
## Minnesota      -1.69268181 -0.63226125 -0.15307043  0.067316885
## Mississippi     0.99649446  2.39379599  0.74080840  0.215508013
## Missouri        0.69678733 -0.26335479 -0.37744383  0.225824461
## Montana        -1.18545191  0.53687437 -0.24688932  0.123742227
## Nebraska       -1.26563654 -0.19395373 -0.17557391  0.015892888
## Nevada          2.87439454 -0.77560020 -1.16338049  0.314515476
## New Hampshire  -2.38391541 -0.01808229 -0.03685539 -0.033137338
## New Jersey      0.18156611 -1.44950571  0.76445355  0.243382700
## New Mexico      1.98002375  0.14284878 -0.18369218 -0.339533597
## New York        1.68257738 -0.82318414  0.64307509 -0.013484369
## North Carolina  1.12337861  2.22800338  0.86357179 -0.954381667
## North Dakota   -2.99222562  0.59911882 -0.30127728 -0.253987327
## Ohio           -0.22596542 -0.74223824  0.03113912  0.473915911
## Oklahoma       -0.31178286 -0.28785421  0.01530979  0.010332321
## Oregon          0.05912208 -0.54141145 -0.93983298 -0.237780688
## Pennsylvania   -0.88841582 -0.57110035  0.40062871  0.359061124
## Rhode Island   -0.86377206 -1.49197842  1.36994570 -0.613569430
## South Carolina  1.32072380  1.93340466  0.30053779 -0.131466685
## South Dakota   -1.98777484  0.82334324 -0.38929333 -0.109571764
## Tennessee       0.99974168  0.86025130 -0.18808295  0.652864291
## Texas           1.35513821 -0.41248082  0.49206886  0.643195491
## Utah           -0.55056526 -1.47150461 -0.29372804 -0.082314047
## Vermont        -2.80141174  1.40228806 -0.84126309 -0.144889914
## Virginia       -0.09633491  0.19973529 -0.01171254  0.211370813
## Washington     -0.21690338 -0.97012418 -0.62487094 -0.220847793
## West Virginia  -2.10858541  1.42484670 -0.10477467  0.131908831
## Wisconsin      -2.07971417 -0.61126862  0.13886500  0.184103743
## Wyoming        -0.62942666  0.32101297  0.24065923 -0.166651801

In this code, we first load the USArrests dataset in R. We then perform a principal component analysis on the dataset using the princomp() function, specifying that we want to use the correlation matrix (cor = TRUE). Next, we compute the factor scores using the predict() function, which takes the pca object and the original data as input, and outputs the estimated factor scores for each observation in the dataset.

Finally, we print the factor scores to the console using the print() function. To interpret the results, we can examine the values of the factor scores for each observation and each factor. Higher values indicate a stronger association with the corresponding factor, while lower values indicate a weaker association. We can use these factor scores in subsequent analyses to examine the relationships between the factors and other variables of interest, or to group observations based on their similarity in terms of the underlying factors.

For example, let’s say that the first factor represents overall crime rate, while the second factor represents violent crime rate. We can interpret the factor scores as follows: Higher scores on the first factor indicate a higher overall crime rate in the state, while lower scores indicate a lower overall crime rate. Higher scores on the second factor indicate a higher violent crime rate in the state, while lower scores indicate a lower violent crime rate.

We can use these interpretations to further explore the relationships between the factors and other variables of interest. For example, we could examine the relationship between the factor scores and the state’s population density, or the state’s poverty rate.
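As a small, purely illustrative check, UrbanPop happens to be one of the columns of USArrests, so its correlation with each component score can be computed directly (in a real analysis one would use a variable that did not enter the PCA):

# Correlation of each component score with urban population
round(cor(scores, USArrests$UrbanPop), 2)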

It’s important to note that the interpretation and use of factor scores depend on the specific research question and the nature of the factors being analyzed. It’s also important to carefully consider the limitations and assumptions of factor analysis and factor scores, and to use them in conjunction with other measures and analyses to fully understand the underlying factors and their relationships with other variables of interest.

Practical Class 1

Exploratory Factor Analysis in R

# Dataset
url <- "https://raw.githubusercontent.com/housecricket/data/main/efa/sample1.csv"
data_survey <- read.csv(url, sep = ",")
# Describe data set
library(psych)
describe(data_survey)
##     vars   n   mean    sd median trimmed   mad min max range  skew kurtosis
## ID     1 259 130.00 74.91    130  130.00 96.37   1 259   258  0.00    -1.21
## KM1    2 259   3.31  1.29      3    3.39  1.48   1   5     4 -0.27    -0.89
## KM2    3 259   3.36  1.37      3    3.45  1.48   1   5     4 -0.35    -1.02
## KM3    4 259   3.40  1.31      3    3.49  1.48   1   5     4 -0.38    -0.90
## QC1    5 259   3.78  1.35      4    3.96  1.48   1   5     4 -0.93    -0.50
## QC2    6 259   3.41  1.17      3    3.48  1.48   1   5     4 -0.30    -0.62
## QC3    7 259   3.34  1.23      3    3.42  1.48   1   5     4 -0.27    -0.81
## CT1    8 259   3.69  1.12      4    3.80  1.48   1   5     4 -0.54    -0.24
## CT2    9 259   3.73  1.08      4    3.84  1.48   1   5     4 -0.56    -0.21
## CT3   10 259   3.36  1.23      3    3.44  1.48   1   5     4 -0.28    -0.73
## PC1   11 259   3.42  1.20      3    3.51  1.48   1   5     4 -0.40    -0.59
## PC2   12 259   3.09  1.34      3    3.11  1.48   1   5     4 -0.13    -1.05
## PC3   13 259   3.71  1.27      4    3.87  1.48   1   5     4 -0.72    -0.49
## QD    14 259   3.55  1.05      4    3.61  1.48   1   5     4 -0.33    -0.39
##       se
## ID  4.65
## KM1 0.08
## KM2 0.09
## KM3 0.08
## QC1 0.08
## QC2 0.07
## QC3 0.08
## CT1 0.07
## CT2 0.07
## CT3 0.08
## PC1 0.07
## PC2 0.08
## PC3 0.08
## QD  0.07
# Dimensions of the data
dim(data_survey)
## [1] 259  14
# Cleaning Data
dat <- data_survey[ , -1] 
head(dat)
##   KM1 KM2 KM3 QC1 QC2 QC3 CT1 CT2 CT3 PC1 PC2 PC3 QD
## 1   5   5   5   5   2   1   1   3   1   4   1   3  4
## 2   3   3   3   4   5   3   4   5   4   2   2   2  4
## 3   2   2   2   2   2   1   3   3   3   4   3   5  2
## 4   4   3   3   4   3   4   4   4   4   1   1   3  3
## 5   4   4   4   2   3   4   4   4   4   3   3   5  4
## 6   1   1   1   2   5   3   5   5   5   4   3   5  3
# Correlation Matrix
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.3
## corrplot 0.92 loaded
datamatrix <- cor(dat[,c(-13)])
corrplot(datamatrix, method="number")

#The Factorability of the Data
X <- dat[,-c(13)]
Y <- dat[,13]
#KMO(Kaiser-Meyer-Olkin)
KMO(r=cor(X))
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(X))
## Overall MSA =  0.83
## MSA for each item = 
##  KM1  KM2  KM3  QC1  QC2  QC3  CT1  CT2  CT3  PC1  PC2  PC3 
## 0.87 0.84 0.82 0.88 0.86 0.86 0.83 0.85 0.85 0.70 0.78 0.80
#Bartlett’s Test of Sphericity
cortest.bartlett(X)
## R was not square, finding R from data
## $chisq
## [1] 1595.75
## 
## $p.value
## [1] 8.846246e-290
## 
## $df
## [1] 66
det(cor(X))
## [1] 0.00183051
# The Number of Factors to Extract
#Scree Plot

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.3
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
fafitfree <- fa(dat,nfactors = ncol(X), rotate = "none")
n_factors <- length(fafitfree$e.values)
scree     <- data.frame(
  Factor_n =  as.factor(1:n_factors), 
  Eigenvalue = fafitfree$e.values)
ggplot(scree, aes(x = Factor_n, y = Eigenvalue, group = 1)) + 
  geom_point() + geom_line() +
  xlab("Number of factors") +
  ylab("Initial eigenvalue") +
  labs( title = "Scree Plot", 
        subtitle = "(Based on the unreduced correlation matrix)")

# Parallel Analysis
parallel <- fa.parallel(X)
## Warning in fa.stats(r = r, f = f, phi = phi, n.obs = n.obs, np.obs = np.obs, :
## The estimated weights for the factor scores are probably incorrect.  Try a
## different factor score estimation method.
## Warning in fac(r = r, nfactors = nfactors, n.obs = n.obs, rotate = rotate, : An
## ultra-Heywood case was detected.  Examine the results carefully

## Parallel analysis suggests that the number of factors =  4  and the number of components =  3
# Conducting the Factor Analysis
#Factor analysis using fa method

fa.none <- fa(r=X, 
              nfactors = 4, 
              # covar = FALSE, SMC = TRUE,
              fm="pa", # type of factor analysis we want to use ("pa" is principal axis factoring)
              max.iter=100, # 50 is the default, but we have changed it to 100
              rotate="varimax") # varimax rotation
print(fa.none)
## Factor Analysis using method =  pa
## Call: fa(r = X, nfactors = 4, rotate = "varimax", max.iter = 100, fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##      PA1   PA3  PA2   PA4   h2   u2 com
## KM1 0.80  0.13 0.15  0.24 0.73 0.27 1.3
## KM2 0.82  0.22 0.08  0.20 0.76 0.24 1.3
## KM3 0.88  0.21 0.17  0.18 0.88 0.12 1.3
## QC1 0.31  0.06 0.16  0.56 0.43 0.57 1.8
## QC2 0.23  0.34 0.01  0.74 0.71 0.29 1.6
## QC3 0.13  0.35 0.13  0.66 0.59 0.41 1.7
## CT1 0.15  0.72 0.08  0.14 0.56 0.44 1.2
## CT2 0.19  0.75 0.10  0.21 0.65 0.35 1.3
## CT3 0.15  0.69 0.15  0.25 0.59 0.41 1.5
## PC1 0.18 -0.03 0.82  0.10 0.71 0.29 1.1
## PC2 0.13  0.11 0.78  0.22 0.68 0.32 1.3
## PC3 0.04  0.23 0.67 -0.03 0.51 0.49 1.2
## 
##                        PA1  PA3  PA2  PA4
## SS loadings           2.38 1.97 1.86 1.60
## Proportion Var        0.20 0.16 0.16 0.13
## Cumulative Var        0.20 0.36 0.52 0.65
## Proportion Explained  0.30 0.25 0.24 0.20
## Cumulative Proportion 0.30 0.56 0.80 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 4 factors are sufficient.
## 
## The degrees of freedom for the null model are  66  and the objective function was  6.3 with Chi Square of  1595.75
## The degrees of freedom for the model are 24  and the objective function was  0.15 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  259 with the empirical chi square  10.85  with prob <  0.99 
## The total number of observations was  259  with Likelihood Chi Square =  38.29  with prob <  0.032 
## 
## Tucker Lewis Index of factoring reliability =  0.974
## RMSEA index =  0.048  and the 90 % confidence intervals are  0.014 0.075
## BIC =  -95.07
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    PA1  PA3  PA2  PA4
## Correlation of (regression) scores with factors   0.95 0.87 0.91 0.84
## Multiple R square of scores with factors          0.89 0.76 0.83 0.71
## Minimum correlation of possible factor scores     0.79 0.53 0.65 0.43
#Factor analysis using the factanal method
factanal.none <- factanal(X, factors=4, scores = c("regression"), rotation = "varimax")
print(factanal.none)
## 
## Call:
## factanal(x = X, factors = 4, scores = c("regression"), rotation = "varimax")
## 
## Uniquenesses:
##   KM1   KM2   KM3   QC1   QC2   QC3   CT1   CT2   CT3   PC1   PC2   PC3 
## 0.272 0.238 0.117 0.577 0.278 0.403 0.451 0.355 0.405 0.269 0.332 0.497 
## 
## Loadings:
##     Factor1 Factor2 Factor3 Factor4
## KM1  0.794   0.132   0.152   0.238 
## KM2  0.820   0.212           0.197 
## KM3  0.885   0.209   0.165   0.169 
## QC1  0.321           0.162   0.538 
## QC2  0.232   0.335           0.745 
## QC3  0.125   0.338   0.138   0.669 
## CT1  0.146   0.704           0.155 
## CT2  0.189   0.747           0.205 
## CT3  0.149   0.703   0.140   0.242 
## PC1  0.179           0.829   0.104 
## PC2  0.132   0.121   0.770   0.206 
## PC3          0.234   0.668         
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      2.388   1.958   1.864   1.595
## Proportion Var   0.199   0.163   0.155   0.133
## Cumulative Var   0.199   0.362   0.517   0.650
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 37.66 on 24 degrees of freedom.
## The p-value is 0.0376
#Graph Factor Loading Matrices
fa.diagram(fa.none)

#Regression analysis
#Scores for all the rows
head(fa.none$scores)
##              PA1        PA3         PA2         PA4
## [1,]  2.01452365 -1.9781194 -0.47865733 -0.99954894
## [2,] -0.38088283  0.7697328 -1.27812282  0.71255859
## [3,] -0.88620583 -0.3131598  0.69972250 -1.42078948
## [4,] -0.02336401  0.5722214 -1.64788160 -0.02484005
## [5,]  0.45117677  0.5996335  0.01897338 -0.59929794
## [6,] -2.36524692  1.6224651  0.50346364  0.30393048
# Combine the outcome variable QD with the factor scores
regdata <- cbind(dat["QD"], fa.none$scores)
#Labeling the data
names(regdata) <- c("QD", "F1", "F2",
                    "F3", "F4")
head(regdata)
##   QD          F1         F2          F3          F4
## 1  4  2.01452365 -1.9781194 -0.47865733 -0.99954894
## 2  4 -0.38088283  0.7697328 -1.27812282  0.71255859
## 3  2 -0.88620583 -0.3131598  0.69972250 -1.42078948
## 4  3 -0.02336401  0.5722214 -1.64788160 -0.02484005
## 5  4  0.45117677  0.5996335  0.01897338 -0.59929794
## 6  3 -2.36524692  1.6224651  0.50346364  0.30393048
#Splitting the data to train and test set
#Splitting the data 70:30
#Random number generator, set seed.
set.seed(100)
indices= sample(1:nrow(regdata), 0.7*nrow(regdata))
train=regdata[indices,]
test = regdata[-indices,]

# Regression Model using train data
model.fa.score = lm(QD ~ ., data = train)
summary(model.fa.score)
## 
## Call:
## lm(formula = QD ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2041 -0.2225 -0.0894  0.2571  1.4851 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.55854    0.02932 121.387  < 2e-16 ***
## F1           0.72846    0.03028  24.058  < 2e-16 ***
## F2           0.28709    0.03382   8.489 8.43e-15 ***
## F3           0.07357    0.03061   2.403   0.0173 *  
## F4           0.62655    0.03494  17.932  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3935 on 176 degrees of freedom
## Multiple R-squared:  0.8703, Adjusted R-squared:  0.8674 
## F-statistic: 295.4 on 4 and 176 DF,  p-value: < 2.2e-16
# Check variance inflation factors (VIF) for multicollinearity
library(regclass)
## Warning: package 'regclass' was built under R version 4.1.3
## Loading required package: bestglm
## Warning: package 'bestglm' was built under R version 4.1.3
## Loading required package: leaps
## Loading required package: VGAM
## Warning: package 'VGAM' was built under R version 4.1.3
## Loading required package: stats4
## Loading required package: splines
## 
## Attaching package: 'VGAM'
## The following objects are masked from 'package:psych':
## 
##     fisherz, logistic, logit
## Loading required package: rpart
## Loading required package: randomForest
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:psych':
## 
##     outlier
## Important regclass change from 1.3:
## All functions that had a . in the name now have an _
## all.correlations -> all_correlations, cor.demo -> cor_demo, etc.
VIF(model.fa.score)
##       F1       F2       F3       F4 
## 1.009786 1.056645 1.001647 1.063940
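The VIF values are all close to 1, as expected: regression scores from a varimax-rotated solution are nearly orthogonal. A quick optional check (not in the original output) is to inspect the correlation matrix of the factor-score predictors:

# Correlations among the factor-score predictors (should be close to zero)
round(cor(train[, c("F1", "F2", "F3", "F4")]), 2)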
# Check the predictions of the model on the test dataset
# (model performance metrics are summarised after the predictions below)
pred_test <- predict(model.fa.score, newdata = test, type = "response")
pred_test
##        3        6        9       17       19       23       28       29 
## 1.984360 2.528829 4.143840 3.323313 3.476094 3.246743 4.366091 5.200139 
##       33       34       36       38       54       56       57       59 
## 1.462812 3.579119 2.046164 4.822811 3.260452 3.703757 2.226866 3.281367 
##       61       63       66       67       83       86       89       90 
## 3.279521 3.994264 3.638951 3.638951 4.204144 4.071425 2.381443 3.260452 
##       96       99      101      102      104      109      113      114 
## 3.200268 4.123159 3.105286 3.776780 4.300589 2.672036 5.145143 4.049866 
##      117      120      122      125      128      129      134      136 
## 3.161295 3.125512 3.071635 3.718316 3.257887 3.161295 3.260452 1.267189 
##      138      139      141      144      145      149      161      162 
## 3.260452 3.149997 3.161576 4.073711 3.554128 4.252041 4.559893 4.368536 
##      166      168      173      176      177      178      181      184 
## 4.111750 3.768509 3.260452 3.685362 2.249302 3.371394 4.354512 4.732477 
##      186      190      195      201      204      208      209      217 
## 2.260460 3.735488 1.739988 3.198181 5.089403 3.260452 3.836492 4.001907 
##      220      221      223      233      240      241      244      245 
## 4.237114 3.298169 5.089403 4.444631 3.323962 5.187997 3.652735 2.992014 
##      248      250      252      255      257      258 
## 2.719640 3.625425 3.149997 2.984902 4.771091 2.630307
test$QD_Predicted <- pred_test
head(test[c("QD","QD_Predicted")], 10)
##    QD QD_Predicted
## 3   2     1.984360
## 6   3     2.528829
## 9   4     4.143840
## 17  3     3.323313
## 19  3     3.476094
## 23  3     3.246743
## 28  4     4.366091
## 29  5     5.200139
## 33  1     1.462812
## 34  4     3.579119
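The test-set performance of the factor-score regression can be summarised with simple metrics. The following is a minimal sketch (not part of the original output), using the pred_test vector computed above:

# Test-set RMSE and R-squared for the factor-score regression
rmse_test <- sqrt(mean((test$QD - pred_test)^2))
r2_test   <- 1 - sum((test$QD - pred_test)^2) / sum((test$QD - mean(test$QD))^2)
rmse_test
r2_test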

Practical Class 2

Principal Component Analysis in R

# Principal Component Analysis in R using the mtcars dataset
# mtcars consists of 32 car models measured on 11 variables.

# Loading Data
data(mtcars)
 
# Apply PCA using the prcomp() function
# Variables are centred and scaled (normalised)
# because PCA is sensitive to the scale of the variables
my_pca <- prcomp(mtcars, scale = TRUE,
                center = TRUE, retx = T)
names(my_pca)
## [1] "sdev"     "rotation" "center"   "scale"    "x"
# Summary
summary(my_pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     2.5707 1.6280 0.79196 0.51923 0.47271 0.46000 0.3678
## Proportion of Variance 0.6008 0.2409 0.05702 0.02451 0.02031 0.01924 0.0123
## Cumulative Proportion  0.6008 0.8417 0.89873 0.92324 0.94356 0.96279 0.9751
##                            PC8    PC9    PC10   PC11
## Standard deviation     0.35057 0.2776 0.22811 0.1485
## Proportion of Variance 0.01117 0.0070 0.00473 0.0020
## Cumulative Proportion  0.98626 0.9933 0.99800 1.0000
my_pca
## Standard deviations (1, .., p=11):
##  [1] 2.5706809 1.6280258 0.7919579 0.5192277 0.4727061 0.4599958 0.3677798
##  [8] 0.3505730 0.2775728 0.2281128 0.1484736
## 
## Rotation (n x k) = (11 x 11):
##             PC1         PC2         PC3          PC4         PC5         PC6
## mpg  -0.3625305  0.01612440 -0.22574419 -0.022540255  0.10284468 -0.10879743
## cyl   0.3739160  0.04374371 -0.17531118 -0.002591838  0.05848381  0.16855369
## disp  0.3681852 -0.04932413 -0.06148414  0.256607885  0.39399530 -0.33616451
## hp    0.3300569  0.24878402  0.14001476 -0.067676157  0.54004744  0.07143563
## drat -0.2941514  0.27469408  0.16118879  0.854828743  0.07732727  0.24449705
## wt    0.3461033 -0.14303825  0.34181851  0.245899314 -0.07502912 -0.46493964
## qsec -0.2004563 -0.46337482  0.40316904  0.068076532 -0.16466591 -0.33048032
## vs   -0.3065113 -0.23164699  0.42881517 -0.214848616  0.59953955  0.19401702
## am   -0.2349429  0.42941765 -0.20576657 -0.030462908  0.08978128 -0.57081745
## gear -0.2069162  0.46234863  0.28977993 -0.264690521  0.04832960 -0.24356284
## carb  0.2140177  0.41357106  0.52854459 -0.126789179 -0.36131875  0.18352168
##               PC7          PC8          PC9        PC10         PC11
## mpg   0.367723810 -0.754091423  0.235701617  0.13928524 -0.124895628
## cyl   0.057277736 -0.230824925  0.054035270 -0.84641949 -0.140695441
## disp  0.214303077  0.001142134  0.198427848  0.04937979  0.660606481
## hp   -0.001495989 -0.222358441 -0.575830072  0.24782351 -0.256492062
## drat  0.021119857  0.032193501 -0.046901228 -0.10149369 -0.039530246
## wt   -0.020668302 -0.008571929  0.359498251  0.09439426 -0.567448697
## qsec  0.050010522 -0.231840021 -0.528377185 -0.27067295  0.181361780
## vs   -0.265780836  0.025935128  0.358582624 -0.15903909  0.008414634
## am   -0.587305101 -0.059746952 -0.047403982 -0.17778541  0.029823537
## gear  0.605097617  0.336150240 -0.001735039 -0.21382515 -0.053507085
## carb -0.174603192 -0.395629107  0.170640677  0.07225950  0.319594676
# View the principal component loadings
# my_pca$rotation[1:5, 1:4]
my_pca$rotation
##             PC1         PC2         PC3          PC4         PC5         PC6
## mpg  -0.3625305  0.01612440 -0.22574419 -0.022540255  0.10284468 -0.10879743
## cyl   0.3739160  0.04374371 -0.17531118 -0.002591838  0.05848381  0.16855369
## disp  0.3681852 -0.04932413 -0.06148414  0.256607885  0.39399530 -0.33616451
## hp    0.3300569  0.24878402  0.14001476 -0.067676157  0.54004744  0.07143563
## drat -0.2941514  0.27469408  0.16118879  0.854828743  0.07732727  0.24449705
## wt    0.3461033 -0.14303825  0.34181851  0.245899314 -0.07502912 -0.46493964
## qsec -0.2004563 -0.46337482  0.40316904  0.068076532 -0.16466591 -0.33048032
## vs   -0.3065113 -0.23164699  0.42881517 -0.214848616  0.59953955  0.19401702
## am   -0.2349429  0.42941765 -0.20576657 -0.030462908  0.08978128 -0.57081745
## gear -0.2069162  0.46234863  0.28977993 -0.264690521  0.04832960 -0.24356284
## carb  0.2140177  0.41357106  0.52854459 -0.126789179 -0.36131875  0.18352168
##               PC7          PC8          PC9        PC10         PC11
## mpg   0.367723810 -0.754091423  0.235701617  0.13928524 -0.124895628
## cyl   0.057277736 -0.230824925  0.054035270 -0.84641949 -0.140695441
## disp  0.214303077  0.001142134  0.198427848  0.04937979  0.660606481
## hp   -0.001495989 -0.222358441 -0.575830072  0.24782351 -0.256492062
## drat  0.021119857  0.032193501 -0.046901228 -0.10149369 -0.039530246
## wt   -0.020668302 -0.008571929  0.359498251  0.09439426 -0.567448697
## qsec  0.050010522 -0.231840021 -0.528377185 -0.27067295  0.181361780
## vs   -0.265780836  0.025935128  0.358582624 -0.15903909  0.008414634
## am   -0.587305101 -0.059746952 -0.047403982 -0.17778541  0.029823537
## gear  0.605097617  0.336150240 -0.001735039 -0.21382515 -0.053507085
## carb -0.174603192 -0.395629107  0.170640677  0.07225950  0.319594676
# See the principal component scores
dim(my_pca$x)
## [1] 32 11
my_pca$x
##                               PC1        PC2        PC3          PC4
## Mazda RX4           -0.6468627420  1.7081142 -0.5917309  0.113702214
## Mazda RX4 Wag       -0.6194831460  1.5256219 -0.3763013  0.199121210
## Datsun 710          -2.7356242748 -0.1441501 -0.2374391 -0.245215450
## Hornet 4 Drive      -0.3068606268 -2.3258038 -0.1336213 -0.503800355
## Hornet Sportabout    1.9433926844 -0.7425211 -1.1165366  0.074461963
## Valiant             -0.0552534228 -2.7421229  0.1612456 -0.975167425
## Duster 360           2.9553851233  0.3296133 -0.3570461 -0.051529216
## Merc 240D           -2.0229593244 -1.4421056  0.9290295 -0.142129082
## Merc 230            -2.2513839535 -1.9522879  1.7689364  0.287210957
## Merc 280            -0.5180912217 -0.1594610  1.4692603  0.066263362
## Merc 280C           -0.5011860079 -0.3187934  1.6570701  0.094357222
## Merc 450SE           2.2124096339 -0.6727099 -0.3694707 -0.129797905
## Merc 450SL           2.0155715693 -0.6724606 -0.4768341 -0.210991001
## Merc 450SLC          2.1147047372 -0.7891129 -0.2904620 -0.175332868
## Cadillac Fleetwood   3.8383725118 -0.8149087  0.6370972  0.290505877
## Lincoln Continental  3.8918495626 -0.7218314  0.7092612  0.405336898
## Chrysler Imperial    3.5363862158 -0.4145024  0.5402468  0.665665306
## Fiat 128            -3.7955510831 -0.2920783 -0.4161681  0.055191058
## Honda Civic         -4.1870356784  0.6775721 -0.2035831  1.167526096
## Toyota Corolla      -4.1675359344 -0.2748890 -0.4589124  0.183313028
## Toyota Corona       -1.8741790870 -2.0864529  0.1543265  0.050514126
## Dodge Challenger     2.1504414942 -0.9982442 -1.1503639 -0.584982249
## AMC Javelin          1.8340369797 -0.8921886 -0.9472872  0.005694071
## Camaro Z28           2.8434957523  0.6701037 -0.1605593  0.814340105
## Pontiac Firebird     2.2105479148 -0.8600504 -1.0279577  0.146420497
## Fiat X1-9           -3.5176818134 -0.1192950 -0.4464716 -0.013427353
## Porsche 914-2       -2.6095003965  2.0141425 -0.8172519  0.568564789
## Lotus Europa        -3.3323844512  1.3568877 -0.4467167 -1.153197531
## Ford Pantera L       1.3513346957  3.4448780 -0.1343943  0.590098358
## Ferrari Dino        -0.0009743305  3.1683750  0.3957610 -0.938933017
## Maserati Bora        2.6270897605  4.3107016  1.3315940 -0.877332804
## Volvo 142E          -2.3824711412  0.2299603  0.4052798  0.223549117
##                              PC5           PC6         PC7          PC8
## Mazda RX4           -0.945523363 -0.0169873733 -0.42648652 -0.009631217
## Mazda RX4 Wag       -1.016680740 -0.2417246434 -0.41620046 -0.084520213
## Datsun 710           0.398762288 -0.3487678138 -0.60884146  0.585255765
## Hornet 4 Drive       0.549208936  0.0192969984 -0.04036075 -0.049583029
## Hornet Sportabout    0.207515698  0.1491927606  0.38350816 -0.160297757
## Valiant              0.211665375 -0.2438358546 -0.29464160  0.256612420
## Duster 360           0.343847875  0.7126920868 -0.13607792 -0.171103449
## Merc 240D           -0.316651386 -0.0009889391  0.63946214  0.163156195
## Merc 230            -0.333682355 -0.3338703384  0.62201034 -0.105779936
## Merc 280            -0.069624161  0.8165308365  0.16117090  0.099983313
## Merc 280C           -0.148803650  0.7308383757  0.09254430  0.197306566
## Merc 450SE          -0.378611141  0.1317014762 -0.01645498 -0.194092435
## Merc 450SL          -0.355611763  0.2400263805  0.05123623 -0.329669990
## Merc 450SLC         -0.432140303  0.1801997325 -0.06675316 -0.119252582
## Cadillac Fleetwood  -0.048245223 -0.8844735483 -0.16615296  0.138398783
## Lincoln Continental  0.003899176 -0.8625868981 -0.19250873  0.129305868
## Chrysler Imperial    0.208027112 -0.6536447300  0.03449804 -0.391104141
## Fiat 128             0.219981109 -0.4675796343 -0.03749941 -0.625278746
## Honda Civic          0.097674091  0.5180554279 -0.25316291 -0.395045565
## Toyota Corolla       0.222152228 -0.3171521124  0.06617540 -0.853947085
## Toyota Corona        0.039299002  0.7236992559 -0.28027808  0.207237627
## Dodge Challenger    -0.226237802  0.1062181942  0.09489585  0.316055390
## AMC Javelin         -0.252565496  0.2888101997  0.08161916  0.321900593
## Camaro Z28           0.389118986  0.9468795171 -0.21157976  0.038657331
## Pontiac Firebird     0.299261925 -0.1983310387  0.47269865 -0.234144182
## Fiat X1-9            0.206753365 -0.1449905641 -0.35850305  0.089109764
## Porsche 914-2       -0.597313744 -0.3394265065  0.82032965  0.634987241
## Lotus Europa         0.694667640  0.0165037718  0.51018011  0.004140777
## Ford Pantera L       1.101648091 -0.1746156635  0.41358868  0.609167214
## Ferrari Dino        -0.848833976 -0.0097569921  0.02967883  0.014187801
## Maserati Bora        0.455265189 -0.0156094416 -0.18813730 -0.558646792
## Volvo 142E           0.321777017 -0.3263029217 -0.77995741  0.476634473
##                             PC9        PC10         PC11
## Mazda RX4            0.14642303 -0.06670350  0.179693570
## Mazda RX4 Wag        0.07452829 -0.12692766  0.088644265
## Datsun 710          -0.13122859  0.04573787 -0.094632914
## Hornet 4 Drive       0.22021812 -0.06039981  0.147611269
## Hornet Sportabout   -0.02117623 -0.05983003  0.146406899
## Valiant             -0.03222907 -0.20165466  0.019545064
## Duster 360          -0.17844547  0.36086641  0.171863162
## Merc 240D            0.37698418  0.29086529 -0.019090358
## Merc 230            -0.86455356 -0.11597058  0.159688512
## Merc 280             0.54092449 -0.22093750 -0.124486227
## Merc 280C            0.30876072 -0.34417564 -0.034578568
## Merc 450SE          -0.05614966 -0.06531727 -0.396445135
## Merc 450SL          -0.20501055 -0.10761308 -0.197616838
## Merc 450SLC         -0.38704169 -0.21191036 -0.142498830
## Cadillac Fleetwood   0.19333387  0.06184979  0.262886205
## Lincoln Continental  0.19523562  0.12094849  0.039191100
## Chrysler Imperial    0.27447514  0.27588169 -0.224420191
## Fiat 128             0.10550311 -0.02717077 -0.208865888
## Honda Civic          0.23711675 -0.15433928  0.246835364
## Toyota Corolla      -0.11313627 -0.12606845 -0.031747839
## Toyota Corona       -0.44646972  0.51147635  0.063679725
## Dodge Challenger     0.10435633 -0.13641143  0.049594456
## AMC Javelin         -0.12237636 -0.29628634  0.045293027
## Camaro Z28          -0.05282991  0.32624525 -0.099386307
## Pontiac Firebird     0.20849043  0.01547674  0.122593248
## Fiat X1-9           -0.02228967 -0.08414018 -0.005746448
## Porsche 914-2       -0.12999660  0.34968156 -0.111596656
## Lotus Europa         0.29680350  0.23980308  0.030015592
## Ford Pantera L      -0.23280792 -0.50262890 -0.042242570
## Ferrari Dino         0.09813571  0.14491815  0.043006835
## Maserati Bora       -0.34081133  0.04706368  0.062135486
## Volvo 142E          -0.04473670  0.11767108 -0.145329008
# Plotting the resultant principal components
# The parameter scale = 0 ensures that arrows
# are scaled to represent the loadings
biplot(my_pca, main = "Biplot", scale = 0)
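If the biplot is cluttered, the observations alone can be plotted in the space of the first two components. A small optional sketch (not in the original code):

# Plot the scores on the first two principal components, labelled by car model
plot(my_pca$x[, 1], my_pca$x[, 2],
     xlab = "PC1", ylab = "PC2", main = "Scores on PC1 and PC2")
text(my_pca$x[, 1], my_pca$x[, 2], labels = rownames(mtcars), cex = 0.6, pos = 3)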

# Compute standard deviation
my_pca$sdev
##  [1] 2.5706809 1.6280258 0.7919579 0.5192277 0.4727061 0.4599958 0.3677798
##  [8] 0.3505730 0.2775728 0.2281128 0.1484736
# Compute variance
my_pca.var <- my_pca$sdev ^ 2
my_pca.var
##  [1] 6.60840025 2.65046789 0.62719727 0.26959744 0.22345110 0.21159612
##  [7] 0.13526199 0.12290143 0.07704665 0.05203544 0.02204441
# Proportion of variance for a scree plot
propve <- my_pca.var / sum(my_pca.var)
propve
##  [1] 0.600763659 0.240951627 0.057017934 0.024508858 0.020313737 0.019236011
##  [7] 0.012296544 0.011172858 0.007004241 0.004730495 0.002004037
# Plot variance explained for each principal component
plot(propve, xlab = "Principal Component",
            ylab = "Proportion of Variance Explained",
            ylim = c(0, 1), type = "b",
            main = "Scree Plot")

# Plot the cumulative proportion of variance explained
plot(cumsum(propve),
    xlab = "Principal Component",
    ylab = "Cumulative Proportion of Variance Explained",
    ylim = c(0, 1), type = "b")

# Find the smallest number of principal components
# that explains at least 90% of the total variance
which(cumsum(propve) >= 0.9)[1]
## [1] 4
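The same threshold can also be read from the importance table returned by summary(). This optional check (not in the original code) confirms that four components cover at least 90% of the variance:

# Cumulative proportion of variance from the prcomp summary
summary(my_pca)$importance["Cumulative Proportion", ]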
# Predict disp using the first 4 principal components
# Build a training set containing disp and the first 4 PC scores
train.data <- data.frame(disp = mtcars$disp, my_pca$x[, 1:4])
 
# Running a decision tree algorithm on the principal components
# Loading packages
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.1.3
rpart.model <- rpart(disp ~ .,
                    data = train.data, method = "anova")
 
rpart.plot(rpart.model)
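To gauge how well the tree built on the first four principal components recovers disp, the fitted values can be compared with the observed values. A minimal sketch (not part of the original code), using the train.data object defined above:

# In-sample RMSE of the PCA-based regression tree (in the original units of disp)
pred_disp <- predict(rpart.model, newdata = train.data)
sqrt(mean((train.data$disp - pred_disp)^2))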

Conclusion

In conclusion, this primer has provided an overview of factor analysis, a statistical technique commonly used in research to identify underlying dimensions or constructs that explain the variability among a set of observed variables. We have covered the meaning and assumptions of factor analysis, the differences between exploratory factor analysis (EFA) and confirmatory factor analysis (CFA), and the procedure for conducting factor analysis. We have also discussed the role of the correlation matrix and a general model of the correlation matrix of individual variables, as well as methods for extracting factors, such as principal component analysis (PCA), and determining the number of factors to be extracted.

Furthermore, we have covered the meaning and interpretation of communality and eigenvalues, factor loading and rotation methods, such as varimax, and the meaning and interpretation of factor scores and their use in subsequent analyses. Throughout the paper, we have used the R software to provide reproducible examples and code for conducting factor analysis.

By understanding the fundamental concepts of factor analysis and how to apply it in their research, readers will be able to identify underlying constructs or dimensions that may not be directly observable, and use these constructs to better understand the relationships between variables. Factor analysis can be a useful tool for researchers in a variety of fields, and this tutorial paper has provided a comprehensive guide to help readers get started with conducting factor analysis in their own research.

Thanks for your attention