Source of the content: http://r-marketing.r-forge.r-project.org/Instructor/Intro%20Factor%20Analysis/intro-factor-analysis.pdf
Factor Analysis: Basic Framework
From the original variables, factor analysis (FA) tries to find a smaller number of derived variables (factors) that meet these conditions:
1. Maximally capture the correlations among the original variables (after accounting for error)
2. Each factor is associated clearly with a subset of the variables
3. Each variable is associated clearly with (ideally) only one factor
4. The factors are maximally differentiated from one another
These conditions are rarely met perfectly in practice, but when they are approximated, the solution is close to “simple structure” and is easy to interpret.
Another way to look at FA is that it seeks latent variables. A latent variable is an unobservable data generating process — such as a mental state — that is manifested in measurable quantities (such as survey items).
The product interest survey was designed to assess three latent variables:
General interest in a product category
Detailed interest in specific features
Interest in the product as an “image” product
Each of those is assessed with multiple items because any single item is imperfect.
People often confuse exploratory factor analysis (EFA) with confirmatory factor analysis (CFA). In brief: EFA discovers how many factors underlie the data and which variables load on them, while CFA tests whether a pre-specified factor structure fits the data.
KEY TERMS IN FACTOR ANALYSIS
- Latent variable: a presumed cognitive or data-generating process that leads to observable data; often a theoretical construct. Example: product interest. Diagram symbol: a circle or oval, such as F1.
- Factor: a dimensional reduction that estimates a latent variable and its relationship to manifest variables. Example: InterestFactor.
- Loading: the strength of the relationship between a factor and a variable; ranges from -1.0 to 1.0, like Pearson's r. Example: F1 → v1 = 0.45.
A typical factor analysis workflow is shown in the accompanying diagram.
Let us perform factor analysis on a dataset. The dataset contains 11 items of simulated product interest and engagement (PIES) data, rated on a 7-point Likert-type scale. We will determine the appropriate number of factors and the variables' loadings on them.
Some items' scoring was transformed before analysis.
The dataset dimensions (observations, items):

[1] 3600 11

The 11 items are: NotImportant, NeverThink, VeryInterested, LookFeatures, InvestigateDepth, SomeAreBetter, LearnAboutOptions, OthersOpinion, ExpressesPerson, TellAbout and MatchImage.
NotImportant | NeverThink | VeryInterested | LookFeatures |
---|---|---|---|
Min. :1.000 | Min. :1.000 | Min. :1.00 | Min. :1.000 |
1st Qu.:4.000 | 1st Qu.:3.000 | 1st Qu.:3.00 | 1st Qu.:3.000 |
Median :4.000 | Median :4.000 | Median :4.00 | Median :4.000 |
Mean :4.339 | Mean :4.104 | Mean :4.11 | Mean :4.039 |
3rd Qu.:5.000 | 3rd Qu.:5.000 | 3rd Qu.:5.00 | 3rd Qu.:5.000 |
Max. :7.000 | Max. :7.000 | Max. :7.00 | Max. :7.000 |
InvestigateDepth | SomeAreBetter | LearnAboutOptions | OthersOpinion |
---|---|---|---|
Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 |
1st Qu.:3.000 | 1st Qu.:3.000 | 1st Qu.:3.000 | 1st Qu.:3.000 |
Median :4.000 | Median :4.000 | Median :4.000 | Median :4.000 |
Mean :3.999 | Mean :3.922 | Mean :3.872 | Mean :3.904 |
3rd Qu.:5.000 | 3rd Qu.:5.000 | 3rd Qu.:5.000 | 3rd Qu.:5.000 |
Max. :7.000 | Max. :7.000 | Max. :7.000 | Max. :7.000 |
ExpressesPerson | TellAbout | MatchImage |
---|---|---|
Min. :1.000 | Min. :1.0 | Min. :1.000 |
1st Qu.:3.000 | 1st Qu.:3.0 | 1st Qu.:3.000 |
Median :4.000 | Median :4.0 | Median :4.000 |
Mean :4.023 | Mean :3.9 | Mean :3.853 |
3rd Qu.:5.000 | 3rd Qu.:5.0 | 3rd Qu.:4.250 |
Max. :7.000 | Max. :7.0 | Max. :7.000 |
There is usually no definitive answer; choosing the number of factors is partly a matter of usefulness.
Generally, look for consensus among:
- Theory: how many do you expect?
- Correlation matrix: how many seem to be there?
- Eigenvalues: how many Factors have Eigenvalue > 1?
- Eigenvalue scree plot: where is the “bend” in extraction?
- Parallel analysis and acceleration (a parallel analysis sketch follows this list)
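A minimal sketch of parallel analysis with psych::fa.parallel (assuming the psych package and the `data` data frame used throughout this case study):

```r
library(psych)
# Compares observed eigenvalues with those from random data of the same size;
# retain factors whose observed eigenvalue exceeds the random benchmark
fa.parallel(data, fa = "fa")
```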
In factor analysis, an eigenvalue is the proportion of total shared (i.e., non-error) variance explained by each factor.
A factor is only useful if it explains more variance than a single variable contributes . . . and thus has an eigenvalue > 1.0.
The eigenvalues of the item correlation matrix are:

3.661, 1.642, 1.275, 0.6881, 0.5801, 0.572, 0.5608, 0.5388, 0.529, 0.4834 and 0.4701

Three eigenvalues exceed 1.0, so this rule of thumb suggests 3 factors in the data.
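For reference, these eigenvalues come directly from the item correlation matrix; a minimal sketch mirroring the dashboard's own code:

```r
ev <- eigen(cor(data))$values  # eigenvalues of the 11 x 11 item correlation matrix
sum(ev > 1)                    # Kaiser criterion: counts 3 factors here
```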
EFA can be thought of as slicing a pizza. The same material (variance) can be carved up in ways that are mathematically identical but more or less useful for a given situation.
Key decision: do you want the extracted factors to be correlated or not? In FA jargon, orthogonal or oblique?
By default, EFA looks for orthogonal factors with r = 0 correlation. This maximizes interpretability, so orthogonal rotation is recommended in most cases, at least to start.
Some rotation options:
- varimax (default): orthogonal rotation that aims for clear factor/variable structure. Generally recommended.
- oblimin (oblique): finds correlated factors while aiming for interpretability. Recommended if you want an oblique solution.
- promax (oblique): finds correlated factors similarly, but is computationally different. A recommended alternative if oblimin is unavailable or has difficulty.
- Many others: dozens have been developed. They are useful mostly when you're very concerned about psychometrics.
Let us fit the model with orthogonal rotation.
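The output below was produced with psych::fa, as in the dashboard source further down:

```r
library(psych)
data.fa <- fa(data, nfactors = 3, rotate = "varimax")  # 3-factor EFA, varimax rotation
summary(data.fa)  # brief fit summary
data.fa           # full print: loadings, variance explained, fit statistics
```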
Factor analysis with Call: fa(r = data, nfactors = 3, rotate = "varimax")

Test of the hypothesis that 3 factors are sufficient.
The degrees of freedom for the model is 25 and the objective function was 0
The number of observations was 3600 with Chi Square = 17.17 with prob < 0.88

The root mean square of the residuals (RMSR) is 0
The df corrected root mean square of the residuals is 0.01

Tucker Lewis Index of factoring reliability = 1.002
RMSEA index = 0 and the 90 % confidence intervals are 0 0.007
BIC = -187.54

Factor Analysis using method = minres
Call: fa(r = data, nfactors = 3, rotate = "varimax")
Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR2 MR3 h2 u2 com
NotImportant 0.14 0.12 0.67 0.49 0.51 1.1
NeverThink 0.10 0.10 0.61 0.40 0.60 1.1
VeryInterested 0.28 0.36 0.48 0.44 0.56 2.5
LookFeatures 0.15 0.61 0.10 0.40 0.60 1.2
InvestigateDepth 0.13 0.72 0.10 0.54 0.46 1.1
SomeAreBetter 0.07 0.52 0.09 0.28 0.72 1.1
LearnAboutOptions 0.13 0.68 0.15 0.50 0.50 1.2
OthersOpinion 0.67 0.14 0.13 0.48 0.52 1.2
ExpressesPerson 0.71 0.14 0.13 0.53 0.47 1.1
TellAbout 0.65 0.12 0.14 0.46 0.54 1.2
MatchImage 0.63 0.13 0.08 0.42 0.58 1.1
MR1 MR2 MR3
SS loadings 1.94 1.83 1.17
Proportion Var 0.18 0.17 0.11
Cumulative Var 0.18 0.34 0.45
Proportion Explained 0.39 0.37 0.24
Cumulative Proportion 0.39 0.76 1.00
Mean item complexity = 1.3
Test of the hypothesis that 3 factors are sufficient.
The degrees of freedom for the null model are 55 and the objective function was 2.76 with Chi Square of 9905.74
The degrees of freedom for the model are 25 and the objective function was 0
The root mean square of the residuals (RMSR) is 0
The df corrected root mean square of the residuals is 0.01
The harmonic number of observations is 3600 with the empirical chi square 9.78 with prob < 1
The total number of observations was 3600 with Likelihood Chi Square = 17.17 with prob < 0.88
Tucker Lewis Index of factoring reliability = 1.002
RMSEA index = 0 and the 90 % confidence intervals are 0 0.007
BIC = -187.54
Fit based upon off diagonal values = 1
Measures of factor score adequacy
MR1 MR2 MR3
Correlation of (regression) scores with factors 0.87 0.86 0.79
Multiple R square of scores with factors 0.75 0.73 0.62
Minimum correlation of possible factor scores 0.50 0.46 0.24
Loadings above 0.30 (blanks indicate loadings of 0.30 or less):

| | MR1 | MR2 | MR3 |
|---|---|---|---|
| NotImportant | | | 0.674 |
| NeverThink | | | 0.614 |
| VeryInterested | | 0.362 | 0.476 |
| LookFeatures | | 0.607 | |
| InvestigateDepth | | 0.716 | |
| SomeAreBetter | | 0.518 | |
| LearnAboutOptions | | 0.678 | |
| OthersOpinion | 0.665 | | |
| ExpressesPerson | 0.706 | | |
| TellAbout | 0.655 | | |
| MatchImage | 0.633 | | |
The first six respondents' factor scores:

| | ImageF | FeatureF | GeneralF |
|---|---|---|---|
| 1 | 0.5111 | -1.241 | 0.7928 |
| 2 | -0.06877 | 0.2804 | 0.6641 |
| 3 | -0.3027 | -0.1043 | -0.8785 |
| 4 | -0.8661 | -1.106 | 0.4226 |
| 5 | -0.692 | -0.08929 | -0.403 |
| 6 | 1.534 | -0.3898 | -0.06599 |
The last six respondents' factor scores:

| | ImageF | FeatureF | GeneralF |
|---|---|---|---|
3595 | -0.0229 | -0.204 | -0.08209 |
3596 | 0.2773 | 0.4208 | 0.4915 |
3597 | 1.938 | -1.261 | 0.3646 |
3598 | -0.4837 | 0.4699 | 1.331 |
3599 | -0.5671 | 0.9789 | 0.2916 |
3600 | -0.9219 | 0.6484 | 1.162 |
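These per-respondent scores come from the $scores element of the fitted fa object, as in the dashboard source below:

```r
fa.scores <- data.frame(data.fa$scores)  # one row of factor scores per respondent
names(fa.scores) <- c("ImageF", "FeatureF", "GeneralF")
head(fa.scores)
tail(fa.scores)
```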
Confirmatory Factor Analysis
CFA is a special case of structural equation modeling (SEM), applied to latent variable assessment, usually for surveys and similar data. It is used to:
1. Assess the structure of survey scales — do items load where one would hope?
2. Evaluate the fit/appropriateness of a factor model — is a proposed model better than alternatives?
3. Evaluate the weights of items relative to one another and to a scale — do they contribute equally?
4. Model other effects, such as method effects and hierarchical relationships.
Steps in CFA:
1. Define your hypothesized/favored model, with relationships of latent variables to manifest variables.
2. Define one or more alternative models that are reasonable, but which you believe are inferior.
3. Fit the models to your data.
4. Determine whether your model is good enough (fit indices, paths).
5. Determine whether your model is better than the alternatives.
6. Interpret your model.
Let's fit a 3-factor model to our data and compare it with the 1-factor model shown below.
Model fit indices are measures used to evaluate and compare CFA models. The most frequently used are:
Global fit indices
Example: Comparative Fit Index (CFI). Attempts to assess “absolute” fit vs. the data. Not a very good measure on its own, but it sets a minimum bar: want fit > 0.90.
Approximation error and residuals
Example: Standardized Root Mean Square Residual (SRMR), the difference between the data's covariance matrix and the fitted model's matrix. Want SRMR < 0.08. For the Root Mean Square Error of Approximation, want Lower-CI(RMSEA) < 0.05.
Information criteria
Example: Akaike Information Criterion (AIC). Assesses the model's fit vs. the observed data. There is no absolute interpretation, but lower is better; a difference of 10 or more is large.
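In lavaan, these indices can be extracted from a fitted model with fitMeasures(); a minimal sketch (fit3 is the 3-factor model fitted below):

```r
# after fitting, e.g., fit3 <- cfa(Model3, data = data)
fitMeasures(fit3, c("cfi", "tli", "rmsea", "rmsea.ci.lower", "srmr", "aic", "bic"))
```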
library(lavaan)
Model3 <- " General =~ NotImportant + NeverThink + VeryInterested
Feature =~ LookFeatures + InvestigateDepth + SomeAreBetter + LearnAboutOptions
Image =~ OthersOpinion + ExpressesPerson + TellAbout + MatchImage"
fit3 <- cfa(Model3, data=data)
pander(summary(fit3, fit.measures=TRUE))
lavaan 0.6-6 ended normally after 36 iterations

  Estimator: ML
  Optimization method: NLMINB
  Number of free parameters: 25
  Number of observations: 3600

Model Test User Model:
  Test statistic: 287.649
  Degrees of freedom: 41
  P-value (Chi-square): 0.000

Model Test Baseline Model:
  Test statistic: 9920.901
  Degrees of freedom: 55
  P-value: 0.000

User Model versus Baseline Model:
  Comparative Fit Index (CFI): 0.975
  Tucker-Lewis Index (TLI): 0.966

Loglikelihood and Information Criteria:
  Loglikelihood user model (H0): -52885.888
  Loglikelihood unrestricted model (H1): -52742.064
  Akaike (AIC): 105821.776
  Bayesian (BIC): 105976.494
  Sample-size adjusted Bayesian (BIC): 105897.056

Root Mean Square Error of Approximation:
  RMSEA: 0.041
  90 percent confidence interval: 0.036 to 0.045
  P-value RMSEA <= 0.05: 1.000

Standardized Root Mean Square Residual:
  SRMR: 0.030

Parameter Estimates:
  Standard errors: Standard
  Information: Expected
  Information saturated (h1) model: Structured

Latent Variables (Estimate, Std.Err, z-value, P(>|z|)):
  General =~
    NotImportant: 1.000
    NeverThink: 0.948, 0.042, 22.415, 0.000
    VeryInterested: 1.305, 0.052, 25.268, 0.000
  Feature =~
    LookFeatures: 1.000
    InvestigateDepth: 1.168, 0.037, 31.168, 0.000
    SomeAreBetter: 0.822, 0.033, 25.211, 0.000
    LearnAboutOptions: 1.119, 0.036, 31.022, 0.000
  Image =~
    OthersOpinion: 1.000
    ExpressesPerson: 0.963, 0.028, 34.657, 0.000
    TellAbout: 0.908, 0.027, 33.146, 0.000
    MatchImage: 0.850, 0.027, 31.786, 0.000

Covariances (Estimate, Std.Err, z-value, P(>|z|)):
  General ~~ Feature: 0.217, 0.012, 17.561, 0.000
  General ~~ Image: 0.231, 0.013, 17.348, 0.000
  Feature ~~ Image: 0.202, 0.013, 15.650, 0.000

Variances (Estimate, Std.Err, z-value, P(>|z|)):
  .NotImportant: 0.657, 0.020, 33.498, 0.000
  .NeverThink: 0.796, 0.022, 35.967, 0.000
  .VeryInterested: 0.463, 0.022, 21.479, 0.000
  .LookFeatures: 0.657, 0.019, 33.973, 0.000
  .InvestigateDepth: 0.554, 0.019, 28.588, 0.000
  .SomeAreBetter: 0.779, 0.021, 37.701, 0.000
  .LearnAboutOptions: 0.533, 0.018, 29.199, 0.000
  .OthersOpinion: 0.640, 0.020, 32.071, 0.000
  .ExpressesPerson: 0.476, 0.016, 29.501, 0.000
  .TellAbout: 0.560, 0.017, 32.697, 0.000
  .MatchImage: 0.599, 0.017, 34.500, 0.000
  General: 0.337, 0.021, 15.799, 0.000
  Feature: 0.446, 0.024, 18.684, 0.000
  Image: 0.591, 0.028, 21.092, 0.000
FIT:
npar | fmin | chisq | df | pvalue | baseline.chisq | baseline.df |
---|---|---|---|---|---|---|
25 | 0.03995 | 287.6 | 41 | 0 | 9921 | 55 |
baseline.pvalue | cfi | tli | logl | unrestricted.logl | aic |
---|---|---|---|---|---|
0 | 0.975 | 0.9665 | -52886 | -52742 | 105822 |
bic | ntotal | bic2 | rmsea | rmsea.ci.lower | rmsea.ci.upper |
---|---|---|---|---|---|
105976 | 3600 | 105897 | 0.04088 | 0.03649 | 0.0454 |
rmsea.pvalue | srmr |
---|---|
0.9996 | 0.03033 |
PE:
lhs | op | rhs | exo | est | se | z | pvalue |
---|---|---|---|---|---|---|---|
General | =~ | NotImportant | 0 | 1 | 0 | NA | NA |
General | =~ | NeverThink | 0 | 0.9484 | 0.04231 | 22.41 | 0 |
General | =~ | VeryInterested | 0 | 1.305 | 0.05165 | 25.27 | 0 |
Feature | =~ | LookFeatures | 0 | 1 | 0 | NA | NA |
Feature | =~ | InvestigateDepth | 0 | 1.168 | 0.03748 | 31.17 | 0 |
Feature | =~ | SomeAreBetter | 0 | 0.8216 | 0.03259 | 25.21 | 0 |
Feature | =~ | LearnAboutOptions | 0 | 1.119 | 0.03606 | 31.02 | 0 |
Image | =~ | OthersOpinion | 0 | 1 | 0 | NA | NA |
Image | =~ | ExpressesPerson | 0 | 0.9629 | 0.02778 | 34.66 | 0 |
Image | =~ | TellAbout | 0 | 0.9075 | 0.02738 | 33.15 | 0 |
Image | =~ | MatchImage | 0 | 0.8499 | 0.02674 | 31.79 | 0 |
NotImportant | ~~ | NotImportant | 0 | 0.6575 | 0.01963 | 33.5 | 0 |
NeverThink | ~~ | NeverThink | 0 | 0.7964 | 0.02214 | 35.97 | 0 |
VeryInterested | ~~ | VeryInterested | 0 | 0.4631 | 0.02156 | 21.48 | 0 |
LookFeatures | ~~ | LookFeatures | 0 | 0.6568 | 0.01933 | 33.97 | 0 |
InvestigateDepth | ~~ | InvestigateDepth | 0 | 0.5543 | 0.01939 | 28.59 | 0 |
SomeAreBetter | ~~ | SomeAreBetter | 0 | 0.7794 | 0.02067 | 37.7 | 0 |
LearnAboutOptions | ~~ | LearnAboutOptions | 0 | 0.5325 | 0.01824 | 29.2 | 0 |
OthersOpinion | ~~ | OthersOpinion | 0 | 0.6399 | 0.01995 | 32.07 | 0 |
ExpressesPerson | ~~ | ExpressesPerson | 0 | 0.4761 | 0.01614 | 29.5 | 0 |
TellAbout | ~~ | TellAbout | 0 | 0.56 | 0.01713 | 32.7 | 0 |
MatchImage | ~~ | MatchImage | 0 | 0.5991 | 0.01737 | 34.5 | 0 |
General | ~~ | General | 0 | 0.3366 | 0.0213 | 15.8 | 0 |
Feature | ~~ | Feature | 0 | 0.4462 | 0.02388 | 18.68 | 0 |
Image | ~~ | Image | 0 | 0.5908 | 0.02801 | 21.09 | 0 |
General | ~~ | Feature | 0 | 0.217 | 0.01236 | 17.56 | 0 |
General | ~~ | Image | 0 | 0.2312 | 0.01333 | 17.35 | 0 |
Feature | ~~ | Image | 0 | 0.2023 | 0.01292 | 15.65 | 0 |
Model1 <- " Int =~ NotImportant + NeverThink + VeryInterested + LookFeatures + InvestigateDepth + SomeAreBetter + LearnAboutOptions + OthersOpinion + ExpressesPerson + TellAbout + MatchImage"
fit1 <- cfa(Model1, data=data)
pander(summary(fit1, fit.measures=TRUE))
lavaan 0.6-6 ended normally after 33 iterations

  Estimator: ML
  Optimization method: NLMINB
  Number of free parameters: 22
  Number of observations: 3600

Model Test User Model:
  Test statistic: 3284.581
  Degrees of freedom: 44
  P-value (Chi-square): 0.000

Model Test Baseline Model:
  Test statistic: 9920.901
  Degrees of freedom: 55
  P-value: 0.000

User Model versus Baseline Model:
  Comparative Fit Index (CFI): 0.672
  Tucker-Lewis Index (TLI): 0.589

Loglikelihood and Information Criteria:
  Loglikelihood user model (H0): -54384.354
  Loglikelihood unrestricted model (H1): -52742.064
  Akaike (AIC): 108812.709
  Bayesian (BIC): 108948.860
  Sample-size adjusted Bayesian (BIC): 108878.955

Root Mean Square Error of Approximation:
  RMSEA: 0.143
  90 percent confidence interval: 0.139 to 0.147
  P-value RMSEA <= 0.05: 0.000

Standardized Root Mean Square Residual:
  SRMR: 0.102

Parameter Estimates:
  Standard errors: Standard
  Information: Expected
  Information saturated (h1) model: Structured

Latent Variables (Estimate, Std.Err, z-value, P(>|z|)):
  Int =~
    NotImportant: 1.000
    NeverThink: 0.913, 0.058, 15.851, 0.000
    VeryInterested: 1.475, 0.071, 20.794, 0.000
    LookFeatures: 1.251, 0.066, 19.028, 0.000
    InvestigateDepth: 1.355, 0.069, 19.534, 0.000
    SomeAreBetter: 0.979, 0.059, 16.673, 0.000
    LearnAboutOptions: 1.341, 0.068, 19.734, 0.000
    OthersOpinion: 1.553, 0.076, 20.502, 0.000
    ExpressesPerson: 1.465, 0.070, 20.789, 0.000
    TellAbout: 1.405, 0.069, 20.335, 0.000
    MatchImage: 1.312, 0.066, 19.815, 0.000

Variances (Estimate, Std.Err, z-value, P(>|z|)):
  .NotImportant: 0.821, 0.020, 40.286, 0.000
  .NeverThink: 0.955, 0.023, 40.895, 0.000
  .VeryInterested: 0.660, 0.018, 36.603, 0.000
  .LookFeatures: 0.832, 0.021, 39.117, 0.000
  .InvestigateDepth: 0.845, 0.022, 38.601, 0.000
  .SomeAreBetter: 0.915, 0.023, 40.584, 0.000
  .LearnAboutOptions: 0.780, 0.020, 38.362, 0.000
  .OthersOpinion: 0.813, 0.022, 37.195, 0.000
  .ExpressesPerson: 0.652, 0.018, 36.614, 0.000
  .TellAbout: 0.705, 0.019, 37.490, 0.000
  .MatchImage: 0.728, 0.019, 38.259, 0.000
  Int: 0.173, 0.015, 11.794, 0.000
FIT:
npar | fmin | chisq | df | pvalue | baseline.chisq | baseline.df |
---|---|---|---|---|---|---|
22 | 0.4562 | 3285 | 44 | 0 | 9921 | 55 |
baseline.pvalue | cfi | tli | logl | unrestricted.logl | aic |
---|---|---|---|---|---|
0 | 0.6715 | 0.5894 | -54384 | -52742 | 108813 |
bic | ntotal | bic2 | rmsea | rmsea.ci.lower | rmsea.ci.upper |
---|---|---|---|---|---|
108949 | 3600 | 108879 | 0.143 | 0.1389 | 0.1472 |
rmsea.pvalue | srmr |
---|---|
0 | 0.1019 |
PE:
lhs | op | rhs | exo | est | se | z | pvalue |
---|---|---|---|---|---|---|---|
Int | =~ | NotImportant | 0 | 1 | 0 | NA | NA |
Int | =~ | NeverThink | 0 | 0.9127 | 0.05758 | 15.85 | 0 |
Int | =~ | VeryInterested | 0 | 1.475 | 0.07095 | 20.79 | 0 |
Int | =~ | LookFeatures | 0 | 1.251 | 0.06573 | 19.03 | 0 |
Int | =~ | InvestigateDepth | 0 | 1.355 | 0.06937 | 19.53 | 0 |
Int | =~ | SomeAreBetter | 0 | 0.9794 | 0.05874 | 16.67 | 0 |
Int | =~ | LearnAboutOptions | 0 | 1.341 | 0.06795 | 19.73 | 0 |
Int | =~ | OthersOpinion | 0 | 1.553 | 0.07575 | 20.5 | 0 |
Int | =~ | ExpressesPerson | 0 | 1.465 | 0.07049 | 20.79 | 0 |
Int | =~ | TellAbout | 0 | 1.405 | 0.06908 | 20.34 | 0 |
Int | =~ | MatchImage | 0 | 1.312 | 0.06622 | 19.82 | 0 |
NotImportant | ~~ | NotImportant | 0 | 0.821 | 0.02038 | 40.29 | 0 |
NeverThink | ~~ | NeverThink | 0 | 0.9549 | 0.02335 | 40.89 | 0 |
VeryInterested | ~~ | VeryInterested | 0 | 0.6598 | 0.01802 | 36.6 | 0 |
LookFeatures | ~~ | LookFeatures | 0 | 0.8322 | 0.02127 | 39.12 | 0 |
InvestigateDepth | ~~ | InvestigateDepth | 0 | 0.8453 | 0.0219 | 38.6 | 0 |
SomeAreBetter | ~~ | SomeAreBetter | 0 | 0.9146 | 0.02254 | 40.58 | 0 |
LearnAboutOptions | ~~ | LearnAboutOptions | 0 | 0.7797 | 0.02032 | 38.36 | 0 |
OthersOpinion | ~~ | OthersOpinion | 0 | 0.8134 | 0.02187 | 37.19 | 0 |
ExpressesPerson | ~~ | ExpressesPerson | 0 | 0.6522 | 0.01781 | 36.61 | 0 |
TellAbout | ~~ | TellAbout | 0 | 0.7051 | 0.01881 | 37.49 | 0 |
MatchImage | ~~ | MatchImage | 0 | 0.7279 | 0.01903 | 38.26 | 0 |
Int | ~~ | Int | 0 | 0.1731 | 0.01468 | 11.79 | 0 |
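To compare the two models formally: the 1-factor model is nested in the 3-factor model, so lavaan's anova() method gives a chi-square difference test; the AIC and BIC gaps above (roughly 3,000 in favor of the 3-factor model) point the same way. A minimal sketch:

```r
anova(fit1, fit3)  # likelihood-ratio (chi-square difference) test of the nested models
```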
=============================================
---
title: "Factor Analysis Case Study"
output:
flexdashboard::flex_dashboard:
orientation: columns
vertical_layout: fill
social : ["facebook","twitter","linkedin", "menu"]
source_code: embed
---
Introduction {data-navmenu="LearnMore"}
======================================
Source of the content: http://r-marketing.r-forge.r-project.org/Instructor/Intro%20Factor%20Analysis/intro-factor-analysis.pdf
Factor Analysis: Basic Framework
From the original variables, factor analysis (FA) tries to find a smaller number of derived variables (factors) that meet these conditions:
1. Maximally capture the correlations among the original variables (after accounting for error)
2. Each factor is associated clearly with a subset of the variables
3. Each variable is associated clearly with (ideally) only one factor
4. The factors are maximally differentiated from one another
These conditions are rarely met perfectly in practice, but when they are approximated, the solution is close to “simple structure” and is easy to interpret.
Another way to look at FA is that it seeks latent variables. A latent
variable is an unobservable data generating process — such as a
mental state — that is manifested in measurable quantities (such as
survey items).
The product interest survey was designed to assess three latent
variables:
General interest in a product category
Detailed interest in specific features
Interest in the product as an “image” product
Each of those is assessed with multiple items because any single
item is imperfect.
People often confuse exploratory factor analysis (EFA) with confirmatory factor analysis (CFA). In brief: EFA discovers how many factors underlie the data and which variables load on them, while CFA tests whether a pre-specified factor structure fits the data.
KEY TERMS IN FACTOR ANALYSIS
- Latent variable: a presumed cognitive or data-generating process that leads to observable data; often a theoretical construct. Example: product interest. Diagram symbol: a circle or oval, such as F1.
- Factor: a dimensional reduction that estimates a latent variable and its relationship to manifest variables. Example: InterestFactor.
- Loading: the strength of the relationship between a factor and a variable; ranges from -1.0 to 1.0, like Pearson's r. Example: F1 → v1 = 0.45.
A typical factor analysis workflow is shown in the accompanying diagram.
Let us perform factor analysis on a dataset. The dataset contains 11 items of simulated product interest and engagement (PIES) data, rated on a 7-point Likert-type scale. We will determine the appropriate number of factors and the variables' loadings on them.
Some items' scoring was transformed before analysis.
Dataset {data-navmenu="LearnMore"}
===============================================================
Column {.tabset}
-----------------------------------------
### Data Table
```{r}
library(pander)
data=read.csv("factoranalysis.csv", header=T, stringsAsFactors = FALSE)
colnames(data)=c("NotImportant","NeverThink","VeryInterested","LookFeatures","InvestigateDepth","SomeAreBetter","LearnAboutOptions","OthersOpinion","ExpressesPerson","TellAbout","MatchImage")
DT::datatable(data, caption="Data View of simulated product interest and engagement data",
filter="top")
```
### Dimension
```{r}
dim(data)
```
### Variable Names
```{r}
pander(colnames(data))
```
### Summary
```{r}
pander(summary(data))
```
### Correlation Plot
```{r}
corrplot::corrplot(cor(data), diag = FALSE)
```
Determining Number of Factors {data-navmenu="LearnMore"}
==========================================================
There is usually no definitive answer; choosing the number of factors is partly a matter of usefulness.
Generally, look for consensus among:
- Theory: how many do you expect?
- Correlation matrix: how many seem to be there?
- Eigenvalues: how many Factors have Eigenvalue > 1?
- Eigenvalue scree plot: where is the “bend” in extraction?
- Parallel analysis and acceleration (see the Parallel Analysis tab)
Column {.tabset}
-------------------------------------
### Eigenvalue
In factor analysis, an eigenvalue is the proportion of total shared (i.e., non-error) variance explained by each factor.
A factor is only useful if it explains more variance than a single variable contributes . . . and thus has an eigenvalue > 1.0.
```{r}
pander(eigen(cor(data))$values)
```
Three eigenvalues exceed 1.0, so this rule of thumb suggests 3 factors in the data.
### ScreePlot
```{r}
pc <- princomp(data, cor=TRUE)  # note: prcomp() has no 'cor' argument; princomp(cor=TRUE) works on the correlation matrix
screeplot(pc, type="lines")
```
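### Parallel Analysis

Parallel analysis compares the observed eigenvalues with eigenvalues of random data of the same dimensions; a minimal sketch, assuming the psych package is installed:

```{r}
psych::fa.parallel(data, fa="fa")
```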
Factor Rotation Model {data-navmenu="LearnMore"}
=========================================
EFA can be thought of as slicing a pizza. The same material (variance) can be carved up in ways that are mathematically identical but more or less useful for a given situation.
Key decision: do you want the extracted factors to be correlated or not? In FA jargon, orthogonal or oblique?
By default, EFA looks for orthogonal factors with r = 0 correlation. This maximizes interpretability, so orthogonal rotation is recommended in most cases, at least to start.
Some rotation options:
- varimax (default): orthogonal rotation that aims for clear factor/variable structure. Generally recommended.
- oblimin (oblique): finds correlated factors while aiming for interpretability. Recommended if you want an oblique solution.
- promax (oblique): finds correlated factors similarly, but is computationally different. A recommended alternative if oblimin is unavailable or has difficulty.
- Many others: dozens have been developed. They are useful mostly when you're very concerned about psychometrics.
Fitting the Model {data-navmenu="LearnMore"}
===================================================
Column {.tabset}
---------------------------------------
### Fitting the Model
Let us fit the model with orthogonal rotation.
```{r echo=TRUE}
library(psych)
data.fa <- fa(data, nfactors=3, rotate="varimax")
pander(summary(data.fa))
```
Summary of the Model
```{r echo=TRUE}
data.fa
```
### Factor Loadings on Manifest Variables
Generally, a factor loading on a manifest variable is considered meaningful when its absolute value exceeds 0.30; smaller loadings are blanked out below:
```{r echo=FALSE}
L <- round(data.fa$loadings, digits = 3)
struct <- ifelse(abs(L) > 0.3, as.numeric(L), ' ')  # blank out loadings with |value| <= 0.3
pander(struct)
```
### Visualization
```{r echo=FALSE}
fa.diagram(data.fa)
```
Now we can name these factors based on which items load on each. Suggestions?
You can also repeat the previous step with different rotations and compare the results in terms of interpretability, as in the sketch below.
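For example (a sketch, not evaluated here; oblimin requires the GPArotation package):

```{r echo=TRUE, eval=FALSE}
fa(data, nfactors=3, rotate="oblimin")  # oblique: allows correlated factors
```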
### Use the Factor Scores for Each Respondent
```{r}
fa.scores <- data.frame(data.fa$scores)
names(fa.scores) <- c("ImageF", "FeatureF", "GeneralF")
pander(head(fa.scores))
pander(tail(fa.scores))
```
Confirmatory Factor Analysis {data-navmenu="LearnMore"}
========================================
Column {.tabset}
----------------------------------------------
### CFA Introduction
Confirmatory Factor Analysis
CFA is a special case of structural equation modeling (SEM), applied to latent variable assessment, usually for surveys and similar data. It is used to:
1. Assess the structure of survey scales — do items load where one would hope?
2. Evaluate the fit/appropriateness of a factor model — is a proposed model better than alternatives?
3. Evaluate the weights of items relative to one another and to a scale — do they contribute equally?
4. Model other effects, such as method effects and hierarchical relationships.
Steps in CFA:
1. Define your hypothesized/favored model, with relationships of latent variables to manifest variables.
2. Define one or more alternative models that are reasonable, but which you believe are inferior.
3. Fit the models to your data.
4. Determine whether your model is good enough (fit indices, paths).
5. Determine whether your model is better than the alternatives.
6. Interpret your model.
Let's fit a 3-factor model to our data and compare it with the 1-factor model in the next tab.
### Model Fit Measures
Model fit indices are measures used to evaluate and compare CFA models. The most frequently used are:
Global fit indices
Example: Comparative Fit Index (CFI). Attempts to assess “absolute” fit vs. the data. Not a very good measure on its own, but it sets a minimum bar: want fit > 0.90.
Approximation error and residuals
Example: Standardized Root Mean Square Residual (SRMR), the difference between the data's covariance matrix and the fitted model's matrix. Want SRMR < 0.08. For the Root Mean Square Error of Approximation, want Lower-CI(RMSEA) < 0.05.
Information criteria
Example: Akaike Information Criterion (AIC). Assesses the model's fit vs. the observed data. There is no absolute interpretation, but lower is better; a difference of 10 or more is large.
### 3 Factor Model
```{r echo=TRUE}
library(lavaan)
Model3 <- " General =~ NotImportant + NeverThink + VeryInterested
Feature =~ LookFeatures + InvestigateDepth + SomeAreBetter + LearnAboutOptions
Image =~ OthersOpinion + ExpressesPerson + TellAbout + MatchImage"
fit3 <- cfa(Model3, data=data)
pander(summary(fit3, fit.measures=TRUE))
```
### 1 Factor Model
```{r echo=TRUE}
Model1 <- " Int =~ NotImportant + NeverThink + VeryInterested + LookFeatures + InvestigateDepth + SomeAreBetter + LearnAboutOptions + OthersOpinion + ExpressesPerson + TellAbout + MatchImage"
fit1 <- cfa(Model1, data=data)
pander(summary(fit1, fit.measures=TRUE))
```
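### Model Comparison

The 1-factor model is nested in the 3-factor model, so a chi-square difference test applies; lavaan's anova() method computes it, and AIC/BIC can also be compared directly (lower is better). A minimal sketch:

```{r}
anova(fit1, fit3)
```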
### Model Paths
```{r}
semPlot::semPaths(fit3, "std")
```
About Us {data-navmenu="LearnMore"}
=====================================================
### DASHBOARD PREPARED BY (contact for machine learning training, coaching, consulting & the complete analysis of this case study)
* Dr AMITA SHARMA
Post Doc from Erasmus University, Rotterdam, the Netherlands
Assistant Professor
Institute of Agri Business Management,
Swami Keshwanand Rajasthan Agricultural University,
Bikaner (Raj), India
Blog: www.thinkingai.in
* ARUN KUMAR SHARMA
Machine Learning Enthusiast
13 years of financial services marketing experience
Blogger, writer and machine learning consultant
Certified Business Analytics Professional
Certified in Predictive Analytics, Indian Institute of Management (IIMx Bangalore)
Certified in Macroeconomic Forecasting, International Monetary Fund (IMFx)
Certified in Text Analytics, openSAP
Email: aks10000@gmail.com
Tel: 9468567418