Become familiar with exploratory factor analysis (EFA), another dimensionality reduction technique that is a natural extension to PCA.
Two data frames - hsq and hsq_correl - have been loaded. hsq contains the Humor Styles Questionnaire [HSQ] dataset, which includes responses from 1071 participants on 32 questions. We also calculated the polychoric correlation for you using the mixedCor() function of the psych package:
hsq_correl <- mixedCor(hsq, c = NULL, p = 1:32)
Polychoric correlation is the correlation between ordinal variables. Above, we indicated that columns 1 to 32 are ordinal by specifying p, and that there are no numeric variables by setting c to NULL.
Another way of calculating polychoric correlations is by using the hetcor() function of the polycor package. It stores the correlation matrix in the correlations attribute of the calculated object.
library(psych)
library(tidyverse)
library(DT)
# Check Dataset
url <- "https://assets.datacamp.com/production/course_4249/datasets/humor_dataset.csv"
hsq <- read.csv(url, sep = ";")
datatable(head(hsq))
# Check out the dimensionality of hsq.
dim(hsq)
## [1] 1071 39
hsq_correl <- mixedCor(hsq, c=NULL, p=1:32)
## Warning in matpLower(x, nvar, gminx, gmaxx, gminy, gmaxy): 496 cells were
## adjusted for 0 values using the correction for continuity. Examine your
## data carefully.
str(hsq_correl)
## List of 6
## $ rho : num [1:32, 1:32] 1 -0.2094 -0.1772 -0.0945 -0.4466 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## $ rx : NULL
## $ poly :List of 4
## ..$ rho : num [1:32, 1:32] 1 -0.2094 -0.1772 -0.0945 -0.4466 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## ..$ tau : num [1:32, 1:6] -2.77 -2.77 -2.9 -3.11 -2.9 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. .. ..$ : chr [1:6] "1" "2" "3" "4" ...
## ..$ n.obs: int 1071
## ..$ Call : language polychoric(x = data[, p], smooth = smooth, global = global, weight = weight, correct = correct)
## ..- attr(*, "class")= chr [1:2] "psych" "poly"
## $ tetra:List of 2
## ..$ rho: NULL
## ..$ tau: NULL
## $ rpd : NULL
## $ Call : language mixedCor(data = hsq, c = NULL, p = 1:32)
## - attr(*, "class")= chr [1:2] "psych" "mixed"
# Getting the correlation matrix of the dataset.
hsq_polychoric <- hsq_correl$rho
# Explore the correlation structure of the dataset.
library(ggcorrplot)
ggcorrplot(hsq_polychoric)
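As a hedged sketch of the hetcor() alternative mentioned above (it assumes the 32 items are the first 32 columns of hsq; hetcor() only computes polychoric correlations for ordered factors, hence the coercion), the resulting matrix should closely match hsq_polychoric:
library(polycor)
# Coerce the items to ordered factors, then compute the heterogeneous
# correlation matrix; the polychoric correlations live in $correlations.
hsq_items <- as.data.frame(lapply(hsq[, 1:32], ordered))
hsq_hetcor <- hetcor(hsq_items)
hsq_polychoric_alt <- hsq_hetcor$correlations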
As mentioned in the video, before reducing dimensions with EFA, we first need to make sure that our dataset is factorable. In other words, the first step in performing EFA is to check whether it is even worth doing. This dilemma is captured by the following question: Is there sufficient correlation among the observed variables of our dataset to allow for dimensionality reduction in the first place?
hsq_polychoric, calculated above with the mixedCor() function of the psych package, is the correlation matrix of the Humor Styles Questionnaire [HSQ] dataset that you will be working with throughout this chapter and part of the next.
In this exercise, your mission is to decide whether HSQ is factorable enough to allow an EFA.
# Apply the Bartlett test on the correlation matrix.
library(polycor)
##
## Attaching package: 'polycor'
## The following object is masked from 'package:psych':
##
## polyserial
cortest.bartlett(hsq_polychoric)
## Warning in cortest.bartlett(hsq_polychoric): n not specified, 100 used
## $chisq
## [1] 1114.409
##
## $p.value
## [1] 1.610583e-49
##
## $df
## [1] 496
# Check the KMO index.
KMO(hsq_polychoric)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = hsq_polychoric)
## Overall MSA = 0.87
## MSA for each item =
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15
## 0.94 0.93 0.91 0.90 0.91 0.88 0.82 0.86 0.95 0.86 0.78 0.90 0.85 0.93 0.82
## Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30
## 0.85 0.87 0.83 0.89 0.83 0.87 0.84 0.81 0.84 0.83 0.89 0.83 0.93 0.87 0.81
## Q31 Q32
## 0.81 0.91
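Notice the warning from cortest.bartlett(): since we only supplied a correlation matrix, it fell back to a default sample size of n = 100. A more faithful call (a sketch; the chi-square and p-value change, the conclusion does not) passes the real number of respondents:
# Pass the actual sample size so the chi-square test is computed correctly.
cortest.bartlett(hsq_polychoric, n = nrow(hsq))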
The p.value attribute of cortest.bartlett()’s output is far below 0.05, and the overall MSA attribute of KMO()’s output, 0.87, is close to 1. Both results therefore suggest that the dataset is factorable and that EFA is worth pursuing.
It is about time to turn our attention to creating our first EFA model object. As mentioned in the video, the default extraction method in the fa() function of the psych package is minimum residuals (minres). Recall that fa() takes as its main argument either a data frame or a correlation matrix. In this exercise, your job is to manipulate the values of the function’s default arguments and inspect the EFA model object’s attributes to find out which of the variables load well on the 4 factors that we choose to extract below.
For conducting EFA, you will use the correlation matrix hsq_polychoric, loaded in your workspace and calculated with the mixedCor() function on our initial dataset, hsq.
# EFA with 4 factors.
f_hsq <- fa(hsq_polychoric, nfactors = 4)
## Loading required namespace: GPArotation
# Inspect the resulting EFA object.
str(f_hsq)
## List of 44
## $ residual : num [1:32, 1:32] 0.5202 0.01625 0.02522 -0.00162 0.00493 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## $ dof : num 374
## $ ENull : num NA
## $ chi : num NA
## $ rms : num 0.0411
## $ nh : logi NA
## $ EPVAL : num NA
## $ crms : num 0.0473
## $ EBIC : num NA
## $ ESABIC : num NA
## $ fit : num 0.849
## $ fit.off : num 0.969
## $ sd : num 0.0397
## $ factors : num 4
## $ complexity : Named num [1:32] 1.02 1.05 1.29 1 1.08 ...
## ..- attr(*, "names")= chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## $ n.obs : logi NA
## $ PVAL : logi NA
## $ objective : num 2.19
## $ criteria : Named num [1:3] 2.19 NA NA
## ..- attr(*, "names")= chr [1:3] "objective" "" ""
## $ Call : language fa(r = hsq_polychoric, nfactors = 4)
## $ null.model : num 12.7
## $ null.dof : num 496
## $ r.scores : num [1:4, 1:4] 1 -0.166 -0.434 0.212 -0.166 ...
## $ R2 : num [1:4] 0.886 0.867 0.866 0.821
## $ valid : num [1:4] 0.93 0.915 0.902 0.889
## $ score.cor : num [1:4, 1:4] 1 -0.198 -0.44 0.229 -0.198 ...
## $ weights : num [1:32, 1:4] 0.14289 -0.01175 -0.01633 0.00872 -0.13143 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
## $ rotation : chr "oblimin"
## $ communality : Named num [1:32] 0.48 0.408 0.363 0.407 0.433 ...
## ..- attr(*, "names")= chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## $ communalities: Named num [1:32] 0.48 0.408 0.363 0.407 0.433 ...
## ..- attr(*, "names")= chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## $ uniquenesses : Named num [1:32] 0.52 0.592 0.637 0.593 0.567 ...
## ..- attr(*, "names")= chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## $ values : num [1:32] 6.37 2.96 2.4 1.74 0.78 ...
## $ e.values : num [1:32] 6.91 3.5 2.96 2.28 1.45 ...
## $ loadings : 'loadings' num [1:32, 1:4] 0.6746 -0.0849 -0.1127 0.0182 -0.5985 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
## $ model : num [1:32, 1:32] 0.4798 -0.2256 -0.2024 -0.0929 -0.4516 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## $ fm : chr "minres"
## $ rot.mat : num [1:4, 1:4] 0.534 -0.503 -0.279 -0.764 0.308 ...
## $ Phi : num [1:4, 1:4] 1 -0.142 -0.375 0.18 -0.142 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
## .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
## $ Structure : 'loadings' num [1:32, 1:4] 0.6895 -0.3145 -0.2541 -0.0802 -0.6464 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
## $ method : chr "regression"
## $ R2.scores : Named num [1:4] 0.886 0.867 0.866 0.821
## ..- attr(*, "names")= chr [1:4] "MR1" "MR2" "MR4" "MR3"
## $ r : num [1:32, 1:32] 1 -0.2094 -0.1772 -0.0945 -0.4466 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## $ fn : chr "fa"
## $ Vaccounted : num [1:5, 1:4] 3.973 0.124 0.124 0.295 0.295 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5] "SS loadings" "Proportion Var" "Cumulative Var" "Proportion Explained" ...
## .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
## - attr(*, "class")= chr [1:2] "psych" "fa"
# Use maximum likelihood for extracting factors.
fa(hsq_polychoric, nfactors = 4, fm = "mle")
## Factor Analysis using method = ml
## Call: fa(r = hsq_polychoric, nfactors = 4, fm = "mle")
## Standardized loadings (pattern matrix) based upon correlation matrix
## ML1 ML2 ML4 ML3 h2 u2 com
## Q1 0.67 -0.05 -0.01 0.04 0.48 0.52 1.0
## Q2 -0.09 -0.03 0.61 -0.04 0.42 0.58 1.1
## Q3 -0.12 0.13 0.08 -0.50 0.35 0.65 1.3
## Q4 0.01 0.62 0.01 0.00 0.39 0.61 1.0
## Q5 -0.60 0.07 0.08 -0.01 0.42 0.58 1.1
## Q6 -0.28 -0.03 0.42 -0.03 0.34 0.66 1.8
## Q7 -0.17 -0.03 -0.01 0.61 0.37 0.63 1.2
## Q8 -0.03 0.76 0.01 0.01 0.58 0.42 1.0
## Q9 0.47 -0.12 0.00 0.03 0.26 0.74 1.1
## Q10 0.05 0.05 0.77 -0.03 0.58 0.42 1.0
## Q11 0.13 0.00 0.15 -0.53 0.29 0.71 1.3
## Q12 -0.14 0.66 -0.04 0.05 0.45 0.55 1.1
## Q13 -0.68 0.00 0.11 0.01 0.53 0.47 1.1
## Q14 -0.18 -0.02 0.64 0.02 0.51 0.49 1.2
## Q15 0.04 -0.05 0.07 0.69 0.50 0.50 1.0
## Q16 0.10 -0.53 0.11 0.15 0.33 0.67 1.3
## Q17 0.76 -0.03 0.04 0.05 0.58 0.42 1.0
## Q18 0.11 0.00 0.80 0.00 0.58 0.42 1.0
## Q19 -0.14 0.15 0.21 -0.40 0.34 0.66 2.2
## Q20 0.11 0.79 -0.01 -0.07 0.64 0.36 1.1
## Q21 -0.66 0.13 0.13 0.17 0.55 0.45 1.3
## Q22 0.11 0.06 -0.29 0.13 0.14 0.86 1.8
## Q23 0.11 0.04 -0.01 0.51 0.29 0.71 1.1
## Q24 0.20 0.49 0.08 0.05 0.27 0.73 1.4
## Q25 0.77 0.06 -0.01 0.02 0.59 0.41 1.0
## Q26 -0.08 0.03 0.68 0.04 0.52 0.48 1.0
## Q27 0.09 0.07 0.11 -0.51 0.28 0.72 1.2
## Q28 -0.10 0.27 0.24 -0.13 0.23 0.77 2.8
## Q29 0.59 0.18 0.06 0.16 0.38 0.62 1.4
## Q30 -0.02 -0.13 0.51 0.00 0.25 0.75 1.1
## Q31 0.10 0.07 0.07 0.70 0.51 0.49 1.1
## Q32 -0.08 0.67 0.06 0.02 0.50 0.50 1.0
##
## ML1 ML2 ML4 ML3
## SS loadings 4.02 3.30 3.38 2.75
## Proportion Var 0.13 0.10 0.11 0.09
## Cumulative Var 0.13 0.23 0.33 0.42
## Proportion Explained 0.30 0.25 0.25 0.20
## Cumulative Proportion 0.30 0.54 0.80 1.00
##
## With factor correlations of
## ML1 ML2 ML4 ML3
## ML1 1.00 -0.14 -0.38 0.18
## ML2 -0.14 1.00 0.25 -0.17
## ML4 -0.38 0.25 1.00 -0.05
## ML3 0.18 -0.17 -0.05 1.00
##
## Mean item complexity = 1.3
## Test of the hypothesis that 4 factors are sufficient.
##
## The degrees of freedom for the null model are 496 and the objective function was 12.74
## The degrees of freedom for the model are 374 and the objective function was 2.18
##
## The root mean square of the residuals (RMSR) is 0.04
## The df corrected root mean square of the residuals is 0.05
##
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy
## ML1 ML2 ML4 ML3
## Correlation of (regression) scores with factors 0.94 0.93 0.93 0.91
## Multiple R square of scores with factors 0.89 0.87 0.87 0.82
## Minimum correlation of possible factor scores 0.78 0.74 0.74 0.64
You have successfully used two extraction methods and created your first EFA models! Observe the communalities and keep in mind that variables with a relatively high communality are accounted for fairly well by the four factors you chose to extract. Recall that a variable’s communality represents the proportion of the variable’s variance that can be explained by the factors, while the unique variance is due to the variable’s own idiosyncrasy and not to the common factors.
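As a quick sanity check (a sketch using the f_hsq object created above), communality and uniqueness partition each item’s standardised variance and should therefore sum to approximately 1:
# Communality + uniqueness ~ 1 for every item.
head(f_hsq$communality + f_hsq$uniquenesses)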
Let’s look at another popular extraction method, Principal Axis Factoring (PAF). PAF’s main idea is that communality plays a central role in extracting factors, since it can be interpreted as a measure of an item’s relation to all other items. The approach is iterative: it starts from an initial estimate of the common variance in which the communalities are less than 1, replaces the main diagonal of the correlation matrix (which usually consists of ones) with these communality estimates, re-estimates the communalities from the updated matrix, and repeats these replacements until a maximum number of iterations is reached or the communalities converge, i.e. the difference between two consecutive estimates becomes negligible.
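Before running PAF with fa(), here is a minimal, illustrative sketch of that iteration (this is not psych’s implementation, and the function name paf_sketch is made up for illustration; it assumes the correlation matrix is invertible):
# Illustrative sketch of principal axis factoring by iterated communalities.
paf_sketch <- function(R_mat, n_factors, max_iter = 100, tol = 1e-4) {
  # Initial communality estimates: squared multiple correlations of each item.
  h2 <- 1 - 1 / diag(solve(R_mat))
  for (i in seq_len(max_iter)) {
    R_reduced <- R_mat
    diag(R_reduced) <- h2                        # replace the 1s on the diagonal
    eig <- eigen(R_reduced, symmetric = TRUE)
    loadings <- eig$vectors[, 1:n_factors, drop = FALSE] %*%
      diag(sqrt(pmax(eig$values[1:n_factors], 0)), n_factors)
    h2_new <- rowSums(loadings^2)                # updated communalities
    if (max(abs(h2_new - h2)) < tol) break       # stop once estimates stabilise
    h2 <- h2_new
  }
  list(loadings = loadings, communality = h2_new, iterations = i)
}
# e.g. round(paf_sketch(hsq_polychoric, n_factors = 4)$communality, 2)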
# Use PAF on hsq_polychoric.
hsq_correl_pa <- fa(hsq_polychoric, nfactors = 4, fm = "pa")
# Sort the communalities of hsq_correl_pa.
sort(hsq_correl_pa$communality, decreasing = TRUE)
## Q20 Q17 Q25 Q18 Q8 Q10 Q21
## 0.6126774 0.5915220 0.5837439 0.5583484 0.5575454 0.5499162 0.5374533
## Q26 Q14 Q32 Q13 Q31 Q15 Q1
## 0.5246559 0.5189606 0.5184342 0.5011976 0.5003349 0.4972556 0.4790009
## Q12 Q5 Q2 Q4 Q29 Q7 Q6
## 0.4570093 0.4327538 0.4079485 0.4069703 0.3949128 0.3650824 0.3649526
## Q3 Q19 Q16 Q11 Q27 Q24 Q23
## 0.3634246 0.3472616 0.3225140 0.3018913 0.2992410 0.2866446 0.2719056
## Q30 Q9 Q28 Q22
## 0.2709174 0.2671010 0.2415143 0.1277128
# Sort the uniqueness of hsq_correl_pa.
sort(hsq_correl_pa$uniqueness, decreasing = TRUE)
## Q22 Q28 Q9 Q30 Q23 Q24 Q27
## 0.8722872 0.7584857 0.7328990 0.7290826 0.7280944 0.7133554 0.7007590
## Q11 Q16 Q19 Q3 Q6 Q7 Q29
## 0.6981087 0.6774860 0.6527384 0.6365754 0.6350474 0.6349176 0.6050872
## Q4 Q2 Q5 Q12 Q1 Q15 Q31
## 0.5930297 0.5920515 0.5672462 0.5429907 0.5209991 0.5027444 0.4996651
## Q13 Q32 Q14 Q26 Q21 Q10 Q8
## 0.4988024 0.4815658 0.4810394 0.4753441 0.4625467 0.4500838 0.4424546
## Q18 Q25 Q17 Q20
## 0.4416516 0.4162561 0.4084780 0.3873226
You have created an EFA model with PAF, a method that we did NOT see in the video! PAF is one of the most commonly used methods of extraction in EFA.
Let’s briefly visit the three tests for deciding on the number of factors to retain.
In this exercise, you will use the correlation matrix, hsq_polychoric, computed from the initial dataset, hsq. Since fa.parallel() only sees a correlation matrix, you also pass the number of observations via its n.obs argument.
# Check out the scree test and the Kaiser-Guttman criterion.
scree(hsq_polychoric)
# Use parallel analysis for estimation with the minres extraction method.
fa.parallel(hsq_polychoric, n.obs = 1069, fm = "minres", fa = "fa")
## Parallel analysis suggests that the number of factors = 7 and the number of components = NA
# Use parallel analysis for estimation with the mle extraction method.
fa.parallel(hsq_polychoric, n.obs = 1069, fm = "mle", fa = "fa")
## Parallel analysis suggests that the number of factors = 7 and the number of components = NA
You successfully applied the tests for determining the right number of factors. (Although not necessary for this exercise, fa.parallel() also accepts the original data matrix, hsq, as its argument. In that case, you would have to set the cor argument of fa.parallel() to "poly" so that it computes the polychoric correlation matrix itself.)
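That alternative call might look like the following sketch (computing polychoric correlations from the raw responses can take a while):
# Hand fa.parallel() the raw items and let it build the polychoric matrix.
fa.parallel(hsq[, 1:32], fm = "minres", fa = "fa", cor = "poly")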
Based on the three tests that you conducted in the previous exercise, how many factors would you decide to retain? The answer is 4
According to Martin et al. (2003), the HSQ represents a comprehensive self-report measure of everyday functions of humor. Following Martin et al.’s (2003) theory, humor is directly related to psychosocial well-being, i.e. humor is a social phenomenon and the different humor styles reflect different social traits of the individual, such as social control, status maintenance and group cohesion.
Let’s apply some of the rotation methods offered by the fa() function, in order to check whether Martin et al.’s (2003) theory is confirmed by the observations in the HSQ dataset. For completing the exercise, the EFA model object f_hsq and the correlation matrix hsq_polychoric are at your disposal. Recall that hsq_polychoric was calculated with the mixedCor() function on our initial dataset, hsq.
# Check the default rotation method.
f_hsq$rotation
## [1] "oblimin"
# Try Promax with 4 factors.
f_hsq_promax <- fa(hsq_polychoric, nfactors = 4, rotate = "promax")
# Now, try Varimax, again with 4 factors.
f_hsq_varimax <- fa(hsq_polychoric, nfactors = 4, rotate = "varimax")
Now you can compare the two objects, f_hsq_promax and f_hsq_varimax, on the R console and see how their loadings differ, as sketched below.
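A quick comparison sketch: print both loading matrices with a cut-off and inspect the factor correlation matrix, which only an oblique rotation such as promax produces:
print(f_hsq_promax$loadings, cut = 0.3)   # oblique: factors are allowed to correlate
print(f_hsq_varimax$loadings, cut = 0.3)  # orthogonal: factors kept uncorrelated
round(f_hsq_promax$Phi, 2)                # factor correlations (not returned for varimax)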
Based on the last video and the HSQ dataset at hand, which rotation method is most suitable for arriving at the most interpretable EFA model?
Oblimin
The answer is Oblimin: EFA on the HSQ dataset has shown that the factors are correlated, so an oblique rotation is the appropriate choice. Oblique rotations are more flexible than orthogonal rotations because the factor axes can take up any position in factor space.
The study of factor loadings offered by our EFA models is a valuable tool for interpreting the various humor styles. Even more helpful is the guidance of the path diagram. In this exercise, you are asked to retrieve both f_hsq’s factor loadings matrix and its path diagram. Recall that the loaded f_hsq object represents an EFA model with 4 factors. Below are the questionnaire items as they were initially grouped, based on the type of humor that they encode:
affiliative: Q1, Q5, Q9, Q13, Q17, Q21, Q25, Q29
self-enhancing: Q2, Q6, Q10, Q14, Q18, Q22, Q26, Q30
aggressive: Q3, Q7, Q11, Q15, Q19, Q23, Q27, Q31
self-defeating: Q4, Q8, Q12, Q16, Q20, Q24, Q28, Q32
# Check the factor loadings.
print(f_hsq$loadings, cut = 0)
##
## Loadings:
## MR1 MR2 MR4 MR3
## Q1 0.675 -0.055 -0.005 0.029
## Q2 -0.085 -0.023 0.604 -0.034
## Q3 -0.113 0.130 0.082 -0.512
## Q4 0.018 0.635 0.023 0.002
## Q5 -0.599 0.066 0.095 -0.015
## Q6 -0.257 -0.038 0.462 -0.040
## Q7 -0.166 -0.030 0.002 0.607
## Q8 -0.027 0.741 0.009 0.007
## Q9 0.485 -0.124 0.011 0.016
## Q10 0.034 0.056 0.736 -0.020
## Q11 0.139 -0.018 0.163 -0.541
## Q12 -0.142 0.663 -0.045 0.054
## Q13 -0.641 0.006 0.143 0.005
## Q14 -0.173 -0.018 0.644 0.022
## Q15 0.055 -0.044 0.078 0.688
## Q16 0.123 -0.523 0.118 0.126
## Q17 0.769 -0.035 0.043 0.050
## Q18 0.107 0.005 0.780 -0.004
## Q19 -0.122 0.141 0.229 -0.412
## Q20 0.099 0.779 -0.021 -0.082
## Q21 -0.641 0.131 0.153 0.156
## Q22 0.136 0.059 -0.267 0.105
## Q23 0.124 0.036 0.004 0.491
## Q24 0.218 0.509 0.083 0.042
## Q25 0.761 0.061 -0.020 0.007
## Q26 -0.078 0.033 0.685 0.032
## Q27 0.103 0.058 0.114 -0.530
## Q28 -0.075 0.272 0.248 -0.155
## Q29 0.607 0.182 0.068 0.151
## Q30 0.005 -0.133 0.538 -0.008
## Q31 0.107 0.066 0.080 0.693
## Q32 -0.074 0.694 0.051 0.023
##
## MR1 MR2 MR4 MR3
## SS loadings 3.768 3.241 3.233 2.688
## Proportion Var 0.118 0.101 0.101 0.084
## Cumulative Var 0.118 0.219 0.320 0.404
# Create the path diagram of the latent factors.
fa.diagram(f_hsq)
Now you have all the means for interpreting which factors underlie the HSQ and, therefore, for decoding what the groupings of variables that load on these factors represent! Note also the curved arrows mentioned in the video, which connect factors that are correlated with each other.
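If you want to quantify those curved arrows, the factor correlations behind them are stored in the Phi element of the fa object:
# Factor correlation matrix underlying the curved arrows of the path diagram.
round(f_hsq$Phi, 2)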
Let’s now run the whole EFA workflow on a second dataset, the Short Dark Triad (SD3) personality test, which we load below from its URL.
url2 <- 'https://assets.datacamp.com/production/course_4249/datasets/SD3.RDS'
sdt_test <- readRDS(gzcon(url(url2)))
dim(sdt_test)
## [1] 100 27
sdt_sub_correl <- hetcor(sdt_test)
# Explore sdt_sub_correl.
str(sdt_sub_correl)
## List of 7
## $ correlations: num [1:27, 1:27] 1 0.184 0.102 0.217 0.369 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
## .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
## $ type : chr [1:27, 1:27] "" "Pearson" "Pearson" "Pearson" ...
## $ NA.method : chr "complete.obs"
## $ ML : logi FALSE
## $ std.errors : num [1:27, 1:27] 0 0.0969 0.0993 0.0956 0.0868 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
## .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
## $ n : int 100
## $ tests : num [1:27, 1:27] 0.00 5.78e-13 1.55e-16 8.63e-14 4.36e-14 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
## .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
## - attr(*, "class")= chr "hetcor"
# Get the correlation matrix from sdt_sub_correl.
sdt_polychoric <- sdt_sub_correl$correlations
# Apply the Bartlett test on the correlation matrix.
cortest.bartlett(sdt_polychoric)
## Warning in cortest.bartlett(sdt_polychoric): n not specified, 100 used
## $chisq
## [1] 1019.442
##
## $p.value
## [1] 2.054927e-66
##
## $df
## [1] 351
# Check the KMO index.
KMO(sdt_polychoric)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = sdt_polychoric)
## Overall MSA = 0.82
## MSA for each item =
## M1 M2 M3 M4 M5 M6 M7 M8 M9 N1 N2 N3 N4 N5 N6
## 0.78 0.84 0.80 0.66 0.91 0.84 0.68 0.77 0.79 0.80 0.82 0.83 0.87 0.85 0.84
## N7 N8 N9 P1 P2 P3 P4 P5 P6 P7 P8 P9
## 0.80 0.81 0.89 0.89 0.64 0.87 0.52 0.81 0.88 0.52 0.63 0.85
Now you have done the first step of EFA on the SD3 data. What do the factorability tests suggest? Could we further reduce the dimensionality of the dataset and still get some reliable insights?
Now, it is time to follow the next two steps; namely, to initially extract factors and then decide how many of them express the underlying patterns. For this exercise, we have loaded the correlation matrix object, sdt_polychoric, for you.
# Check out the scree test and the Kaiser-Guttman criterion
scree(sdt_polychoric)
# Use parallel analysis for estimation with the minres extraction method
fa.parallel(sdt_polychoric, n.obs = 100, fa = "fa")
## Parallel analysis suggests that the number of factors = 5 and the number of components = NA
# Perform EFA with MLE
fa(sdt_polychoric, fm = "ml", nfactors = 4)
## Factor Analysis using method = ml
## Call: fa(r = sdt_polychoric, nfactors = 4, fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
## ML1 ML4 ML2 ML3 h2 u2 com
## M1 0.00 0.04 0.58 -0.19 0.343 0.66 1.2
## M2 0.24 0.41 0.19 0.15 0.511 0.49 2.4
## M3 -0.02 0.65 0.02 0.09 0.464 0.54 1.0
## M4 0.03 0.33 0.25 -0.13 0.220 0.78 2.3
## M5 0.18 0.18 0.55 0.07 0.581 0.42 1.5
## M6 0.06 -0.10 0.85 0.05 0.749 0.25 1.0
## M7 0.10 0.17 0.44 -0.45 0.402 0.60 2.4
## M8 0.50 0.26 -0.03 -0.18 0.378 0.62 1.8
## M9 0.05 0.32 0.45 0.04 0.445 0.56 1.9
## N1 0.08 0.20 0.03 0.41 0.299 0.70 1.6
## N2 0.04 -0.16 -0.11 -0.50 0.343 0.66 1.3
## N3 0.22 0.06 0.01 0.62 0.529 0.47 1.3
## N4 -0.01 0.44 0.16 0.37 0.493 0.51 2.2
## N5 -0.06 0.58 0.11 0.17 0.432 0.57 1.3
## N6 -0.30 -0.30 0.10 -0.36 0.434 0.57 3.1
## N7 -0.19 0.35 0.22 0.22 0.243 0.76 3.1
## N8 -0.20 -0.06 -0.28 -0.33 0.375 0.62 2.7
## N9 0.75 0.00 0.01 -0.02 0.571 0.43 1.0
## P1 0.41 0.01 0.30 0.05 0.393 0.61 1.9
## P2 0.00 -0.13 -0.09 -0.21 0.098 0.90 2.0
## P3 0.40 -0.01 0.22 0.02 0.286 0.71 1.6
## P4 0.02 0.10 -0.11 0.32 0.123 0.88 1.5
## P5 0.56 0.03 0.08 0.07 0.394 0.61 1.1
## P6 0.63 -0.05 0.17 0.14 0.574 0.43 1.3
## P7 -0.42 0.13 0.19 -0.02 0.126 0.87 1.6
## P8 0.10 0.59 -0.18 -0.28 0.364 0.64 1.7
## P9 0.26 0.53 -0.05 0.08 0.471 0.53 1.5
##
## ML1 ML4 ML2 ML3
## SS loadings 2.93 2.91 2.72 2.09
## Proportion Var 0.11 0.11 0.10 0.08
## Cumulative Var 0.11 0.22 0.32 0.39
## Proportion Explained 0.28 0.27 0.26 0.20
## Cumulative Proportion 0.28 0.55 0.80 1.00
##
## With factor correlations of
## ML1 ML4 ML2 ML3
## ML1 1.00 0.41 0.46 0.24
## ML4 0.41 1.00 0.30 0.25
## ML2 0.46 0.30 1.00 0.19
## ML3 0.24 0.25 0.19 1.00
##
## Mean item complexity = 1.8
## Test of the hypothesis that 4 factors are sufficient.
##
## The degrees of freedom for the null model are 351 and the objective function was 11.43
## The degrees of freedom for the model are 249 and the objective function was 3.31
##
## The root mean square of the residuals (RMSR) is 0.06
## The df corrected root mean square of the residuals is 0.07
##
## Fit based upon off diagonal values = 0.95
## Measures of factor score adequacy
## ML1 ML4 ML2 ML3
## Correlation of (regression) scores with factors 0.91 0.90 0.92 0.87
## Multiple R square of scores with factors 0.83 0.81 0.85 0.75
## Minimum correlation of possible factor scores 0.66 0.62 0.71 0.51
Notice that although we chose to retain 4 factors for our EFA model, the questionnaire items of the SD3 data are assumed to express 3 personality traits. EFA is an exploratory method, so we could rebuild the model with 3 factors and check how well it fits our intuitions; a sketch of that rebuild follows below. Now, only two more steps are left to complete your EFA!
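A sketch of the 3-factor rebuild, to compare against the 4-factor solution above:
# Rebuild the EFA model with 3 factors and inspect its loadings.
f_sdt_3 <- fa(sdt_polychoric, nfactors = 3)
print(f_sdt_3$loadings, cut = 0.3)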
Finally, interpreting an EFA means focusing on the factor loadings and preparing the path diagram. By observing the arrow connections between the underlying factors and the observed variables in the path diagram, you can clearly trace the variable groupings.
Observe the path diagram and try to draw conclusions about the underlying factors in the dataset. Do the twenty-seven statements of the Short Dark Triad test correspond well to the three personality traits: Machiavellianism (a manipulative attitude), narcissism (excessive self-love), and psychopathy (lack of empathy)?
# EFA with 4 factors.
f_sdt <- fa(sdt_polychoric, nfactors = 4)
# Check the factor loadings.
print(f_sdt$loadings, cut=0)
##
## Loadings:
## MR1 MR4 MR3 MR2
## M1 0.013 0.594 -0.015 -0.117
## M2 0.229 0.243 0.343 0.199
## M3 -0.006 -0.021 0.733 0.036
## M4 0.060 0.235 0.354 -0.162
## M5 0.246 0.476 0.136 0.130
## M6 0.152 0.723 -0.124 0.106
## M7 0.036 0.571 0.147 -0.408
## M8 0.492 -0.014 0.334 -0.231
## M9 0.012 0.542 0.193 0.159
## N1 -0.008 0.117 0.159 0.431
## N2 0.001 -0.037 -0.094 -0.563
## N3 0.215 -0.038 0.068 0.596
## N4 -0.070 0.181 0.433 0.394
## N5 -0.056 0.124 0.530 0.194
## N6 -0.227 0.037 -0.274 -0.377
## N7 -0.209 0.274 0.254 0.294
## N8 -0.208 -0.266 -0.007 -0.362
## N9 0.711 0.037 0.051 -0.032
## P1 0.416 0.321 -0.063 0.128
## P2 -0.012 -0.109 -0.036 -0.278
## P3 0.432 0.144 0.040 0.011
## P4 0.112 -0.255 0.122 0.293
## P5 0.552 0.083 -0.004 0.119
## P6 0.604 0.183 -0.028 0.139
## P7 -0.386 0.141 0.151 -0.049
## P8 0.018 -0.004 0.494 -0.186
## P9 0.316 -0.119 0.563 0.079
##
## MR1 MR4 MR3 MR2
## SS loadings 2.373 2.324 2.241 2.023
## Proportion Var 0.088 0.086 0.083 0.075
## Cumulative Var 0.088 0.174 0.257 0.332
# Create the path diagram of the latent factors.
fa.diagram(f_sdt)
The path diagram shows that, out of the 4 factors we extracted, only one is clearly mapped to Machiavellianism (the M items). One factor is isolated and the other two express a mixture of personality traits. To practice your newly-acquired knowledge, you could take different samples of the SD3 dataset, build your EFA models and see whether the interpretation you drew for the current model stays the same, as sketched below!
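One way to run that check is sketched below (the seed and bootstrap resample are arbitrary choices for illustration): refit the model on a resample of the respondents and compare the two sets of loadings with factor.congruence().
# Robustness sketch: refit EFA on a bootstrap sample of respondents and
# compare the factor loadings with the original solution.
set.seed(1)
boot_rows <- sample(nrow(sdt_test), replace = TRUE)
sdt_boot_correl <- hetcor(sdt_test[boot_rows, ])$correlations
f_sdt_boot <- fa(sdt_boot_correl, nfactors = 4)
factor.congruence(f_sdt$loadings, f_sdt_boot$loadings)  # values near 1 = stable factors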
Our course has come to its end. You are now in a position to choose between the most well-known dimensionality reduction techniques for a range of different data types. You have solved a number of challenging code exercises, and you are able to use R to solve a group of data science problems related to high-dimensional datasets. Our last multiple-choice exercise asks you to answer a simple (by now) but critical question that you will face many times in the future: when do we use each of the taught methods?
Check the option that provides the correct and most accurate answer to the following question:
When the data is numeric and you choose to retain those components that explain a big part of the whole variance in the data, you can use PCA. When the data is numeric and you need a computationally more efficient solution, you can use N-NMF. When the data is numeric but you need an expert’s domain knowledge for the parametrisation and the interpretation, you can use EFA.
All 3 statements characterise the methods appropriately! You are now ready to start your journey of applying the right dimensionality reduction methods, with the right amount of confidence, to the multivariate datasets that you will come across!