Become familiar with exploratory factor analysis (EFA), another dimensionality reduction technique that is a natural extension to PCA.
Two data frames - hsq and hsq_correl - have been loaded. hsq contains the Humor Styles Questionnaire [HSQ] dataset, which includes responses from 1071 participants on 32 questions. We also calculated the polychoric correlation for you using the mixedCor() function of the psych package:
hsq_correl <- mixedCor(hsq, c = NULL, p = 1:32)
Polychoric correlation is the correlation between ordinal variables. Above, we indicated that columns 1 to 32 are ordinal by specifying p, and that there are no numeric variables by setting c to NULL.
Another way of calculating polychoric correlations is by using the hetcor() function of the polycor package. It stores the correlation matrix in the correlations attribute of the calculated object.
library(psych)
library(tidyverse)
library(DT)
# Check Dataset
url <- "https://assets.datacamp.com/production/course_4249/datasets/humor_dataset.csv"
hsq <- read.csv(url, sep = ";")
datatable(head(hsq))
# Check out the dimensionality of hsq.
dim(hsq)
## [1] 1071 39
hsq_correl <- mixedCor(hsq, c=NULL, p=1:32)
## Warning in matpLower(x, nvar, gminx, gmaxx, gminy, gmaxy): 496 cells were
## adjusted for 0 values using the correction for continuity. Examine your
## data carefully.
str(hsq_correl)
## List of 6
## $ rho : num [1:32, 1:32] 1 -0.2094 -0.1772 -0.0945 -0.4466 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## $ rx : NULL
## $ poly :List of 4
## ..$ rho : num [1:32, 1:32] 1 -0.2094 -0.1772 -0.0945 -0.4466 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## ..$ tau : num [1:32, 1:6] -2.77 -2.77 -2.9 -3.11 -2.9 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. .. ..$ : chr [1:6] "1" "2" "3" "4" ...
## ..$ n.obs: int 1071
## ..$ Call : language polychoric(x = data[, p], smooth = smooth, global = global, weight = weight, correct = correct)
## ..- attr(*, "class")= chr [1:2] "psych" "poly"
## $ tetra:List of 2
## ..$ rho: NULL
## ..$ tau: NULL
## $ rpd : NULL
## $ Call : language mixedCor(data = hsq, c = NULL, p = 1:32)
## - attr(*, "class")= chr [1:2] "psych" "mixed"
# Getting the correlation matrix of the dataset.
hsq_polychoric <- hsq_correl$rho
# Explore the correlation structure of the dataset.
library(ggcorrplot)
ggcorrplot(hsq_polychoric)
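As a hedged sketch of the hetcor() alternative mentioned above (it assumes the 32 items are the first 32 columns of hsq; hetcor() only computes polychoric correlations for ordered factors, hence the coercion), the resulting matrix should closely match hsq_polychoric:
library(polycor)
# Coerce the items to ordered factors, then compute the heterogeneous
# correlation matrix; the polychoric correlations live in $correlations.
hsq_items <- as.data.frame(lapply(hsq[, 1:32], ordered))
hsq_hetcor <- hetcor(hsq_items)
hsq_polychoric_alt <- hsq_hetcor$correlations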
As mentioned in the video, before reducing dimensions with EFA, we first need to make sure that our dataset is factorable. In other words, the first step in performing EFA is to check whether it is even worth doing. This dilemma is captured by the following question: Is there sufficient correlation among the observed variables of our dataset to allow for dimensionality reduction in the first place?
hsq_polychoric, calculated above with the mixedCor() function of the psych package, is the correlation matrix of the Humor Styles Questionnaire [HSQ] dataset that you will be working with throughout this chapter and part of the next.
In this exercise, your mission is to decide whether HSQ is factorable enough to allow an EFA.
# Apply the Bartlett test on the correlation matrix.
library(polycor)
##
## Attaching package: 'polycor'
## The following object is masked from 'package:psych':
##
## polyserial
cortest.bartlett(hsq_polychoric)
## Warning in cortest.bartlett(hsq_polychoric): n not specified, 100 used
## $chisq
## [1] 1114.409
##
## $p.value
## [1] 1.610583e-49
##
## $df
## [1] 496
# Check the KMO index.
KMO(hsq_polychoric)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = hsq_polychoric)
## Overall MSA = 0.87
## MSA for each item =
## Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15
## 0.94 0.93 0.91 0.90 0.91 0.88 0.82 0.86 0.95 0.86 0.78 0.90 0.85 0.93 0.82
## Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27 Q28 Q29 Q30
## 0.85 0.87 0.83 0.89 0.83 0.87 0.84 0.81 0.84 0.83 0.89 0.83 0.93 0.87 0.81
## Q31 Q32
## 0.81 0.91
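Notice the warning from cortest.bartlett(): since we only supplied a correlation matrix, it fell back to a default sample size of n = 100. A more faithful call (a sketch; the chi-square and p-value change, the conclusion does not) passes the real number of respondents:
# Pass the actual sample size so the chi-square test is computed correctly.
cortest.bartlett(hsq_polychoric, n = nrow(hsq))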
The p.value attribute of cortest.bartlett()’s output is far below 0.05, and the overall MSA attribute of KMO()’s output, 0.87, is close to 1. Both results therefore suggest that the dataset is factorable and that EFA is worth pursuing.
It is about time to turn our attention to creating our first EFA model object. As mentioned in the video, the default extraction method in the fa() function of the psych package is minimum residuals (minres). Recall that fa() takes as its main argument either a data frame or a correlation matrix. In this exercise, your job is to manipulate the values of the function’s default arguments and inspect the EFA model object’s attributes to find out which of the variables load well on the 4 factors that we choose to extract below.
For conducting EFA, you will use the correlation matrix hsq_polychoric, loaded in your workspace and calculated with the mixedCor() function on our initial dataset, hsq.
# EFA with 4 factors.
f_hsq <- fa(hsq_polychoric, nfactors = 4)
## Loading required namespace: GPArotation
# Inspect the resulting EFA object.
str(f_hsq)
## List of 44
## $ residual : num [1:32, 1:32] 0.5202 0.01625 0.02522 -0.00162 0.00493 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## $ dof : num 374
## $ ENull : num NA
## $ chi : num NA
## $ rms : num 0.0411
## $ nh : logi NA
## $ EPVAL : num NA
## $ crms : num 0.0473
## $ EBIC : num NA
## $ ESABIC : num NA
## $ fit : num 0.849
## $ fit.off : num 0.969
## $ sd : num 0.0397
## $ factors : num 4
## $ complexity : Named num [1:32] 1.02 1.05 1.29 1 1.08 ...
## ..- attr(*, "names")= chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## $ n.obs : logi NA
## $ PVAL : logi NA
## $ objective : num 2.19
## $ criteria : Named num [1:3] 2.19 NA NA
## ..- attr(*, "names")= chr [1:3] "objective" "" ""
## $ Call : language fa(r = hsq_polychoric, nfactors = 4)
## $ null.model : num 12.7
## $ null.dof : num 496
## $ r.scores : num [1:4, 1:4] 1 -0.166 -0.434 0.212 -0.166 ...
## $ R2 : num [1:4] 0.886 0.867 0.866 0.821
## $ valid : num [1:4] 0.93 0.915 0.902 0.889
## $ score.cor : num [1:4, 1:4] 1 -0.198 -0.44 0.229 -0.198 ...
## $ weights : num [1:32, 1:4] 0.14289 -0.01175 -0.01633 0.00872 -0.13143 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
## $ rotation : chr "oblimin"
## $ communality : Named num [1:32] 0.48 0.408 0.363 0.407 0.433 ...
## ..- attr(*, "names")= chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## $ communalities: Named num [1:32] 0.48 0.408 0.363 0.407 0.433 ...
## ..- attr(*, "names")= chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## $ uniquenesses : Named num [1:32] 0.52 0.592 0.637 0.593 0.567 ...
## ..- attr(*, "names")= chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## $ values : num [1:32] 6.37 2.96 2.4 1.74 0.78 ...
## $ e.values : num [1:32] 6.91 3.5 2.96 2.28 1.45 ...
## $ loadings : 'loadings' num [1:32, 1:4] 0.6746 -0.0849 -0.1127 0.0182 -0.5985 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
## $ model : num [1:32, 1:32] 0.4798 -0.2256 -0.2024 -0.0929 -0.4516 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## $ fm : chr "minres"
## $ rot.mat : num [1:4, 1:4] 0.534 -0.503 -0.279 -0.764 0.308 ...
## $ Phi : num [1:4, 1:4] 1 -0.142 -0.375 0.18 -0.142 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
## .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
## $ Structure : 'loadings' num [1:32, 1:4] 0.6895 -0.3145 -0.2541 -0.0802 -0.6464 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
## $ method : chr "regression"
## $ R2.scores : Named num [1:4] 0.886 0.867 0.866 0.821
## ..- attr(*, "names")= chr [1:4] "MR1" "MR2" "MR4" "MR3"
## $ r : num [1:32, 1:32] 1 -0.2094 -0.1772 -0.0945 -0.4466 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## .. ..$ : chr [1:32] "Q1" "Q2" "Q3" "Q4" ...
## $ fn : chr "fa"
## $ Vaccounted : num [1:5, 1:4] 3.973 0.124 0.124 0.295 0.295 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5] "SS loadings" "Proportion Var" "Cumulative Var" "Proportion Explained" ...
## .. ..$ : chr [1:4] "MR1" "MR2" "MR4" "MR3"
## - attr(*, "class")= chr [1:2] "psych" "fa"
# Use maximum likelihood for extracting factors.
fa(hsq_polychoric, nfactors = 4, fm = "mle")
## Factor Analysis using method = ml
## Call: fa(r = hsq_polychoric, nfactors = 4, fm = "mle")
## Standardized loadings (pattern matrix) based upon correlation matrix
## ML1 ML2 ML4 ML3 h2 u2 com
## Q1 0.67 -0.05 -0.01 0.04 0.48 0.52 1.0
## Q2 -0.09 -0.03 0.61 -0.04 0.42 0.58 1.1
## Q3 -0.12 0.13 0.08 -0.50 0.35 0.65 1.3
## Q4 0.01 0.62 0.01 0.00 0.39 0.61 1.0
## Q5 -0.60 0.07 0.08 -0.01 0.42 0.58 1.1
## Q6 -0.28 -0.03 0.42 -0.03 0.34 0.66 1.8
## Q7 -0.17 -0.03 -0.01 0.61 0.37 0.63 1.2
## Q8 -0.03 0.76 0.01 0.01 0.58 0.42 1.0
## Q9 0.47 -0.12 0.00 0.03 0.26 0.74 1.1
## Q10 0.05 0.05 0.77 -0.03 0.58 0.42 1.0
## Q11 0.13 0.00 0.15 -0.53 0.29 0.71 1.3
## Q12 -0.14 0.66 -0.04 0.05 0.45 0.55 1.1
## Q13 -0.68 0.00 0.11 0.01 0.53 0.47 1.1
## Q14 -0.18 -0.02 0.64 0.02 0.51 0.49 1.2
## Q15 0.04 -0.05 0.07 0.69 0.50 0.50 1.0
## Q16 0.10 -0.53 0.11 0.15 0.33 0.67 1.3
## Q17 0.76 -0.03 0.04 0.05 0.58 0.42 1.0
## Q18 0.11 0.00 0.80 0.00 0.58 0.42 1.0
## Q19 -0.14 0.15 0.21 -0.40 0.34 0.66 2.2
## Q20 0.11 0.79 -0.01 -0.07 0.64 0.36 1.1
## Q21 -0.66 0.13 0.13 0.17 0.55 0.45 1.3
## Q22 0.11 0.06 -0.29 0.13 0.14 0.86 1.8
## Q23 0.11 0.04 -0.01 0.51 0.29 0.71 1.1
## Q24 0.20 0.49 0.08 0.05 0.27 0.73 1.4
## Q25 0.77 0.06 -0.01 0.02 0.59 0.41 1.0
## Q26 -0.08 0.03 0.68 0.04 0.52 0.48 1.0
## Q27 0.09 0.07 0.11 -0.51 0.28 0.72 1.2
## Q28 -0.10 0.27 0.24 -0.13 0.23 0.77 2.8
## Q29 0.59 0.18 0.06 0.16 0.38 0.62 1.4
## Q30 -0.02 -0.13 0.51 0.00 0.25 0.75 1.1
## Q31 0.10 0.07 0.07 0.70 0.51 0.49 1.1
## Q32 -0.08 0.67 0.06 0.02 0.50 0.50 1.0
##
## ML1 ML2 ML4 ML3
## SS loadings 4.02 3.30 3.38 2.75
## Proportion Var 0.13 0.10 0.11 0.09
## Cumulative Var 0.13 0.23 0.33 0.42
## Proportion Explained 0.30 0.25 0.25 0.20
## Cumulative Proportion 0.30 0.54 0.80 1.00
##
## With factor correlations of
## ML1 ML2 ML4 ML3
## ML1 1.00 -0.14 -0.38 0.18
## ML2 -0.14 1.00 0.25 -0.17
## ML4 -0.38 0.25 1.00 -0.05
## ML3 0.18 -0.17 -0.05 1.00
##
## Mean item complexity = 1.3
## Test of the hypothesis that 4 factors are sufficient.
##
## The degrees of freedom for the null model are 496 and the objective function was 12.74
## The degrees of freedom for the model are 374 and the objective function was 2.18
##
## The root mean square of the residuals (RMSR) is 0.04
## The df corrected root mean square of the residuals is 0.05
##
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy
## ML1 ML2 ML4 ML3
## Correlation of (regression) scores with factors 0.94 0.93 0.93 0.91
## Multiple R square of scores with factors 0.89 0.87 0.87 0.82
## Minimum correlation of possible factor scores 0.78 0.74 0.74 0.64
You have successfully used two extraction methods and created your first EFA models! Observe the communalities and keep in mind that variables with a relatively high communality are accounted for fairly well by the four factors you chose to extract. Recall that a variable’s communality represents the proportion of the variable’s variance that can be explained by the factors, while the unique variance is due to the variable’s own idiosyncrasy and not to the common factors.
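As a quick sanity check (a sketch using the f_hsq object created above), communality and uniqueness partition each item’s standardised variance and should therefore sum to approximately 1:
# Communality + uniqueness ~ 1 for every item.
head(f_hsq$communality + f_hsq$uniquenesses)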
Let’s look at another popular extraction method, Principal Axis Factoring (PAF). PAF’s main idea is that communality plays a central role in extracting factors, since it can be interpreted as a measure of an item’s relation to all other items. The approach is iterative: it starts from an initial estimate of the common variance in which the communalities are less than 1, replaces the main diagonal of the correlation matrix (which usually consists of ones) with these communality estimates, re-estimates the communalities from the updated matrix, and repeats these replacements until a maximum number of iterations is reached or the communalities converge, i.e. the difference between two consecutive estimates becomes negligible.
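Before running PAF with fa(), here is a minimal, illustrative sketch of that iteration (this is not psych’s implementation, and the function name paf_sketch is made up for illustration; it assumes the correlation matrix is invertible):
# Illustrative sketch of principal axis factoring by iterated communalities.
paf_sketch <- function(R_mat, n_factors, max_iter = 100, tol = 1e-4) {
  # Initial communality estimates: squared multiple correlations of each item.
  h2 <- 1 - 1 / diag(solve(R_mat))
  for (i in seq_len(max_iter)) {
    R_reduced <- R_mat
    diag(R_reduced) <- h2                        # replace the 1s on the diagonal
    eig <- eigen(R_reduced, symmetric = TRUE)
    loadings <- eig$vectors[, 1:n_factors, drop = FALSE] %*%
      diag(sqrt(pmax(eig$values[1:n_factors], 0)), n_factors)
    h2_new <- rowSums(loadings^2)                # updated communalities
    if (max(abs(h2_new - h2)) < tol) break       # stop once estimates stabilise
    h2 <- h2_new
  }
  list(loadings = loadings, communality = h2_new, iterations = i)
}
# e.g. round(paf_sketch(hsq_polychoric, n_factors = 4)$communality, 2)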
# Use PAF on hsq_polychoric.
hsq_correl_pa <- fa(hsq_polychoric, nfactors = 4, fm = "pa")
# Sort the communalities of hsq_correl_pa.
sort(hsq_correl_pa$communality, decreasing = TRUE)
## Q20 Q17 Q25 Q18 Q8 Q10 Q21
## 0.6126774 0.5915220 0.5837439 0.5583484 0.5575454 0.5499162 0.5374533
## Q26 Q14 Q32 Q13 Q31 Q15 Q1
## 0.5246559 0.5189606 0.5184342 0.5011976 0.5003349 0.4972556 0.4790009
## Q12 Q5 Q2 Q4 Q29 Q7 Q6
## 0.4570093 0.4327538 0.4079485 0.4069703 0.3949128 0.3650824 0.3649526
## Q3 Q19 Q16 Q11 Q27 Q24 Q23
## 0.3634246 0.3472616 0.3225140 0.3018913 0.2992410 0.2866446 0.2719056
## Q30 Q9 Q28 Q22
## 0.2709174 0.2671010 0.2415143 0.1277128
# Sort the uniqueness of hsq_correl_pa.
sort(hsq_correl_pa$uniqueness, decreasing = TRUE)
## Q22 Q28 Q9 Q30 Q23 Q24 Q27
## 0.8722872 0.7584857 0.7328990 0.7290826 0.7280944 0.7133554 0.7007590
## Q11 Q16 Q19 Q3 Q6 Q7 Q29
## 0.6981087 0.6774860 0.6527384 0.6365754 0.6350474 0.6349176 0.6050872
## Q4 Q2 Q5 Q12 Q1 Q15 Q31
## 0.5930297 0.5920515 0.5672462 0.5429907 0.5209991 0.5027444 0.4996651
## Q13 Q32 Q14 Q26 Q21 Q10 Q8
## 0.4988024 0.4815658 0.4810394 0.4753441 0.4625467 0.4500838 0.4424546
## Q18 Q25 Q17 Q20
## 0.4416516 0.4162561 0.4084780 0.3873226
You have created an EFA model with PAF, a method that we did NOT see in the video! PAF is one of the most commonly used methods of extraction in EFA.
Let’s briefly visit the three tests for deciding on the number of factors to retain.
In this exercise, you will use the correlation matrix, hsq_polychoric, computed from the initial dataset, hsq. Since fa.parallel() only sees a correlation matrix, you also pass the number of observations via its n.obs argument.
# Check out the scree test and the Kaiser-Guttman criterion.
scree(hsq_polychoric)
# Use parallel analysis for estimation with the minres extraction method.
fa.parallel(hsq_polychoric, n.obs = 1069, fm = "minres", fa = "fa")
## Parallel analysis suggests that the number of factors = 7 and the number of components = NA
# Use parallel analysis for estimation with the mle extraction method.
fa.parallel(hsq_polychoric, n.obs = 1069, fm = "mle", fa = "fa")
## Parallel analysis suggests that the number of factors = 7 and the number of components = NA
You successfully applied the tests for determining the right number of factors. (Although not necessary for this exercise, fa.parallel() also accepts the original data matrix, hsq, as its argument. In that case, you would have to set the cor argument of fa.parallel() to "poly" so that it computes the polychoric correlation matrix itself.)
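That alternative call might look like the following sketch (computing polychoric correlations from the raw responses can take a while):
# Hand fa.parallel() the raw items and let it build the polychoric matrix.
fa.parallel(hsq[, 1:32], fm = "minres", fa = "fa", cor = "poly")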
Based on the three tests that you conducted in the previous exercise, how many factors would you decide to retain? The answer is 4
According to Martin et al. (2003), the HSQ represents a comprehensive self-report measure of everyday functions of humor. Following Martin et al.’s (2003) theory, humor is directly related to psychosocial well-being, i.e. humor is a social phenomenon and the different humor styles reflect different social traits of the individual, such as social control, status maintenance and group cohesion.
Let’s apply some of the rotation methods offered by the fa() function, in order to check whether Martin et al.’s (2003) theory is confirmed by the observations in the HSQ dataset. For completing the exercise, the EFA model object f_hsq and the correlation matrix hsq_polychoric are at your disposal. Recall that hsq_polychoric was calculated with the mixedCor() function on our initial dataset, hsq.
# Check the default rotation method.
f_hsq$rotation
## [1] "oblimin"
# Try Promax with 4 factors.
f_hsq_promax <- fa(hsq_polychoric, nfactors = 4, rotate = "promax")
# Now, try Varimax, again with 4 factors.
f_hsq_varimax <- fa(hsq_polychoric, nfactors = 4, rotate = "varimax")
Now you can compare the two objects, f_hsq_promax and f_hsq_varimax, on the R console and see how their loadings differ, as sketched below.
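A quick comparison sketch: print both loading matrices with a cut-off and inspect the factor correlation matrix, which only an oblique rotation such as promax produces:
print(f_hsq_promax$loadings, cut = 0.3)   # oblique: factors are allowed to correlate
print(f_hsq_varimax$loadings, cut = 0.3)  # orthogonal: factors kept uncorrelated
round(f_hsq_promax$Phi, 2)                # factor correlations (not returned for varimax)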
Based on the last video and the HSQ dataset at hand, which rotation method is most suitable for arriving at the most interpretable EFA model?
Oblimin
The answer is Oblimin: EFA on the HSQ dataset has shown that the factors are correlated, so an oblique rotation is the appropriate choice. Oblique rotations are more flexible than orthogonal rotations because the factor axes can take up any position in factor space.
The study of factor loadings offered by our EFA models is a valuable tool for interpreting the various humor styles. Even more helpful is the guidance of the path diagram. In this exercise, you are asked to retrieve both f_hsq’s factor loadings matrix and its path diagram. Recall that the loaded f_hsq object represents an EFA model with 4 factors. Below are the questionnaire items as they were initially grouped, based on the type of humor that they encode:
affiliative: Q1, Q5, Q9, Q13, Q17, Q21, Q25, Q29
self-enhancing: Q2, Q6, Q10, Q14, Q18, Q22, Q26, Q30
aggressive: Q3, Q7, Q11, Q15, Q19, Q23, Q27, Q31
self-defeating: Q4, Q8, Q12, Q16, Q20, Q24, Q28, Q32
# Check the factor loadings.
print(f_hsq$loadings, cut = 0)
##
## Loadings:
## MR1 MR2 MR4 MR3
## Q1 0.675 -0.055 -0.005 0.029
## Q2 -0.085 -0.023 0.604 -0.034
## Q3 -0.113 0.130 0.082 -0.512
## Q4 0.018 0.635 0.023 0.002
## Q5 -0.599 0.066 0.095 -0.015
## Q6 -0.257 -0.038 0.462 -0.040
## Q7 -0.166 -0.030 0.002 0.607
## Q8 -0.027 0.741 0.009 0.007
## Q9 0.485 -0.124 0.011 0.016
## Q10 0.034 0.056 0.736 -0.020
## Q11 0.139 -0.018 0.163 -0.541
## Q12 -0.142 0.663 -0.045 0.054
## Q13 -0.641 0.006 0.143 0.005
## Q14 -0.173 -0.018 0.644 0.022
## Q15 0.055 -0.044 0.078 0.688
## Q16 0.123 -0.523 0.118 0.126
## Q17 0.769 -0.035 0.043 0.050
## Q18 0.107 0.005 0.780 -0.004
## Q19 -0.122 0.141 0.229 -0.412
## Q20 0.099 0.779 -0.021 -0.082
## Q21 -0.641 0.131 0.153 0.156
## Q22 0.136 0.059 -0.267 0.105
## Q23 0.124 0.036 0.004 0.491
## Q24 0.218 0.509 0.083 0.042
## Q25 0.761 0.061 -0.020 0.007
## Q26 -0.078 0.033 0.685 0.032
## Q27 0.103 0.058 0.114 -0.530
## Q28 -0.075 0.272 0.248 -0.155
## Q29 0.607 0.182 0.068 0.151
## Q30 0.005 -0.133 0.538 -0.008
## Q31 0.107 0.066 0.080 0.693
## Q32 -0.074 0.694 0.051 0.023
##
## MR1 MR2 MR4 MR3
## SS loadings 3.768 3.241 3.233 2.688
## Proportion Var 0.118 0.101 0.101 0.084
## Cumulative Var 0.118 0.219 0.320 0.404
# Create the path diagram of the latent factors.
fa.diagram(f_hsq)
Now you have all the means for interpreting which factors underlie the HSQ and, therefore, for decoding what the groupings of variables that load on these factors represent! Note also the curved arrows mentioned in the video, which connect factors that are correlated with each other.
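If you want to quantify those curved arrows, the factor correlations behind them are stored in the Phi element of the fa object:
# Factor correlation matrix underlying the curved arrows of the path diagram.
round(f_hsq$Phi, 2)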
Let’s now run the whole EFA workflow on a second dataset, the Short Dark Triad (SD3) personality test, which we load below from its URL.
url2 <- 'https://assets.datacamp.com/production/course_4249/datasets/SD3.RDS'
sdt_test <- readRDS(gzcon(url(url2)))
dim(sdt_test)
## [1] 100 27
sdt_sub_correl <- hetcor(sdt_test)
# Explore sdt_sub_correl.
str(sdt_sub_correl)
## List of 7
## $ correlations: num [1:27, 1:27] 1 0.184 0.102 0.217 0.369 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
## .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
## $ type : chr [1:27, 1:27] "" "Pearson" "Pearson" "Pearson" ...
## $ NA.method : chr "complete.obs"
## $ ML : logi FALSE
## $ std.errors : num [1:27, 1:27] 0 0.0969 0.0993 0.0956 0.0868 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
## .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
## $ n : int 100
## $ tests : num [1:27, 1:27] 0.00 5.78e-13 1.55e-16 8.63e-14 4.36e-14 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
## .. ..$ : chr [1:27] "M1" "M2" "M3" "M4" ...
## - attr(*, "class")= chr "hetcor"
# Get the correlation matrix from sdt_sub_correl.
sdt_polychoric <- sdt_sub_correl$correlations
# Apply the Bartlett test on the correlation matrix.
cortest.bartlett(sdt_polychoric)
## Warning in cortest.bartlett(sdt_polychoric): n not specified, 100 used
## $chisq
## [1] 1019.442
##
## $p.value
## [1] 2.054927e-66
##
## $df
## [1] 351
# Check the KMO index.
KMO(sdt_polychoric)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = sdt_polychoric)
## Overall MSA = 0.82
## MSA for each item =
## M1 M2 M3 M4 M5 M6 M7 M8 M9 N1 N2 N3 N4 N5 N6
## 0.78 0.84 0.80 0.66 0.91 0.84 0.68 0.77 0.79 0.80 0.82 0.83 0.87 0.85 0.84
## N7 N8 N9 P1 P2 P3 P4 P5 P6 P7 P8 P9
## 0.80 0.81 0.89 0.89 0.64 0.87 0.52 0.81 0.88 0.52 0.63 0.85
Now you have done the first step of EFA on the SD3 data. What do the factorability tests suggest? Could we further reduce the dimensionality of the dataset and still get some reliable insights?
Now, it is time to follow the next two steps; namely, to initially extract factors and then decide how many of them express the underlying patterns. For this exercise, we have loaded the correlation matrix object, sdt_polychoric, for you.
# Check out the scree test and the Kaiser-Guttman criterion
scree(sdt_polychoric)
# Use parallel analysis for estimation with the minres extraction method
fa.parallel(sdt_polychoric, n.obs = 100, fa = "fa")
## Parallel analysis suggests that the number of factors = 5 and the number of components = NA
# Perform EFA with MLE
fa(sdt_polychoric, fm = "ml", nfactors = 4)
## Factor Analysis using method = ml
## Call: fa(r = sdt_polychoric, nfactors = 4, fm = "ml")
## Standardized loadings (pattern matrix) based upon correlation matrix
## ML1 ML4 ML2 ML3 h2 u2 com
## M1 0.00 0.04 0.58 -0.19 0.343 0.66 1.2
## M2 0.24 0.41 0.19 0.15 0.511 0.49 2.4
## M3 -0.02 0.65 0.02 0.09 0.464 0.54 1.0
## M4 0.03 0.33 0.25 -0.13 0.220 0.78 2.3
## M5 0.18 0.18 0.55 0.07 0.581 0.42 1.5
## M6 0.06 -0.10 0.85 0.05 0.749 0.25 1.0
## M7 0.10 0.17 0.44 -0.45 0.402 0.60 2.4
## M8 0.50 0.26 -0.03 -0.18 0.378 0.62 1.8
## M9 0.05 0.32 0.45 0.04 0.445 0.56 1.9
## N1 0.08 0.20 0.03 0.41 0.299 0.70 1.6
## N2 0.04 -0.16 -0.11 -0.50 0.343 0.66 1.3
## N3 0.22 0.06 0.01 0.62 0.529 0.47 1.3
## N4 -0.01 0.44 0.16 0.37 0.493 0.51 2.2
## N5 -0.06 0.58 0.11 0.17 0.432 0.57 1.3
## N6 -0.30 -0.30 0.10 -0.36 0.434 0.57 3.1
## N7 -0.19 0.35 0.22 0.22 0.243 0.76 3.1
## N8 -0.20 -0.06 -0.28 -0.33 0.375 0.62 2.7
## N9 0.75 0.00 0.01 -0.02 0.571 0.43 1.0
## P1 0.41 0.01 0.30 0.05 0.393 0.61 1.9
## P2 0.00 -0.13 -0.09 -0.21 0.098 0.90 2.0
## P3 0.40 -0.01 0.22 0.02 0.286 0.71 1.6
## P4 0.02 0.10 -0.11 0.32 0.123 0.88 1.5
## P5 0.56 0.03 0.08 0.07 0.394 0.61 1.1
## P6 0.63 -0.05 0.17 0.14 0.574 0.43 1.3
## P7 -0.42 0.13 0.19 -0.02 0.126 0.87 1.6
## P8 0.10 0.59 -0.18 -0.28 0.364 0.64 1.7
## P9 0.26 0.53 -0.05 0.08 0.471 0.53 1.5
##
## ML1 ML4 ML2 ML3
## SS loadings 2.93 2.91 2.72 2.09
## Proportion Var 0.11 0.11 0.10 0.08
## Cumulative Var 0.11 0.22 0.32 0.39
## Proportion Explained 0.28 0.27 0.26 0.20
## Cumulative Proportion 0.28 0.55 0.80 1.00
##
## With factor correlations of
## ML1 ML4 ML2 ML3
## ML1 1.00 0.41 0.46 0.24
## ML4 0.41 1.00 0.30 0.25
## ML2 0.46 0.30 1.00 0.19
## ML3 0.24 0.25 0.19 1.00
##
## Mean item complexity = 1.8
## Test of the hypothesis that 4 factors are sufficient.
##
## The degrees of freedom for the null model are 351 and the objective function was 11.43
## The degrees of freedom for the model are 249 and the objective function was 3.31
##
## The root mean square of the residuals (RMSR) is 0.06
## The df corrected root mean square of the residuals is 0.07
##
## Fit based upon off diagonal values = 0.95
## Measures of factor score adequacy
## ML1 ML4 ML2 ML3
## Correlation of (regression) scores with factors 0.91 0.90 0.92 0.87
## Multiple R square of scores with factors 0.83 0.81 0.85 0.75
## Minimum correlation of possible factor scores 0.66 0.62 0.71 0.51
Notice that although we chose to retain 4 factors for our EFA model, the questionnaire items of the SD3 data are assumed to express 3 personality traits. EFA is an exploratory method, so we could rebuild the model with 3 factors and check how well it fits our intuitions; a sketch of that rebuild follows below. Now, only two more steps are left to complete your EFA!
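A sketch of the 3-factor rebuild, to compare against the 4-factor solution above:
# Rebuild the EFA model with 3 factors and inspect its loadings.
f_sdt_3 <- fa(sdt_polychoric, nfactors = 3)
print(f_sdt_3$loadings, cut = 0.3)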
Finally, interpreting an EFA means focusing on the factor loadings and preparing the path diagram. By observing the arrow connections between the underlying factors and the observed variables in the path diagram, you can clearly trace the variable groupings.
Observe the path diagram and try to draw conclusions about the underlying factors in the dataset. Do the twenty-seven statements of the Short Dark Triad test correspond well to the three personality traits: Machiavellianism (a manipulative attitude), narcissism (excessive self-love), and psychopathy (lack of empathy)?
# EFA with 4 factors.
f_sdt <- fa(sdt_polychoric, nfactors = 4)
# Check the factor loadings.
print(f_sdt$loadings, cut=0)
##
## Loadings:
## MR1 MR4 MR3 MR2
## M1 0.013 0.594 -0.015 -0.117
## M2 0.229 0.243 0.343 0.199
## M3 -0.006 -0.021 0.733 0.036
## M4 0.060 0.235 0.354 -0.162
## M5 0.246 0.476 0.136 0.130
## M6 0.152 0.723 -0.124 0.106
## M7 0.036 0.571 0.147 -0.408
## M8 0.492 -0.014 0.334 -0.231
## M9 0.012 0.542 0.193 0.159
## N1 -0.008 0.117 0.159 0.431
## N2 0.001 -0.037 -0.094 -0.563
## N3 0.215 -0.038 0.068 0.596
## N4 -0.070 0.181 0.433 0.394
## N5 -0.056 0.124 0.530 0.194
## N6 -0.227 0.037 -0.274 -0.377
## N7 -0.209 0.274 0.254 0.294
## N8 -0.208 -0.266 -0.007 -0.362
## N9 0.711 0.037 0.051 -0.032
## P1 0.416 0.321 -0.063 0.128
## P2 -0.012 -0.109 -0.036 -0.278
## P3 0.432 0.144 0.040 0.011
## P4 0.112 -0.255 0.122 0.293
## P5 0.552 0.083 -0.004 0.119
## P6 0.604 0.183 -0.028 0.139
## P7 -0.386 0.141 0.151 -0.049
## P8 0.018 -0.004 0.494 -0.186
## P9 0.316 -0.119 0.563 0.079
##
## MR1 MR4 MR3 MR2
## SS loadings 2.373 2.324 2.241 2.023
## Proportion Var 0.088 0.086 0.083 0.075
## Cumulative Var 0.088 0.174 0.257 0.332
# Create the path diagram of the latent factors.
fa.diagram(f_sdt)
The path diagram shows that, out of the 4 factors we extracted, only one is clearly mapped to Machiavellianism (the M items). One factor is isolated and the other two express a mixture of personality traits. To practice your newly-acquired knowledge, you could take different samples of the SD3 dataset, build your EFA models and see whether the interpretation you drew for the current model stays the same, as sketched below!
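One way to run that check is sketched below (the seed and bootstrap resample are arbitrary choices for illustration): refit the model on a resample of the respondents and compare the two sets of loadings with factor.congruence().
# Robustness sketch: refit EFA on a bootstrap sample of respondents and
# compare the factor loadings with the original solution.
set.seed(1)
boot_rows <- sample(nrow(sdt_test), replace = TRUE)
sdt_boot_correl <- hetcor(sdt_test[boot_rows, ])$correlations
f_sdt_boot <- fa(sdt_boot_correl, nfactors = 4)
factor.congruence(f_sdt$loadings, f_sdt_boot$loadings)  # values near 1 = stable factors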
Our course has come to its end. You are now in a position to choose between the most well-known dimensionality reduction techniques for a range of different data types. You have solved a number of challenging code exercises, and you are able to use R to solve a group of data science problems related to high-dimensional datasets. Our last multiple-choice exercise asks you to answer a simple (by now) but critical question that you will face many times in the future: when do we use each of the taught methods?
Check the option that provides the correct and most accurate answer to the following question:
When the data is numeric and you choose to retain those components that explain a big part of the whole variance in the data, you can use PCA. When the data is numeric and you need a computationally more efficient solution, you can use N-NMF. When the data is numeric but you need an expert’s domain knowledge for the parametrisation and the interpretation, you can use EFA.
All 3 statements characterise the methods appropriately! You are now ready to start your journey of applying the right dimensionality reduction methods, with the right amount of confidence, to the multivariate datasets that you will come across!