We first load the necessary packages in R and read in our dataset prince.
library(tidyverse)
library(psych)
library(rela)
library(MASS)
library(parallel)
options(scipen = 999)
prince <- read.table("prince.dat", sep = "", header = FALSE)
Principal Components Analysis (PCA) is a method of data reduction: it seeks a set of linear combinations of the original variates, fewer in number than the variates themselves, that accounts for as much of the variation in the data as possible.
This method has three key assumptions:

- Sphericity: the correlation matrix differs from the identity matrix, i.e., the variables are intercorrelated
- Sampling adequacy: an appropriate number of observations relative to the number of variables being examined
- A positive determinant of the correlation or variance-covariance matrix
We first compute the correlation matrix \(\boldsymbol{\rho}\) and the variance-covariance matrix \(\boldsymbol{\Sigma}\).
princeCor <- cor(prince[,1:4])
princeCov <- cov(prince[,1:4])
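On the correlation scale, PCA amounts to the eigendecomposition of \(\boldsymbol{\rho}\): the eigenvalues are the component variances and the eigenvectors define the linear combinations. A minimal cross-check against the output reported later:
# Eigendecomposition of the correlation matrix: eigenvalues are the
# component variances; eigenvectors define the linear combinations
eig <- eigen(princeCor)
eig$values                      # should match the SS loadings reported below
eig$values / sum(eig$values)    # proportion of variance per component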
We now examine the data to assess whether the assumptions for PCA have been met before proceeding. We use the paf() function from the rela package.
assumptions <- paf(as.matrix(prince[,1:4]), eigcrit = 1, convcrit = .001)
To test for the first assumption, we perform Bartlett’s Test for Sphericity.
The null hypothesis for this test is that the population intercorrelation matrix is the identity matrix, that is, that there is no correlation between the variables; this would render PCA pointless, since it depends on constructing linear combinations of correlated variables.
We use a significance level \(\alpha = .05\).
bartlettTest <- cortest.bartlett(princeCor, n = 98)
bartlettTest
$chisq
[1] 228.97
$p.value
[1] 0.00000000000000000000000000000000000000000000012692
$df
[1] 6
Bartlett’s test statistic follows a \(\chi^{2}\) distribution with \(p(p-1)/2\) degrees of freedom, here \(6\) for \(p = 4\) variables. Our p-value is effectively 0, so we reject the null hypothesis and conclude that the variables are sufficiently intercorrelated for PCA.
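As a sanity check, the statistic can be computed by hand from the determinant of the correlation matrix using the standard formula \(\chi^{2} = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln\lvert\boldsymbol{R}\rvert\):
# Hand computation of Bartlett's statistic (n = 98 observations, p = 4 variables)
n <- 98; p <- 4
-(n - 1 - (2 * p + 5) / 6) * log(det(princeCor))  # ~228.97 on p * (p - 1) / 2 = 6 df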
We proceed to evaluate the sampling adequacy. Using the assumptions variable calculated earlier, we extract the Kaiser-Meyer-Olkin (KMO) value.
print(assumptions$KMO)
[1] 0.63989
Our KMO value is 0.63989, above the commonly cited minimum of 0.6 but well short of the ideal of 1. However, it is often recommended that a ratio of 20 observations to one variable will yield satisfactory results {Schumacker:2015tn}. In this case we have 98 observations on four variables, and therefore feel reasonably confident that we can proceed with the analysis.
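As an independent check, the psych package provides a KMO() function that reports the same overall measure along with per-variable values:
KMO(princeCor)  # overall MSA should agree with the value above (~0.64)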
We now verify that the determinant of the correlation matrix is positive.
det(princeCor)
[1] 0.089415
Clearly, the determinant of the correlation matrix is positive. We have thus satisfied all the assumptions for PCA and proceed with the analysis.
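Since the determinant of a matrix equals the product of its eigenvalues, a positive determinant here is equivalent to all eigenvalues of the correlation matrix being positive, i.e., there are no exact linear dependencies among the variables:
prod(eigen(princeCor)$values)     # equals det(princeCor)
all(eigen(princeCor)$values > 0)  # TRUE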
We can use a scree plot to aid with the decision of how many components to include:
fa.parallel(prince[, 1:4], n.obs = 98, fm = "pa", fa = "pc")
You specified the number of subjects, implying a correlation matrix, but do not have a correlation matrix, correlations found
A loading greater than abs(1) was detected. Examine the loadings carefully.
The estimated weights for the factor scores are probably incorrect. Try a different factor extraction method.
An ultra-Heywood case was detected. Examine the results carefully
Parallel analysis suggests that the number of factors = NA and the number of components = 1
From this plot, we can see that the first component has by far the largest eigenvalue; the parallel analysis likewise suggests retaining a single component.
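The Kaiser criterion, which retains components whose eigenvalues exceed 1, points to the same choice:
sum(eigen(princeCor)$values > 1)  # 1: only the first eigenvalue (2.46) exceeds 1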
We now construct a PCA model with a single principal component.
pcaModel1 <- principal(princeCor, rotate = "none")
pcaModel1
Principal Components Analysis
Call: principal(r = princeCor, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
PC1 h2 u2 com
V1 0.41 0.17 0.833 1
V2 0.95 0.91 0.089 1
V3 0.72 0.51 0.487 1
V4 0.94 0.87 0.126 1
PC1
SS loadings 2.46
Proportion Var 0.62
Mean item complexity = 1
Test of the hypothesis that 1 component is sufficient.
The root mean square of the residuals (RMSR) is 0.14
Fit based upon off diagonal values = 0.93
The SS loadings row gives the eigenvalue of the single principal component, 2.46. Dividing by the number of variables gives the proportion of variance it explains, \(2.46 / 4 \approx 62\%\), which leaves 38% of the variance unexplained.
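The same figure can be recovered from the eigendecomposition computed earlier:
eig$values[1] / 4  # ~0.62, proportion of variance explained by PC1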
We thus fit a second model, which allows for additional components.
pcaModel2 <- principal(princeCor, nfactors = 3, rotate = "none")
pcaModel2
Principal Components Analysis
Call: principal(r = princeCor, nfactors = 3, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
PC1 PC2 PC3 h2 u2 com
V1 0.41 0.87 0.26 1.00 0.000061 1.6
V2 0.95 -0.02 -0.22 0.96 0.040456 1.1
V3 0.72 -0.45 0.54 1.00 0.000350 2.6
V4 0.94 -0.02 -0.30 0.96 0.035164 1.2
PC1 PC2 PC3
SS loadings 2.46 0.96 0.49
Proportion Var 0.62 0.24 0.12
Cumulative Var 0.62 0.86 0.98
Proportion Explained 0.63 0.25 0.13
Cumulative Proportion 0.63 0.87 1.00
Mean item complexity = 1.6
Test of the hypothesis that 3 components are sufficient.
The root mean square of the residuals (RMSR) is 0.02
Fit based upon off diagonal values = 1
The eigenvalues of the components are \(\lambda_{1} = 2.46\) (as computed previously), \(\lambda_{2} = 0.96\) and \(\lambda_{3} = 0.49\). Together these components account for 98% of the variation in the dataset.
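As a quick check on the cumulative figure:
(2.46 + 0.96 + 0.49) / 4  # ~0.98 of the total variance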
The first PC is strongly correlated with variables V2 and V4, moderately correlated with V3, and weakly correlated with V1.
The second PC is strongly correlated with V1, moderately negatively correlated with V3, and essentially uncorrelated with V2 and V4.
The third PC has positive loadings on V1 and V3 and negative loadings on V2 and V4.
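If per-observation component scores are wanted, principal() can be given the raw data rather than the correlation matrix; a sketch, with the same column indices as before:
pcaScores <- principal(prince[, 1:4], nfactors = 3, rotate = "none")$scores
head(pcaScores)  # one column of scores per retained component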
We load the additional packages required to perform the Factor Analysis.
library(corpcor)
library(GPArotation)
We convert our dataset to a matrix to facilitate the computations.
factData <- as.matrix(prince[,1:4])
We then compute the correlation and covariance matrices.
factCor <- cor(factData)
factCov <- cov(factData)
The key assumptions of Factor Analysis are the same as those for the PCA performed above:

1. Adequate sample size
2. Sphericity
3. Positive determinant of the correlation matrix

As these assumptions were all tested and satisfied (with some caveats), we proceed with the Factor Analysis.
factorAnalysis <- fa(factCor, nfactors = 3, n.obs = 98, rotate = "none")
print(factorAnalysis)
Factor Analysis using method = minres
Call: fa(r = factCor, nfactors = 3, n.obs = 98, rotate = "none")
Standardized loadings (pattern matrix) based upon correlation matrix
MR1 MR2 MR3 h2 u2 com
V1 0.30 0.38 0.06 0.24 0.7631 1.9
V2 0.99 0.04 0.05 0.99 0.0078 1.0
V3 0.59 -0.35 0.05 0.47 0.5337 1.6
V4 0.93 0.06 -0.10 0.88 0.1220 1.0
MR1 MR2 MR3
SS loadings 2.28 0.27 0.02
Proportion Var 0.57 0.07 0.00
Cumulative Var 0.57 0.64 0.64
Proportion Explained 0.89 0.11 0.01
Cumulative Proportion 0.89 0.99 1.00
Mean item complexity = 1.4
Test of the hypothesis that 3 factors are sufficient.
The degrees of freedom for the null model are 6 and the objective function was 2.41 with Chi Square of 228.97
The degrees of freedom for the model are -3 and the objective function was 0
The root mean square of the residuals (RMSR) is 0
The df corrected root mean square of the residuals is NA
The harmonic number of observations is 98 with the empirical chi square 0 with prob < NA
The total number of observations was 98 with Likelihood Chi Square = 0 with prob < NA
Tucker Lewis Index of factoring reliability = 1.028
Fit based upon off diagonal values = 1
Measures of factor score adequacy
MR1 MR2 MR3
Correlation of (regression) scores with factors 1.00 0.55 0.39
Multiple R square of scores with factors 0.99 0.31 0.15
Minimum correlation of possible factor scores 0.98 -0.39 -0.69
We construct a scree plot to aid with the selection of the number of factors.
plot(factorAnalysis$values, type = "b", xlab = "Factor number", ylab = "Eigenvalue")
From this plot, we see that the eigenvalues drop precipitously after the first factor.
We therefore construct a new analysis with only a single factor.
factorAnalysis2 <- fa(factCor, nfactors = 1, n.obs = 98, rotate = "none", fm = "mle")
print(factorAnalysis2)
Factor Analysis using method = ml
Call: fa(r = factCor, nfactors = 1, n.obs = 98, rotate = "none", fm = "mle")
Standardized loadings (pattern matrix) based upon correlation matrix
ML1 h2 u2 com
V1 0.31 0.097 0.903 1
V2 1.00 0.995 0.005 1
V3 0.57 0.328 0.672 1
V4 0.92 0.853 0.147 1
ML1
SS loadings 2.27
Proportion Var 0.57
Mean item complexity = 1
Test of the hypothesis that 1 factor is sufficient.
The degrees of freedom for the null model are 6 and the objective function was 2.41 with Chi Square of 228.97
The degrees of freedom for the model are 2 and the objective function was 0.03
The root mean square of the residuals (RMSR) is 0.06
The df corrected root mean square of the residuals is 0.1
The harmonic number of observations is 98 with the empirical chi square 3.6 with prob < 0.17
The total number of observations was 98 with Likelihood Chi Square = 2.98 with prob < 0.23
Tucker Lewis Index of factoring reliability = 0.987
RMSEA index = 0.074 and the 90 % confidence intervals are 0 0.226
BIC = -6.19
Fit based upon off diagonal values = 0.99
Measures of factor score adequacy
ML1
Correlation of (regression) scores with factors 1.00
Multiple R square of scores with factors 1.00
Minimum correlation of possible factor scores 0.99
Using the above output, we examine whether a single factor suffices. Note that the \(\chi^{2}\) value of 228.97 with 6 \(df\) belongs to the null model (no common factors); its p-value, calculated with the pchisq() function, is effectively 0, which confirms only that the variables share common variance.
pchisq(228.97, df = 6, lower.tail = FALSE)
[1] 0.00000000000000000000000000000000000000000000012701
The test of the hypothesis that one factor is sufficient is instead the likelihood \(\chi^{2}\) of 2.98 on 2 \(df\) with \(p \approx 0.23\); we therefore cannot reject the one-factor model.
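The corresponding p-value can be computed the same way:
pchisq(2.98, df = 2, lower.tail = FALSE)  # ~0.23, matching the reported prob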
For comparison, we also fit a model with three factors; as expected, the null-model \(\chi^{2}\) of 228.97 is unchanged.
factorAnalysis3 <- fa(factCov, nfactors = 3, n.obs = 98, rotate = "none", fm = "mle")
print(factorAnalysis3)
Factor Analysis using method = ml
Call: fa(r = factCov, nfactors = 3, n.obs = 98, rotate = "none", fm = "mle")
Standardized loadings (pattern matrix) based upon correlation matrix
ML1 ML3 ML2 h2 u2 com
V1 0.31 -0.35 0.05 0.22 0.778 2.0
V2 0.98 0.00 0.19 0.99 0.005 1.1
V3 0.56 0.38 0.14 0.48 0.524 1.9
V4 0.98 0.00 -0.19 0.99 0.005 1.1
ML1 ML3 ML2
SS loadings 2.32 0.27 0.10
Proportion Var 0.58 0.07 0.02
Cumulative Var 0.58 0.65 0.67
Proportion Explained 0.86 0.10 0.04
Cumulative Proportion 0.86 0.96 1.00
Mean item complexity = 1.5
Test of the hypothesis that 3 factors are sufficient.
The degrees of freedom for the null model are 6 and the objective function was 2.41 with Chi Square of 228.97
The degrees of freedom for the model are -3 and the objective function was 0
The root mean square of the residuals (RMSR) is 0
The df corrected root mean square of the residuals is NA
The harmonic number of observations is 98 with the empirical chi square 0 with prob < NA
The total number of observations was 98 with Likelihood Chi Square = 0 with prob < NA
Tucker Lewis Index of factoring reliability = 1.028
Fit based upon off diagonal values = 1
Measures of factor score adequacy
ML1 ML3 ML2
Correlation of (regression) scores with factors 1.00 0.55 0.97
Multiple R square of scores with factors 1.00 0.31 0.94
Minimum correlation of possible factor scores 0.99 -0.39 0.87
Factor scores estimated using the regression method have correlations of
ML1 ML3 ML2
Correlation of scores with factors 0.06 0.05 0.04
Multiple R square of scores with factors 0.00 0.00 0.00
Minimum correlation of possible factor scores -0.99 -1.00 -1.00
We note that factor ML1 has substantial loadings on all four variables; ML2's loadings, though spread across all four variables, are uniformly weak; and ML3 loads only on variables V1 (negatively) and V3.
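This pattern is easier to see when small loadings are suppressed in the printout (a cutoff of 0.3 is a conventional, if arbitrary, choice):
print(factorAnalysis3$loadings, cutoff = 0.3)  # hide loadings below 0.3 in absolute value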