There are as many principal components as there are variables, but typically it is only the first few of them that explain important amounts of the total variation. (The R Book, page 810)
Principal components analysis (PCA) is a data reduction technique that transforms a larger number of correlated variables into a much smaller set of uncorrelated variables called principal components, while capturing as much information in the original variables as possible. (R in Action, pages 331 and 334)
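The two statements above can be checked with a few lines of base R. This is only a minimal sketch: it assumes the built-in USArrests data set as stand-in data (it is not the data analysed in this document) and uses stats::prcomp() rather than the psych package used later.

```r
# Illustrative data only: the built-in USArrests data set (4 numeric variables).
pca <- prcomp(USArrests, scale. = TRUE)   # scale. = TRUE works on standardized variables

n_components <- ncol(pca$x)               # one component per original variable
score_cor <- cor(pca$x)                   # correlations between component scores
prop_var <- pca$sdev^2 / sum(pca$sdev^2)  # share of total variance per component
cum_var <- cumsum(prop_var)               # cumulative share

print(n_components)
print(round(score_cor, 3))
print(round(cum_var, 2))
```

The off-diagonal entries of score_cor should be zero up to rounding error, and cum_var shows how quickly the leading components accumulate the total variance.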
In contrast, exploratory factor analysis (EFA), a related but distinct approach, is a collection of methods designed to uncover the latent structure in a given set of variables. (R in Action, page 332)
Steps (R in Action, page 333)
• Obtained from 1295 patients in four cancer trials in the UK
• Reference: Staquet MJ, Hays RD, Fayers PM. Quality of Life Assessment in Clinical Trials. Oxford University Press. 1998.
## Create a correlation matrix of 14 variables
dat.hads <- structure(c(1.0, .36, .53, .36, .55, .42, .51,
.31, .50, .34, .32, .37, .53, .31, .36, 1.0,
.29, .48, .26, .52, .40, .52, .22, .41, .17,
.58, .26, .43, .53, .29, 1.0, .40, .61, .38,
.42, .26, .57, .30, .32, .36, .59, .28, .36,
.48, .40, 1.0, .37, .61, .41, .34, .32, .37,
.20, .52, .33, .42, .55, .26, .61, .37, 1.0,
.39, .44, .27, .56, .27, .33, .32, .56, .27,
.42, .52, .38, .61, .39, 1.0, .44, .38, .35,
.45, .17, .54, .37, .44, .51, .40, .42, .41,
.44, .44, 1.0, .32, .41, .36, .35, .43, .45,
.42, .31, .52, .26, .34, .27, .38, .32, 1.0,
.21, .37, .18, .46, .25, .32, .50, .22, .57,
.32, .56, .35, .41, .21, 1.0, .26, .30, .30,
.58, .30, .34, .41, .30, .37, .27, .45, .36,
.37, .26, 1.0, .18, .48, .34, .36, .32, .17,
.32, .20, .33, .17, .35, .18, .30, .18, 1.0,
.18, .35, .21, .37, .58, .36, .52, .32, .54,
.43, .46, .30, .48, .18, 1.0, .31, .40, .53,
.26, .59, .33, .56, .37, .45, .25, .58, .34,
.35, .31, 1.0, .32, .31, .43, .28, .42, .27,
.44, .42, .32, .30, .36, .21, .40, .32, 1.0
), .Dim = c(14L, 14L), .Dimnames = list(c("q1", "q2", "q3", "q4",
"q5", "q6", "q7", "q8", "q9", "q10", "q11", "q12", "q13", "q14"
), c("q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q9", "q10",
"q11", "q12", "q13", "q14")))
Decide number of principal components to extract
Eigenvalues and scree plot
hads.eigen <- eigen(dat.hads)$values
plot(hads.eigen, type = "b")
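When reading a scree plot, it helps to remember that the eigenvalues of a correlation matrix sum to the number of variables (14 here), so each eigenvalue divided by that number is the proportion of total variance carried by the component. A minimal sketch of that arithmetic follows. It assumes the built-in mtcars data set as stand-in data rather than dat.hads, so that it runs on its own.

```r
# Illustrative data only: correlation matrix of the built-in mtcars data set.
R <- cor(mtcars)
ev <- eigen(R, symmetric = TRUE, only.values = TRUE)$values  # decreasing order

p <- ncol(R)                 # number of variables
prop_var <- ev / p           # eigenvalues of a correlation matrix sum to p
cum_var <- cumsum(prop_var)  # cumulative proportion of variance

print(round(cbind(eigenvalue = ev, prop_var, cum_var), 2))
```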
Parallel analysis and eigenvalue scree plot
This is suggestive of two principal components.
library(psych)
fa.parallel(x = dat.hads, n.obs = 1295, fa = "pc", n.iter = 100, show.legend = FALSE)
Error: object 'fa.values.sim' not found
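The idea behind parallel analysis is to compare the observed eigenvalues with eigenvalues obtained from random data of the same dimensions, and to keep only the leading components that beat the random benchmark. The base-R sketch below illustrates that idea under simplifying assumptions: it uses the built-in mtcars data set as stand-in data, normal random noise, and the mean of the simulated eigenvalues as the benchmark. It is not meant to reproduce the internals of psych::fa.parallel(); the names n_iter, random_mean and n_retain are introduced here for illustration.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only

# Illustrative data only: the built-in mtcars data set.
n <- nrow(mtcars)
p <- ncol(mtcars)
observed <- eigen(cor(mtcars))$values

# Eigenvalues from correlation matrices of pure-noise data of the same size.
n_iter <- 100
simulated <- replicate(n_iter, {
  noise <- matrix(rnorm(n * p), nrow = n, ncol = p)
  eigen(cor(noise))$values
})
random_mean <- rowMeans(simulated)   # average random eigenvalue at each position

# Keep the leading components whose eigenvalue exceeds the random benchmark.
above <- observed > random_mean
n_retain <- if (all(above)) p else which(!above)[1] - 1

print(round(cbind(observed, random_mean), 2))
print(n_retain)
```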
Extract 2 principal components using psych::principal()
pca.hads.no.rotate <- principal(r = dat.hads, nfactors = 2, n.obs = 1295, covar = F, rotate = "none")
pca.hads.no.rotate
Principal Components Analysis
Call: principal(r = dat.hads, nfactors = 2, rotate = "none", n.obs = 1295,
covar = F)
Standardized loadings (pattern matrix) based upon correlation matrix
PC1 PC2 h2 u2
q1 0.71 -0.26 0.58 0.42
q2 0.65 0.48 0.65 0.35
q3 0.70 -0.39 0.64 0.36
q4 0.68 0.27 0.53 0.47
q5 0.69 -0.41 0.65 0.35
q6 0.72 0.29 0.60 0.40
q7 0.70 -0.05 0.50 0.50
q8 0.56 0.37 0.45 0.55
q9 0.65 -0.45 0.62 0.38
q10 0.60 0.28 0.44 0.56
q11 0.44 -0.32 0.29 0.71
q12 0.69 0.39 0.63 0.37
q13 0.69 -0.41 0.65 0.35
q14 0.60 0.25 0.42 0.58
PC1 PC2
SS loadings 5.96 1.68
Proportion Var 0.43 0.12
Cumulative Var 0.43 0.55
Proportion Explained 0.78 0.22
Cumulative Proportion 0.78 1.00
Test of the hypothesis that 2 components are sufficient.
The degrees of freedom for the null model are 91 and the objective function was 5.81
The degrees of freedom for the model are 64 and the objective function was 0.4
The number of observations was 1295 with Chi Square = 520.4 with prob < 1e-72
Fit based upon off diagonal values = 0.98
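The columns of this output are linked by simple arithmetic. For unrotated components, SS loadings are the column sums of squared loadings (5.96 / 14 ≈ 0.43 gives Proportion Var for PC1), h2 is the row sum of squared loadings (for q2, 0.65² + 0.48² ≈ 0.65), and u2 = 1 - h2. The sketch below rebuilds these quantities from an eigendecomposition in base R. It assumes the built-in mtcars data set as stand-in data, and the signs of its loadings may differ from what psych::principal() would report.

```r
# Illustrative data only: correlation matrix of the built-in mtcars data set.
R <- cor(mtcars)
e <- eigen(R)
k <- 2                                   # number of components kept

# Unrotated loadings: eigenvectors scaled by the square root of the eigenvalues.
loadings <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))
rownames(loadings) <- colnames(R)
colnames(loadings) <- paste0("PC", 1:k)

ss_loadings <- colSums(loadings^2)       # equals the first k eigenvalues
prop_var <- ss_loadings / ncol(R)        # "Proportion Var"
h2 <- rowSums(loadings^2)                # communality of each variable
u2 <- 1 - h2                             # uniqueness of each variable

print(round(cbind(loadings, h2, u2), 2))
print(round(rbind(ss_loadings, prop_var), 2))
```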
Rotate for more interpretability (purification of components)
pca.hads.varimax <- principal(r = dat.hads, nfactors = 2, n.obs = 1295, covar = F, rotate = "varimax")
pca.hads.varimax
Principal Components Analysis
Call: principal(r = dat.hads, nfactors = 2, rotate = "varimax", n.obs = 1295,
covar = F)
Standardized loadings (pattern matrix) based upon correlation matrix
PC1 PC2 h2 u2
q1 0.33 0.68 0.58 0.42
q2 0.80 0.11 0.65 0.35
q3 0.23 0.77 0.64 0.36
q4 0.67 0.28 0.53 0.47
q5 0.21 0.78 0.65 0.35
q6 0.72 0.29 0.60 0.40
q7 0.47 0.53 0.50 0.50
q8 0.66 0.12 0.45 0.55
q9 0.16 0.77 0.62 0.38
q10 0.63 0.21 0.44 0.56
q11 0.10 0.53 0.29 0.71
q12 0.77 0.20 0.63 0.37
q13 0.21 0.78 0.65 0.35
q14 0.60 0.24 0.42 0.58
PC1 PC2
SS loadings 3.89 3.76
Proportion Var 0.28 0.27
Cumulative Var 0.28 0.55
Proportion Explained 0.51 0.49
Cumulative Proportion 0.51 1.00
Test of the hypothesis that 2 components are sufficient.
The degrees of freedom for the null model are 91 and the objective function was 5.81
The degrees of freedom for the model are 64 and the objective function was 0.4
The number of observations was 1295 with Chi Square = 520.4 with prob < 1e-72
Fit based upon off diagonal values = 0.98
The varimax rotation is an orthogonal rotation (the rotated components remain uncorrelated) that attempts to purify the components by limiting the number of original variables that are highly correlated with each component.
After rotation, PC1 is correlated mainly with the even-numbered questions and PC2 mainly with the odd-numbered questions; q7 loads moderately on both (0.47 and 0.53).
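An orthogonal rotation redistributes variance between the retained components without changing how much they explain together. In the two outputs above, the SS loadings sum to about 7.64 before rotation and 7.65 after, and h2 is identical for every question. The sketch below checks these properties with stats::varimax() in base R. It assumes the built-in mtcars data set as stand-in data and is not a reproduction of what psych::principal() does internally.

```r
# Illustrative data only: correlation matrix of the built-in mtcars data set.
R <- cor(mtcars)
e <- eigen(R)
k <- 2
unrotated <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))
rownames(unrotated) <- colnames(R)

vm <- varimax(unrotated)             # stats::varimax, Kaiser-normalized by default
rotated <- unclass(vm$loadings)      # plain matrix of rotated loadings
rot <- vm$rotmat                     # k x k rotation matrix

h2_before <- rowSums(unrotated^2)    # communalities before rotation
h2_after <- rowSums(rotated^2)       # communalities after rotation

print(round(crossprod(rot), 3))      # identity matrix if the rotation is orthogonal
print(round(cbind(h2_before, h2_after), 3))
print(round(rbind(unrotated = colSums(unrotated^2),
                  rotated = colSums(rotated^2)), 2))
```

The last table shows the SS loadings becoming more evenly spread across the two components while their total stays the same.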