EPI 288 Lecture 3: Principal Component Analysis

Principal component analysis

There are as many principal components as there are variables, but typically it is only the first few of them that explain important amounts of the total variation. (The R Book, page 810)

Principal components analysis (PCA) is a data reduction technique that transforms a larger number of correlated variables into a much smaller set of uncorrelated variables called principal components, while capturing as much information in the original variables as possible. (R in Action, pages 331 and 334)

In contrast, exploratory factor analysis (EFA), a similar but distinct method, is a collection of methods designed to uncover the latent structure in a given set of variables. (R in Action, page 332)
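To make the description of PCA above concrete, here is a minimal sketch using base R's prcomp() on the built-in USArrests data. The choice of data set is an assumption made only for illustration; it is not the HADS data used in this lecture. The sketch checks three things: there are as many components as variables, the proportions of variance are sorted from largest to smallest, and the component scores are uncorrelated.

```r
# Illustration on R's built-in USArrests data (an assumption; not the HADS data)
pca <- prcomp(USArrests, scale. = TRUE)  # scale. = TRUE standardizes the variables first

# There are as many components as original variables
n_components <- ncol(pca$x)
print(n_components)

# Proportion of the total variance explained by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
print(round(prop_var, 2))

# Component scores are mutually uncorrelated (off-diagonal entries are ~0)
score_cor <- cor(pca$x)
print(round(score_cor, 2))
```

Standardizing the variables corresponds to working from a correlation matrix, which is also the input used for the HADS example.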

Steps (R in Action, page 333)

HADS Data

• Obtained from 1295 patients in four cancer trials in the UK
• Reference: Staquet MJ, Hays RD, Fayers PM. Quality of Life Assessment in Clinical Trials. Oxford University Press. 1998.

## Create a correlation matrix of 14 variables
dat.hads <- structure(c(1.0, .36, .53, .36, .55, .42, .51,
.31, .50, .34, .32, .37, .53, .31, .36, 1.0,
.29, .48, .26, .52, .40, .52, .22, .41, .17,
.58, .26, .43, .53, .29, 1.0, .40, .61, .38,
.42, .26, .57, .30, .32, .36, .59, .28, .36,
.48, .40, 1.0, .37, .61, .41, .34, .32, .37,
.20, .52, .33, .42, .55, .26, .61, .37, 1.0,
.39, .44, .27, .56, .27, .33, .32, .56, .27,
.42, .52, .38, .61, .39, 1.0, .44, .38, .35,
.45, .17, .54, .37, .44, .51, .40, .42, .41,
.44, .44, 1.0, .32, .41, .36, .35, .43, .45,
.42, .31, .52, .26, .34, .27, .38, .32, 1.0,
.21, .37, .18, .46, .25, .32, .50, .22, .57,
.32, .56, .35, .41, .21, 1.0, .26, .30, .30,
.58, .30, .34, .41, .30, .37, .27, .45, .36,
.37, .26, 1.0, .18, .48, .34, .36, .32, .17,
.32, .20, .33, .17, .35, .18, .30, .18, 1.0,
.18, .35, .21, .37, .58, .36, .52, .32, .54,
.43, .46, .30, .48, .18, 1.0, .31, .40, .53,
.26, .59, .33, .56, .37, .45, .25, .58, .34,
.35, .31, 1.0, .32, .31, .43, .28, .42, .27,
.44, .42, .32, .30, .36, .21, .40, .32, 1.0
), .Dim = c(14L, 14L), .Dimnames = list(c("q1", "q2", "q3", "q4",
"q5", "q6", "q7", "q8", "q9", "q10", "q11", "q12", "q13", "q14"
), c("q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q9", "q10",
"q11", "q12", "q13", "q14")))

Decide number of principal components to extract

Eigenvalues and scree plot

hads.eigen <- eigen(dat.hads)$values
plot(hads.eigen, type = "b")

[Figure: scree plot of the eigenvalues of dat.hads]
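The eigenvalues of a correlation matrix sum to the number of variables, so each eigenvalue divided by that number is the proportion of total variance carried by the corresponding component. The sketch below shows the arithmetic on a made-up 3-variable correlation matrix (hypothetical values, not the HADS data). The "eigenvalue greater than 1" cut-off in the last step is a common rule of thumb added here for illustration; it is not taken from the lecture.

```r
# Hypothetical 3-variable correlation matrix (all pairwise correlations 0.5)
R_toy <- matrix(0.5, nrow = 3, ncol = 3)
diag(R_toy) <- 1

# Eigenvalues, returned in decreasing order
ev <- eigen(R_toy, symmetric = TRUE)$values

# Proportion and cumulative proportion of total variance
prop_var <- ev / sum(ev)
cum_var  <- cumsum(prop_var)
print(round(cbind(eigenvalue = ev, prop_var, cum_var), 2))

# Rule of thumb (an assumption, not from the lecture): keep eigenvalues > 1
n_keep <- sum(ev > 1)
print(n_keep)
```

The same lines should work on the lecture's matrix by replacing R_toy with dat.hads.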

Parallel analysis and eigenvalue scree plot

The parallel analysis is suggestive of two principal components.

library(psych)
## fa = "pc" shows the principal component eigenvalues only
fa.parallel(x = dat.hads, n.obs = 1295, fa = "pc", n.iter = 100, show.legend = FALSE)
Error: object 'fa.values.sim' not found

[Figure: parallel analysis scree plot]
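The idea behind parallel analysis can be sketched in base R: simulate uncorrelated random data with the same number of observations and variables, average the eigenvalues of the resulting correlation matrices, and keep the leading components whose observed eigenvalues exceed those reference values. This is a simplified sketch, not the algorithm psych::fa.parallel() actually runs. The helper names simulate_eigen() and count_retained() are hypothetical, and toy_observed contains made-up numbers rather than the HADS eigenvalues.

```r
set.seed(288)

# Mean eigenvalues of correlation matrices computed from pure-noise data
simulate_eigen <- function(n_obs, n_vars, n_iter = 100) {
  sims <- replicate(n_iter, {
    x <- matrix(rnorm(n_obs * n_vars), nrow = n_obs, ncol = n_vars)
    eigen(cor(x), symmetric = TRUE, only.values = TRUE)$values
  })
  rowMeans(sims)  # one reference value per component position
}

# Number of leading components whose eigenvalue exceeds the reference
count_retained <- function(observed, reference) {
  below <- which(observed <= reference)
  if (length(below) == 0) length(observed) else below[1] - 1
}

# Same dimensions as the HADS example: 1295 observations, 14 variables
sim_eigen <- simulate_eigen(n_obs = 1295, n_vars = 14, n_iter = 100)
print(round(sim_eigen, 2))

# Made-up observed eigenvalues, for illustration only (NOT the HADS values)
toy_observed <- c(6.0, 1.7, rep(0.525, 12))
print(count_retained(toy_observed, sim_eigen))
```

In the lecture's session the comparison would be count_retained(hads.eigen, sim_eigen).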

Extract 2 principal components using psych::principal()

pca.hads.no.rotate <- principal(r = dat.hads, nfactors = 2, n.obs = 1295, covar = F, rotate = "none")
pca.hads.no.rotate
Principal Components Analysis
Call: principal(r = dat.hads, nfactors = 2, rotate = "none", n.obs = 1295, 
    covar = F)
Standardized loadings (pattern matrix) based upon correlation matrix
     PC1   PC2   h2   u2
q1  0.71 -0.26 0.58 0.42
q2  0.65  0.48 0.65 0.35
q3  0.70 -0.39 0.64 0.36
q4  0.68  0.27 0.53 0.47
q5  0.69 -0.41 0.65 0.35
q6  0.72  0.29 0.60 0.40
q7  0.70 -0.05 0.50 0.50
q8  0.56  0.37 0.45 0.55
q9  0.65 -0.45 0.62 0.38
q10 0.60  0.28 0.44 0.56
q11 0.44 -0.32 0.29 0.71
q12 0.69  0.39 0.63 0.37
q13 0.69 -0.41 0.65 0.35
q14 0.60  0.25 0.42 0.58

                       PC1  PC2
SS loadings           5.96 1.68
Proportion Var        0.43 0.12
Cumulative Var        0.43 0.55
Proportion Explained  0.78 0.22
Cumulative Proportion 0.78 1.00

Test of the hypothesis that 2 components are sufficient.

The degrees of freedom for the null model are  91  and the objective function was  5.81
The degrees of freedom for the model are 64  and the objective function was  0.4 
The number of observations was  1295  with Chi Square =  520.4  with prob <  1e-72 

Fit based upon off diagonal values = 0.98
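The columns of the output above are linked by simple arithmetic: h2 (communality) is the sum of squared loadings across the retained components, u2 (uniqueness) is 1 - h2, and "SS loadings" is the column sum of squared loadings. For example, 5.96 / 14 ≈ 0.43 gives "Proportion Var" for PC1, and 5.96 / (5.96 + 1.68) ≈ 0.78 gives "Proportion Explained". The sketch below reproduces these quantities from an eigendecomposition of a made-up 4-variable correlation matrix (hypothetical values, not the HADS data). It assumes that unrotated loadings are eigenvectors scaled by the square roots of their eigenvalues; signs may differ from what principal() prints.

```r
# Hypothetical correlation matrix with two blocks of related variables
R_toy <- matrix(c(1.0, 0.6, 0.2, 0.2,
                  0.6, 1.0, 0.2, 0.2,
                  0.2, 0.2, 1.0, 0.6,
                  0.2, 0.2, 0.6, 1.0), nrow = 4, byrow = TRUE)

e <- eigen(R_toy, symmetric = TRUE)
k <- 2  # number of components retained

# Unrotated loadings: eigenvectors scaled by sqrt(eigenvalue)
loadings <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))

h2 <- rowSums(loadings^2)            # communality of each variable
u2 <- 1 - h2                         # uniqueness of each variable
ss_loadings <- colSums(loadings^2)   # equals the first k eigenvalues
prop_var <- ss_loadings / nrow(R_toy)            # share of total variance
prop_explained <- ss_loadings / sum(ss_loadings) # share among retained components

print(round(cbind(loadings, h2, u2), 2))
print(round(rbind(ss_loadings, prop_var, prop_explained), 2))
```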

Rotate for more interpretability (purification of components)

pca.hads.varimax <- principal(r = dat.hads, nfactors = 2, n.obs = 1295, covar = F, rotate = "varimax")
pca.hads.varimax
Principal Components Analysis
Call: principal(r = dat.hads, nfactors = 2, rotate = "varimax", n.obs = 1295, 
    covar = F)
Standardized loadings (pattern matrix) based upon correlation matrix
     PC1  PC2   h2   u2
q1  0.33 0.68 0.58 0.42
q2  0.80 0.11 0.65 0.35
q3  0.23 0.77 0.64 0.36
q4  0.67 0.28 0.53 0.47
q5  0.21 0.78 0.65 0.35
q6  0.72 0.29 0.60 0.40
q7  0.47 0.53 0.50 0.50
q8  0.66 0.12 0.45 0.55
q9  0.16 0.77 0.62 0.38
q10 0.63 0.21 0.44 0.56
q11 0.10 0.53 0.29 0.71
q12 0.77 0.20 0.63 0.37
q13 0.21 0.78 0.65 0.35
q14 0.60 0.24 0.42 0.58

                       PC1  PC2
SS loadings           3.89 3.76
Proportion Var        0.28 0.27
Cumulative Var        0.28 0.55
Proportion Explained  0.51 0.49
Cumulative Proportion 0.51 1.00

Test of the hypothesis that 2 components are sufficient.

The degrees of freedom for the null model are  91  and the objective function was  5.81
The degrees of freedom for the model are 64  and the objective function was  0.4 
The number of observations was  1295  with Chi Square =  520.4  with prob <  1e-72 

Fit based upon off diagonal values = 0.98

The varimax rotation is an orthogonal rotation (the rotated components remain uncorrelated) that attempts to purify the components by limiting the number of original variables that are highly correlated with each component.

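After rotation, PC1 is correlated mainly with the even-numbered questions and PC2 mainly with the odd-numbered questions. The exception is q7, which loads about equally on both components (0.47 and 0.53).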
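As a rough check on what an orthogonal rotation does, the sketch below applies base R's stats::varimax() to the unrotated loadings copied (rounded to two decimals) from the first principal() output above. This is not necessarily the exact computation psych::principal() performs internally. Because of rounding, the rotated values should only be close to the printed varimax loadings, and the columns may come out swapped or with flipped signs. The rotation matrix is orthogonal, and each variable's communality is the same before and after rotation.

```r
# Unrotated loadings, copied from the principal() output above (rounded)
pc1 <- c(0.71, 0.65, 0.70, 0.68, 0.69, 0.72, 0.70,
         0.56, 0.65, 0.60, 0.44, 0.69, 0.69, 0.60)
pc2 <- c(-0.26, 0.48, -0.39, 0.27, -0.41, 0.29, -0.05,
         0.37, -0.45, 0.28, -0.32, 0.39, -0.41, 0.25)
unrotated <- cbind(PC1 = pc1, PC2 = pc2)
rownames(unrotated) <- paste0("q", 1:14)

# Varimax rotation from the stats package (Kaiser normalization by default)
vm <- varimax(unrotated)
rotated <- unclass(vm$loadings)
print(round(rotated, 2))

# The rotation matrix is orthogonal: t(T) %*% T is the identity
rot_check <- crossprod(vm$rotmat)
print(round(rot_check, 2))

# Communalities (h2) are unchanged by the rotation
h2_before <- rowSums(unrotated^2)
h2_after  <- rowSums(rotated^2)
print(round(cbind(h2_before, h2_after), 2))
```

Only the split of variance between the two components changes; the total stays the same (5.96 + 1.68 before, 3.89 + 3.76 after, equal up to rounding).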