Welcome to this R demo session! Here, I will demonstrate how to use R to conduct principal component analysis (PCA).
Our first example data originates from the National Institute of Child Health and Human Development (NICHD). It includes five subtests designed to measure different aspects of intelligence: picvoc, letword, passcomp, calc, and appprobs.
Our goal is to create a principal component that reflects general intelligence. We will then use this component to predict outcomes based on a set of social skills variables.
#load the data
library(readxl)
example1 <- read_excel("pca_example1.xlsx")
#examine a snapshot of the data
head(example1)
## # A tibble: 6 × 9
## picvoc letword passcomp calc appprobs peercomp coop assert selfcont
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 509 498 496 503 508 17 18 11 17
## 2 512 514 508 534 530 13 14 13 11
## 3 512 524 510 514 511 15 17 11 16
## 4 486 488 493 503 499 9 2 14 5
## 5 512 528 521 523 516 18 19 15 17
## 6 505 520 510 528 525 18 17 19 18
#data summary
summary(example1)
## picvoc letword passcomp calc appprobs
## Min. :476 Min. :460.0 Min. :466.0 Min. :469.0 Min. :477.0
## 1st Qu.:501 1st Qu.:504.0 1st Qu.:502.0 1st Qu.:506.0 1st Qu.:508.0
## Median :509 Median :514.0 Median :508.0 Median :514.0 Median :514.0
## Mean :508 Mean :513.2 Mean :507.6 Mean :513.8 Mean :512.4
## 3rd Qu.:516 3rd Qu.:524.0 3rd Qu.:516.0 3rd Qu.:523.0 3rd Qu.:519.0
## Max. :542 Max. :557.0 Max. :542.0 Max. :551.0 Max. :544.0
## peercomp coop assert selfcont
## Min. : 3.00 Min. : 2.00 Min. : 1.00 Min. : 1.00
## 1st Qu.:14.00 1st Qu.:15.00 1st Qu.:11.00 1st Qu.:13.00
## Median :16.00 Median :18.00 Median :14.00 Median :16.00
## Mean :15.68 Mean :16.65 Mean :13.34 Mean :15.53
## 3rd Qu.:18.00 3rd Qu.:19.00 3rd Qu.:16.25 3rd Qu.:18.00
## Max. :20.00 Max. :20.00 Max. :20.00 Max. :20.00
Now let’s examine the descriptive information.
#a nice simple way to get lots of descriptive information
library(psych)
describe(example1[,1:5])
## vars n mean sd median trimmed mad min max range skew
## picvoc 1 652 508.04 10.81 509 508.06 11.86 476 542 66 -0.01
## letword 2 652 513.21 15.05 514 513.59 14.83 460 557 97 -0.25
## passcomp 3 652 507.60 10.40 508 507.89 8.90 466 542 76 -0.34
## calc 4 652 513.78 11.70 514 513.82 11.86 469 551 82 -0.05
## appprobs 5 652 512.38 10.10 514 512.64 8.90 477 544 67 -0.26
## kurtosis se
## picvoc 0.07 0.42
## letword 0.22 0.59
## passcomp 0.66 0.41
## calc 0.25 0.46
## appprobs 0.50 0.40
#correlations
rexample1 <- corr.test(example1[,1:5])$r
rexample1
## picvoc letword passcomp calc appprobs
## picvoc 1.0000000 0.4786041 0.5481996 0.3145295 0.4377038
## letword 0.4786041 1.0000000 0.5985786 0.4080867 0.4518319
## passcomp 0.5481996 0.5985786 1.0000000 0.3961540 0.4932971
## calc 0.3145295 0.4080867 0.3961540 1.0000000 0.6308794
## appprobs 0.4377038 0.4518319 0.4932971 0.6308794 1.0000000
We can see that all pairs of variables have moderate correlations, so these data are well suited to PCA.
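As an aside, psych’s lowerCor() prints the same correlations rounded to two decimals in lower-triangle form, which is easier to scan; a quick sketch:
#same correlations, rounded, lower triangle only
lowerCor(example1[,1:5])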
#eigenvalues
eigen(rexample1)
## eigen() decomposition
## $values
## [1] 2.9087833 0.8255286 0.5287150 0.3921322 0.3448408
##
## $vectors
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.4247703 -0.4471545 0.7044090 0.3315458 -0.1162257
## [2,] -0.4538706 -0.2816093 -0.6484062 0.4945009 0.2230158
## [3,] -0.4713585 -0.3309797 -0.1900901 -0.7443628 -0.2793984
## [4,] -0.4185413 0.6510051 -0.0358461 0.2087298 -0.5967935
## [5,] -0.4649903 0.4328901 0.2143789 -0.2188659 0.7088932
By the Kaiser rule (retain components whose eigenvalues exceed 1), we would extract 1 component.
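The rule can be checked in one line:
#Kaiser rule: count eigenvalues greater than 1
sum(eigen(rexample1)$values > 1)
## [1] 1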
#scree plot
fa.parallel(rexample1,fa="pc",n.obs=652) #using eigenvalues
## Parallel analysis suggests that the number of factors = NA and the number of components = 1
Note that this can be done with either the correlation matrix or the original data; the latter produces a second version of the parallel analysis that uses resampling.
#parallel analysis
fa.parallel(example1[,1:5],fa="pc")
## Parallel analysis suggests that the number of factors = NA and the number of components = 1
It seems that everything points to 1 component being enough to provide a good summary, although we’d need all 5 to completely exhaust the variance in these variables.
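To see how the variance accumulates, divide the cumulative eigenvalues by 5 (the total standardized variance); a minimal sketch:
#cumulative proportion of variance across all 5 components
cumsum(eigen(rexample1)$values) / 5
## [1] 0.5817567 0.7468624 0.8526054 0.9310318 1.0000000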
output1 <- principal(example1[,1:5],nfactors=1,scores = TRUE)
output1
## Principal Components Analysis
## Call: principal(r = example1[, 1:5], nfactors = 1, scores = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 h2 u2 com
## picvoc 0.72 0.52 0.48 1
## letword 0.77 0.60 0.40 1
## passcomp 0.80 0.65 0.35 1
## calc 0.71 0.51 0.49 1
## appprobs 0.79 0.63 0.37 1
##
## PC1
## SS loadings 2.91
## Proportion Var 0.58
##
## Mean item complexity = 1
## Test of the hypothesis that 1 component is sufficient.
##
## The root mean square of the residuals (RMSR) is 0.13
## with the empirical chi square 224.25 with prob < 1.8e-46
##
## Fit based upon off diagonal values = 0.93
Proportion Var says that 58% of the total variance is accounted for by the single component. The remaining 42% is in the components we decided weren’t important enough to bother with.
The PC1 column holds the loadings: the correlations between each variable and the component.
h2 is the communality: the proportion of a variable’s variance accounted for by the retained component(s).
u2 is the uniqueness: 1 - h2, the variance left unaccounted for.
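With a single component, these quantities are easy to verify from the loadings; a minimal sketch:
#with one retained component, h2 is the squared loading and u2 = 1 - h2
L <- output1$loadings[,1]
round(L^2, 2)      #matches the h2 column
round(1 - L^2, 2)  #matches the u2 column
sum(L^2)           #SS loadings = 2.91, the first eigenvalue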
Do not pay attention to the statistical test of the sufficiency of components; it isn’t really relevant. I don’t think it should even be printed.
Let’s now combine the component scores with the original data.
#combine component scores with original data
example1$PC1 <- output1$scores[,1]
head(example1)
## # A tibble: 6 × 10
## picvoc letword passcomp calc appprobs peercomp coop assert selfcont PC1
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 509 498 496 503 508 17 18 11 17 -0.900
## 2 512 514 508 534 530 13 14 13 11 1.02
## 3 512 524 510 514 511 15 17 11 16 0.313
## 4 486 488 493 503 499 9 2 14 5 -1.93
## 5 512 528 521 523 516 18 19 15 17 1.00
## 6 505 520 510 528 525 18 17 19 18 0.753
Note that component scores are automatically standardized when they are created.
mean(example1$PC1) #very close to 0
## [1] 6.015978e-16
sd(example1$PC1) #SD = 1
## [1] 1
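Under the hood, the scores are just the standardized variables multiplied by the scoring weights that principal() returns; a sketch to confirm (assumes the weights are stored in output1$weights, as in current versions of psych):
#reconstruct the component scores from the weights
manual <- scale(example1[,1:5]) %*% output1$weights
head(cbind(manual = manual[,1], psych = example1$PC1))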
Now, let’s get to the point: why should we do this?
output_picvoc <- lm(picvoc~peercomp + coop + assert + selfcont,data=example1)
summary(output_picvoc)
##
## Call:
## lm(formula = picvoc ~ peercomp + coop + assert + selfcont, data = example1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.006 -7.261 0.017 7.635 35.058
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 502.3714 2.2246 225.826 < 2e-16 ***
## peercomp -0.8702 0.4111 -2.117 0.034678 *
## coop 0.2486 0.1524 1.631 0.103292
## assert 0.6355 0.1831 3.471 0.000552 ***
## selfcont 0.4314 0.3131 1.378 0.168730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.69 on 647 degrees of freedom
## Multiple R-squared: 0.02754, Adjusted R-squared: 0.02153
## F-statistic: 4.582 on 4 and 647 DF, p-value: 0.00118
output_letword <- lm(letword~peercomp + coop + assert + selfcont,data=example1)
summary(output_letword)
##
## Call:
## lm(formula = letword ~ peercomp + coop + assert + selfcont, data = example1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.908 -10.178 0.257 10.439 43.523
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 506.7609 3.1008 163.429 <2e-16 ***
## peercomp -1.3426 0.5731 -2.343 0.0194 *
## coop 0.6814 0.2124 3.209 0.0014 **
## assert 0.6076 0.2552 2.381 0.0176 *
## selfcont 0.5189 0.4365 1.189 0.2349
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.91 on 647 degrees of freedom
## Multiple R-squared: 0.02533, Adjusted R-squared: 0.01931
## F-statistic: 4.204 on 4 and 647 DF, p-value: 0.002282
output_passcomp <- lm(passcomp~peercomp + coop + assert + selfcont,data=example1)
summary(output_passcomp)
##
## Call:
## lm(formula = passcomp ~ peercomp + coop + assert + selfcont,
## data = example1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.916 -6.171 0.541 6.813 33.798
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 498.1239 2.1309 233.759 < 2e-16 ***
## peercomp -0.2432 0.3938 -0.618 0.53702
## coop 0.4644 0.1459 3.182 0.00153 **
## assert 0.2369 0.1754 1.351 0.17717
## selfcont 0.1544 0.2999 0.515 0.60685
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.24 on 647 degrees of freedom
## Multiple R-squared: 0.03494, Adjusted R-squared: 0.02897
## F-statistic: 5.856 on 4 and 647 DF, p-value: 0.0001241
output_calc <- lm(calc~peercomp + coop + assert + selfcont,data=example1)
summary(output_calc)
##
## Call:
## lm(formula = calc ~ peercomp + coop + assert + selfcont, data = example1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.362 -7.669 -0.488 7.417 38.125
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 503.0109 2.3710 212.150 < 2e-16 ***
## peercomp -0.2471 0.4382 -0.564 0.573
## coop 0.8966 0.1624 5.522 4.86e-08 ***
## assert 0.2721 0.1951 1.395 0.164
## selfcont -0.2522 0.3337 -0.756 0.450
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.4 on 647 degrees of freedom
## Multiple R-squared: 0.0563, Adjusted R-squared: 0.05046
## F-statistic: 9.649 on 4 and 647 DF, p-value: 1.389e-07
output_appprobs <- lm(appprobs~peercomp + coop + assert + selfcont,data=example1)
summary(output_appprobs)
##
## Call:
## lm(formula = appprobs ~ peercomp + coop + assert + selfcont,
## data = example1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.131 -5.966 0.455 6.210 32.106
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 504.217379 2.053940 245.488 < 2e-16 ***
## peercomp -0.549537 0.379594 -1.448 0.1482
## coop 0.734154 0.140666 5.219 2.42e-07 ***
## assert 0.344865 0.169037 2.040 0.0417 *
## selfcont -0.003132 0.289108 -0.011 0.9914
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.873 on 647 degrees of freedom
## Multiple R-squared: 0.05106, Adjusted R-squared: 0.04519
## F-statistic: 8.703 on 4 and 647 DF, p-value: 7.599e-07
#notice here I used principal component scores as the DV
output_pc <- lm(PC1~peercomp + coop + assert + selfcont,data=example1)
summary(output_pc)
##
## Call:
## lm(formula = PC1 ~ peercomp + coop + assert + selfcont, data = example1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6111 -0.6462 0.0561 0.6821 3.3204
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.94275 0.20257 -4.654 3.95e-06 ***
## peercomp -0.07026 0.03744 -1.877 0.06099 .
## coop 0.06874 0.01387 4.955 9.25e-07 ***
## assert 0.04670 0.01667 2.801 0.00525 **
## selfcont 0.01784 0.02851 0.626 0.53168
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9738 on 647 degrees of freedom
## Multiple R-squared: 0.05756, Adjusted R-squared: 0.05174
## F-statistic: 9.88 on 4 and 647 DF, p-value: 9.188e-08
Like magic, I just turned the challenge of summarizing five regressions into one regression using a summary of the variables.
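As an aside, the five separate regressions above could have been fit in a short loop; a sketch using base R’s reformulate():
#fit the same model for each subtest and collect the R-squared values
dvs <- c("picvoc", "letword", "passcomp", "calc", "appprobs")
fits <- lapply(dvs, function(dv)
  lm(reformulate(c("peercomp", "coop", "assert", "selfcont"), response = dv),
     data = example1))
setNames(sapply(fits, function(f) summary(f)$r.squared), dvs)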
The second example comes from multiple studies by Salthouse and includes six variables that assess cognitive abilities: syn, ant, raven, sparel, letcom, and patcom.
The objective is to examine the relationship between these cognitive abilities and age.
#load the data
example2 <- read_excel("PCA_example2.xlsx")
#examine a snapshot of the data
head(example2)
## # A tibble: 6 × 8
## age syn ant raven sparel letcom patcom dummy
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 74 8 10 0.06 7 5.5 10 0
## 2 31 5 5 0.44 15 13 20 0
## 3 42 3 3 0.06 2 6 10 0
## 4 69 10 10 0.5 8 11.5 15 0
## 5 31 5 2 0.5 9 11.5 18 0
## 6 66 6 5 0.13 3 8 15 0
#data summary
summary(example2)
## age syn ant raven
## Min. :18.00 Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.:36.00 1st Qu.: 6.000 1st Qu.: 5.000 1st Qu.:0.3100
## Median :49.00 Median : 9.000 Median : 8.000 Median :0.5000
## Mean :49.04 Mean : 7.583 Mean : 7.244 Mean :0.4826
## 3rd Qu.:62.00 3rd Qu.:10.000 3rd Qu.:10.000 3rd Qu.:0.6300
## Max. :91.00 Max. :10.000 Max. :10.000 Max. :1.0000
## sparel letcom patcom dummy
## Min. : 0.000 Min. :-6.50 Min. : 0.00 Min. :0
## 1st Qu.: 5.000 1st Qu.: 9.00 1st Qu.:15.00 1st Qu.:0
## Median : 9.000 Median :10.50 Median :18.00 Median :0
## Mean : 9.365 Mean :10.59 Mean :17.99 Mean :0
## 3rd Qu.:13.000 3rd Qu.:12.50 3rd Qu.:21.00 3rd Qu.:0
## Max. :20.000 Max. :20.50 Max. :30.00 Max. :0
Now let’s examine the relationship between age and these cognitive abilities using linear regression.
output_syn <- lm(syn ~ age, data=example2)
summary(output_syn) #positive coefficient
##
## Call:
## lm(formula = syn ~ age, data = example2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.610 -1.424 0.832 1.885 4.075
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.963093 0.279684 17.745 <2e-16 ***
## age 0.053415 0.005398 9.896 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.592 on 822 degrees of freedom
## Multiple R-squared: 0.1064, Adjusted R-squared: 0.1054
## F-statistic: 97.92 on 1 and 822 DF, p-value: < 2.2e-16
output_ant <- lm(ant ~ age, data=example2)
summary(output_ant) #positive coefficient
##
## Call:
## lm(formula = ant ~ age, data = example2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4902 -1.6726 0.8438 2.1450 4.0487
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.133690 0.297512 17.255 < 2e-16 ***
## age 0.043032 0.005742 7.494 1.72e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.757 on 822 degrees of freedom
## Multiple R-squared: 0.06396, Adjusted R-squared: 0.06282
## F-statistic: 56.16 on 1 and 822 DF, p-value: 1.725e-13
output_raven <- lm(raven ~ age, data=example2)
summary(output_raven) #negative coefficient
##
## Call:
## lm(formula = raven ~ age, data = example2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6413 -0.1127 0.0167 0.1299 0.4317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7814054 0.0192848 40.52 <2e-16 ***
## age -0.0060933 0.0003722 -16.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1787 on 822 degrees of freedom
## Multiple R-squared: 0.2459, Adjusted R-squared: 0.245
## F-statistic: 268 on 1 and 822 DF, p-value: < 2.2e-16
output_sparel <- lm(sparel ~ age, data=example2)
summary(output_sparel) #negative coefficient
##
## Call:
## lm(formula = sparel ~ age, data = example2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2801 -3.8185 -0.4324 3.6027 12.6686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.12371 0.52415 26.946 <2e-16 ***
## age -0.09703 0.01012 -9.592 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.858 on 822 degrees of freedom
## Multiple R-squared: 0.1007, Adjusted R-squared: 0.09957
## F-statistic: 92.01 on 1 and 822 DF, p-value: < 2.2e-16
output_letcom <- lm(letcom ~ age, data=example2)
summary(output_letcom) #negative coefficient
##
## Call:
## lm(formula = letcom ~ age, data = example2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.9039 -1.4701 0.1423 1.5315 8.4702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.841266 0.281272 49.21 <2e-16 ***
## age -0.066230 0.005429 -12.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.607 on 822 degrees of freedom
## Multiple R-squared: 0.1533, Adjusted R-squared: 0.1523
## F-statistic: 148.8 on 1 and 822 DF, p-value: < 2.2e-16
output_patcom <- lm(patcom ~ age, data=example2)
summary(output_patcom) #negative coefficient
##
## Call:
## lm(formula = patcom ~ age, data = example2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.9815 -2.0445 0.0058 2.1287 10.7879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.485922 0.379427 64.53 <2e-16 ***
## age -0.132484 0.007323 -18.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.516 on 822 degrees of freedom
## Multiple R-squared: 0.2848, Adjusted R-squared: 0.2839
## F-statistic: 327.3 on 1 and 822 DF, p-value: < 2.2e-16
The results from the regression analyses seem a lot harder to summarize.
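To make the contrast concrete, we can collect the six age slopes in one table; a sketch reusing the fits above:
#pull the age row from each coefficient table
fits2 <- list(syn = output_syn, ant = output_ant, raven = output_raven,
              sparel = output_sparel, letcom = output_letcom, patcom = output_patcom)
round(t(sapply(fits2, function(f) coef(summary(f))["age", ])), 4)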
Let’s see what PCA says.
fa.parallel(example2[,2:7],fa="pc")
## Parallel analysis suggests that the number of factors = NA and the number of components = 2
Ah, we need 2 components to summarize these variables.
output2 <- principal(example2[,2:7],nfactors=2,scores = TRUE,rotate="none")
output2
## Principal Components Analysis
## Call: principal(r = example2[, 2:7], nfactors = 2, rotate = "none",
## scores = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 h2 u2 com
## syn 0.48 0.80 0.87 0.13 1.6
## ant 0.56 0.73 0.85 0.15 1.9
## raven 0.81 -0.20 0.70 0.30 1.1
## sparel 0.78 -0.06 0.62 0.38 1.0
## letcom 0.68 -0.38 0.60 0.40 1.6
## patcom 0.70 -0.47 0.71 0.29 1.7
##
## PC1 PC2
## SS loadings 2.77 1.58
## Proportion Var 0.46 0.26
## Cumulative Var 0.46 0.73
## Proportion Explained 0.64 0.36
## Cumulative Proportion 0.64 1.00
##
## Mean item complexity = 1.5
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.1
## with the empirical chi square 249.25 with prob < 9.5e-53
##
## Fit based upon off diagonal values = 0.94
The first component accounts for 46% of the variance. Remember that PCA extracts components in decreasing order of magnitude, so this is the largest amount of variance any single linear combination of these variables can capture.
The second accounts for 26% of the variance.
Cumulative: together they account for 73% of the variance in the 6 variables.
The loadings for the first principal component are all positive and moderate to large, so the first component is a summary of all the variables.
The loadings for the second principal component are a mixture of positive and negative. This is seriously hard to interpret.
This pattern is quite common (i.e., all positive loadings on the first component, a mixture of positive and negative on the others).
This is where rotation may be helpful.
output3 <- principal(example2[,2:7],nfactors=2,scores = TRUE,rotate="varimax")
output3
## Principal Components Analysis
## Call: principal(r = example2[, 2:7], nfactors = 2, rotate = "varimax",
## scores = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 h2 u2 com
## syn 0.04 0.93 0.87 0.13 1.0
## ant 0.14 0.91 0.85 0.15 1.0
## raven 0.81 0.22 0.70 0.30 1.1
## sparel 0.72 0.32 0.62 0.38 1.4
## letcom 0.78 -0.01 0.60 0.40 1.0
## patcom 0.84 -0.08 0.71 0.29 1.0
##
## RC1 RC2
## SS loadings 2.49 1.86
## Proportion Var 0.42 0.31
## Cumulative Var 0.42 0.73
## Proportion Explained 0.57 0.43
## Cumulative Proportion 0.57 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.1
## with the empirical chi square 249.25 with prob < 9.5e-53
##
## Fit based upon off diagonal values = 0.94
That’s better. Now the interpretation is much clearer.
The first component is raven, sparel, letcom, and patcom.
The second component is syn and ant.
This is consistent with theory, which says syn and ant are crystallized skills and the others are fluid skills.
The rotation has just redistributed the variance to improve interpretability.
output4 <- principal(example2[,2:7],nfactors=2,scores = TRUE,rotate="promax")
output4
## Principal Components Analysis
## Call: principal(r = example2[, 2:7], nfactors = 2, rotate = "promax",
## scores = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
## RC1 RC2 h2 u2 com
## syn -0.08 0.95 0.87 0.13 1.0
## ant 0.03 0.92 0.85 0.15 1.0
## raven 0.80 0.12 0.70 0.30 1.0
## sparel 0.70 0.24 0.62 0.38 1.2
## letcom 0.80 -0.11 0.60 0.40 1.0
## patcom 0.87 -0.19 0.71 0.29 1.1
##
## RC1 RC2
## SS loadings 2.51 1.85
## Proportion Var 0.42 0.31
## Cumulative Var 0.42 0.73
## Proportion Explained 0.58 0.42
## Cumulative Proportion 0.58 1.00
##
## With component correlations of
## RC1 RC2
## RC1 1.00 0.25
## RC2 0.25 1.00
##
## Mean item complexity = 1.1
## Test of the hypothesis that 2 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.1
## with the empirical chi square 249.25 with prob < 9.5e-53
##
## Fit based upon off diagonal values = 0.94
The total variance explained is the same, although each component now accounts for a different amount, and the components overlap because their correlation is .25.
Square that to get .0625, so the components share only 6.25% of their variance. The change is small enough that you can only see it in the SS loadings.
If the correlation were larger, the proportions of variance could look quite different, but together the components would still account for the same total amount.
Here we can see that rotation changes the way variance is shared, but not the total amount of shared variance.
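One quick way to see this invariance is to compare the communalities across the three solutions; they should be identical (a sketch, assuming each result keeps its communality vector, as psych’s principal does):
#h2 is unchanged by rotation, orthogonal or oblique
round(cbind(unrotated = output2$communality,
            varimax = output3$communality,
            promax = output4$communality), 2)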
Note that the psych package’s principal function outputs only the pattern matrix, not the structure matrix. They are mathematically related, but the structure matrix would be more appropriate.
Revelle, the author of psych, comes from a factor-analytic tradition. Kudos to him for having a separate function for PCA, but I wish there were some way to output the structure matrix.
The component scores are created (correctly) from the structure matrix, but the interpretation of the components is less precise without it.
The two matrices will generally be similar.
As we’ll see in factor analysis, the variables are linear combinations of the factors, and the pattern matrix holds the coefficients for those linear combinations.
For PCA, the pattern matrix runs in the wrong direction: we need the coefficients for the linear combinations of the variables that create the components.
There is a way to calculate the structure matrix yourself (a short sketch follows), but it’s rarely worth the effort.
We’ll just live with the wrong matrix.
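For the curious: the structure matrix is the pattern matrix post-multiplied by the component correlation matrix (S = P %*% Phi), and principal() returns Phi for oblique rotations, so it is a one-liner:
#structure matrix = pattern matrix %*% component correlations
S <- unclass(output4$loadings) %*% output4$Phi
round(S, 2)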
#combine components with original data
example2$PC1 <- output4$scores[,1]
example2$PC2 <- output4$scores[,2]
#check correlation
cor(example2$PC1,example2$PC2)
## [1] 0.2461769
This matches what the analysis said.
This happens with PCA, but not with factor analysis, where factor scores are only estimates of the factors.
output_PC1 <- lm(PC1 ~ age, data=example2)
summary(output_PC1)
##
## Call:
## lm(formula = PC1 ~ age, data = example2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2138 -0.5493 0.0215 0.5684 2.5042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.628212 0.089760 18.14 <2e-16 ***
## age -0.033202 0.001732 -19.17 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8319 on 822 degrees of freedom
## Multiple R-squared: 0.3089, Adjusted R-squared: 0.308
## F-statistic: 367.3 on 1 and 822 DF, p-value: < 2.2e-16
output_PC2 <- lm(PC2 ~ age, data=example2)
summary(output_PC2)
##
## Call:
## lm(formula = PC2 ~ age, data = example2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1209 -0.5769 0.2673 0.7216 1.5826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.833832 0.103504 -8.056 2.76e-15 ***
## age 0.017003 0.001998 8.512 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9592 on 822 degrees of freedom
## Multiple R-squared: 0.081, Adjusted R-squared: 0.07988
## F-statistic: 72.45 on 1 and 822 DF, p-value: < 2.2e-16
These results are easier to understand: PC1 (the fluid component) declines with age, while PC2 (the crystallized component) increases.
Yes, it’s a little harder to interpret a principal component than the original variables, but at least it’s only 2 regressions instead of 6.
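To close, a quick base-graphics sketch of the two opposing trends (purely illustrative):
#scatter of each component score against age, with fitted lines
plot(example2$age, example2$PC1, pch = 16, col = "grey60",
     xlab = "Age", ylab = "Component score")
points(example2$age, example2$PC2, pch = 1, col = "steelblue")
abline(lm(PC1 ~ age, data = example2), lwd = 2, col = "grey20")
abline(lm(PC2 ~ age, data = example2), lwd = 2, col = "steelblue")
legend("topright", legend = c("PC1 (fluid)", "PC2 (crystallized)"),
       col = c("grey20", "steelblue"), lty = 1, lwd = 2)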