Introduction

Welcome to this R demo session! Here, I will demonstrate how to use R to conduct principal component analysis (PCA).

Example 1

Our first example dataset comes from the National Institute of Child Health and Human Development (NICHD). It includes five subtests designed to measure different aspects of intelligence:

  • Picture Vocabulary
  • Letter Word Identification
  • Passage Comprehension
  • Calculation
  • Applied Problems

Our goal is to create a principal component that reflects general intelligence. We will then use this component as an outcome predicted by a set of social skills variables.

#load the data
library(readxl)
example1<-read_excel("pca_example1.xlsx")

#examine a snapshot of the data
head(example1)
## # A tibble: 6 × 9
##   picvoc letword passcomp  calc appprobs peercomp  coop assert selfcont
##    <dbl>   <dbl>    <dbl> <dbl>    <dbl>    <dbl> <dbl>  <dbl>    <dbl>
## 1    509     498      496   503      508       17    18     11       17
## 2    512     514      508   534      530       13    14     13       11
## 3    512     524      510   514      511       15    17     11       16
## 4    486     488      493   503      499        9     2     14        5
## 5    512     528      521   523      516       18    19     15       17
## 6    505     520      510   528      525       18    17     19       18
#data summary
summary(example1)
##      picvoc       letword         passcomp          calc          appprobs    
##  Min.   :476   Min.   :460.0   Min.   :466.0   Min.   :469.0   Min.   :477.0  
##  1st Qu.:501   1st Qu.:504.0   1st Qu.:502.0   1st Qu.:506.0   1st Qu.:508.0  
##  Median :509   Median :514.0   Median :508.0   Median :514.0   Median :514.0  
##  Mean   :508   Mean   :513.2   Mean   :507.6   Mean   :513.8   Mean   :512.4  
##  3rd Qu.:516   3rd Qu.:524.0   3rd Qu.:516.0   3rd Qu.:523.0   3rd Qu.:519.0  
##  Max.   :542   Max.   :557.0   Max.   :542.0   Max.   :551.0   Max.   :544.0  
##     peercomp          coop           assert         selfcont    
##  Min.   : 3.00   Min.   : 2.00   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.:14.00   1st Qu.:15.00   1st Qu.:11.00   1st Qu.:13.00  
##  Median :16.00   Median :18.00   Median :14.00   Median :16.00  
##  Mean   :15.68   Mean   :16.65   Mean   :13.34   Mean   :15.53  
##  3rd Qu.:18.00   3rd Qu.:19.00   3rd Qu.:16.25   3rd Qu.:18.00  
##  Max.   :20.00   Max.   :20.00   Max.   :20.00   Max.   :20.00

Now let’s examine the descriptive information.

#a nice simple way to get lots of descriptive information
library(psych)
describe(example1[,1:5])
##          vars   n   mean    sd median trimmed   mad min max range  skew
## picvoc      1 652 508.04 10.81    509  508.06 11.86 476 542    66 -0.01
## letword     2 652 513.21 15.05    514  513.59 14.83 460 557    97 -0.25
## passcomp    3 652 507.60 10.40    508  507.89  8.90 466 542    76 -0.34
## calc        4 652 513.78 11.70    514  513.82 11.86 469 551    82 -0.05
## appprobs    5 652 512.38 10.10    514  512.64  8.90 477 544    67 -0.26
##          kurtosis   se
## picvoc       0.07 0.42
## letword      0.22 0.59
## passcomp     0.66 0.41
## calc         0.25 0.46
## appprobs     0.50 0.40
#correlations
rexample1<-corr.test(example1[,1:5])$r
rexample1
##             picvoc   letword  passcomp      calc  appprobs
## picvoc   1.0000000 0.4786041 0.5481996 0.3145295 0.4377038
## letword  0.4786041 1.0000000 0.5985786 0.4080867 0.4518319
## passcomp 0.5481996 0.5985786 1.0000000 0.3961540 0.4932971
## calc     0.3145295 0.4080867 0.3961540 1.0000000 0.6308794
## appprobs 0.4377038 0.4518319 0.4932971 0.6308794 1.0000000

We can see that all pairs of variables have moderate correlations, so this is good data for PCA.
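
If we want something more formal than eyeballing the correlation matrix, psych also provides the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy and Bartlett's test of sphericity. This is an optional sketch, not part of the original demo:

#optional adequacy checks (a sketch; psych is loaded above)
KMO(rexample1)                    #overall MSA above roughly .6 is commonly considered adequate
cortest.bartlett(rexample1,n=652) #tests whether the correlation matrix differs from an identity matrix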

#eigenvalues
eigen(rexample1)
## eigen() decomposition
## $values
## [1] 2.9087833 0.8255286 0.5287150 0.3921322 0.3448408
## 
## $vectors
##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,] -0.4247703 -0.4471545  0.7044090  0.3315458 -0.1162257
## [2,] -0.4538706 -0.2816093 -0.6484062  0.4945009  0.2230158
## [3,] -0.4713585 -0.3309797 -0.1900901 -0.7443628 -0.2793984
## [4,] -0.4185413  0.6510051 -0.0358461  0.2087298 -0.5967935
## [5,] -0.4649903  0.4328901  0.2143789 -0.2188659  0.7088932

By the Kaiser rule (retain components with eigenvalues greater than 1), we would extract one component.
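
As a quick sanity check, we can apply the Kaiser rule directly to the eigenvalues computed above; a minimal sketch:

#count eigenvalues greater than 1 (Kaiser rule)
evals <- eigen(rexample1)$values
sum(evals > 1) #returns 1 here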

#scree plot 
fa.parallel(rexample1,fa="pc",n.obs=652) #using eigenvalues

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  1

Note that this can be done with the correlation matrix or with the original data; the latter also produces a second version of the parallel analysis that uses resampling.

#parallel analysis
fa.parallel(example1[,1:5],fa="pc")

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  1

It seems that everything points to one component being enough to provide a good summary, although we’d need all five components to completely exhaust the variance in these variables.
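
We can verify this from the eigenvalues themselves: each eigenvalue divided by their sum (which equals 5, the number of variables) gives the proportion of total variance that component accounts for, and the cumulative sum reaches 1 only with all five. A small sketch, reusing evals from the Kaiser-rule check above:

#proportion of total variance per component
evals/sum(evals)         #first value is about .58
cumsum(evals)/sum(evals) #reaches 1 only with all five components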

output1 <- principal(example1[,1:5],nfactors=1,scores = TRUE)
output1
## Principal Components Analysis
## Call: principal(r = example1[, 1:5], nfactors = 1, scores = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
##           PC1   h2   u2 com
## picvoc   0.72 0.52 0.48   1
## letword  0.77 0.60 0.40   1
## passcomp 0.80 0.65 0.35   1
## calc     0.71 0.51 0.49   1
## appprobs 0.79 0.63 0.37   1
## 
##                 PC1
## SS loadings    2.91
## Proportion Var 0.58
## 
## Mean item complexity =  1
## Test of the hypothesis that 1 component is sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.13 
##  with the empirical chi square  224.25  with prob <  1.8e-46 
## 
## Fit based upon off diagonal values = 0.93

Proportion Var says that 58% of the total variance is accounted for by the single component. The remaining 42% sits in the components we decided weren’t important enough to retain.

The PC1 column contains the loadings:

  • All loadings are high and similar in size
  • They aren’t quite the coefficients of the linear combination that creates the component
  • But they are the most common output used to interpret the components
  • Here, the component is essentially an equally weighted sum of the (standardized) variables (standardized because we are using the correlation matrix); see the sketch after this list
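
To make the linear-combination point concrete, here is a minimal sketch of how the scores are built by hand: standardize the variables and multiply by the weights that principal() stores in output1$weights (the weights, not the loadings, are the actual coefficients):

#recreate the component scores by hand (a sketch)
z <- scale(example1[,1:5])             #standardize, since we used the correlation matrix
manual_scores <- z %*% output1$weights #weights are the coefficients that create the scores
cor(manual_scores, output1$scores)     #essentially 1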

h2 is communality:

  • Proportion of variance in that variable accounted for by the retained components
  • This is more important with factor analysis
  • Not particularly important for PCA

u2 is uniqueness:

  • The proportion of variance in that variable not accounted for by the retained components
  • Also much more important in factor analysis

Do not pay attention to the statistical test of the sufficiency of the components. It isn’t really relevant here, and I don’t think it should even be printed.

Let’s now combine component scores with original data

#combine component scores with original data
example1$PC1 <- output1$scores[,1]

head(example1)
## # A tibble: 6 × 10
##   picvoc letword passcomp  calc appprobs peercomp  coop assert selfcont    PC1
##    <dbl>   <dbl>    <dbl> <dbl>    <dbl>    <dbl> <dbl>  <dbl>    <dbl>  <dbl>
## 1    509     498      496   503      508       17    18     11       17 -0.900
## 2    512     514      508   534      530       13    14     13       11  1.02 
## 3    512     524      510   514      511       15    17     11       16  0.313
## 4    486     488      493   503      499        9     2     14        5 -1.93 
## 5    512     528      521   523      516       18    19     15       17  1.00 
## 6    505     520      510   528      525       18    17     19       18  0.753

Note that component scores are automatically standardized by the way they are created.

mean(example1$PC1) #very close to 0
## [1] 6.015978e-16
sd(example1$PC1) #SD = 1
## [1] 1

Now, let’s get to the point: why should we do this?

output_picvoc <- lm(picvoc~peercomp + coop + assert + selfcont,data=example1)
summary(output_picvoc)
## 
## Call:
## lm(formula = picvoc ~ peercomp + coop + assert + selfcont, data = example1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.006  -7.261   0.017   7.635  35.058 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 502.3714     2.2246 225.826  < 2e-16 ***
## peercomp     -0.8702     0.4111  -2.117 0.034678 *  
## coop          0.2486     0.1524   1.631 0.103292    
## assert        0.6355     0.1831   3.471 0.000552 ***
## selfcont      0.4314     0.3131   1.378 0.168730    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.69 on 647 degrees of freedom
## Multiple R-squared:  0.02754,    Adjusted R-squared:  0.02153 
## F-statistic: 4.582 on 4 and 647 DF,  p-value: 0.00118
output_letword <- lm(letword~peercomp + coop + assert + selfcont,data=example1)
summary(output_letword)
## 
## Call:
## lm(formula = letword ~ peercomp + coop + assert + selfcont, data = example1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.908 -10.178   0.257  10.439  43.523 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 506.7609     3.1008 163.429   <2e-16 ***
## peercomp     -1.3426     0.5731  -2.343   0.0194 *  
## coop          0.6814     0.2124   3.209   0.0014 ** 
## assert        0.6076     0.2552   2.381   0.0176 *  
## selfcont      0.5189     0.4365   1.189   0.2349    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.91 on 647 degrees of freedom
## Multiple R-squared:  0.02533,    Adjusted R-squared:  0.01931 
## F-statistic: 4.204 on 4 and 647 DF,  p-value: 0.002282
output_passcomp <- lm(passcomp~peercomp + coop + assert + selfcont,data=example1)
summary(output_passcomp)
## 
## Call:
## lm(formula = passcomp ~ peercomp + coop + assert + selfcont, 
##     data = example1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.916  -6.171   0.541   6.813  33.798 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 498.1239     2.1309 233.759  < 2e-16 ***
## peercomp     -0.2432     0.3938  -0.618  0.53702    
## coop          0.4644     0.1459   3.182  0.00153 ** 
## assert        0.2369     0.1754   1.351  0.17717    
## selfcont      0.1544     0.2999   0.515  0.60685    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.24 on 647 degrees of freedom
## Multiple R-squared:  0.03494,    Adjusted R-squared:  0.02897 
## F-statistic: 5.856 on 4 and 647 DF,  p-value: 0.0001241
output_calc <- lm(calc~peercomp + coop + assert + selfcont,data=example1)
summary(output_calc)
## 
## Call:
## lm(formula = calc ~ peercomp + coop + assert + selfcont, data = example1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.362  -7.669  -0.488   7.417  38.125 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 503.0109     2.3710 212.150  < 2e-16 ***
## peercomp     -0.2471     0.4382  -0.564    0.573    
## coop          0.8966     0.1624   5.522 4.86e-08 ***
## assert        0.2721     0.1951   1.395    0.164    
## selfcont     -0.2522     0.3337  -0.756    0.450    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.4 on 647 degrees of freedom
## Multiple R-squared:  0.0563, Adjusted R-squared:  0.05046 
## F-statistic: 9.649 on 4 and 647 DF,  p-value: 1.389e-07
output_appprobs <- lm(appprobs~peercomp + coop + assert + selfcont,data=example1)
summary(output_appprobs)
## 
## Call:
## lm(formula = appprobs ~ peercomp + coop + assert + selfcont, 
##     data = example1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.131  -5.966   0.455   6.210  32.106 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 504.217379   2.053940 245.488  < 2e-16 ***
## peercomp     -0.549537   0.379594  -1.448   0.1482    
## coop          0.734154   0.140666   5.219 2.42e-07 ***
## assert        0.344865   0.169037   2.040   0.0417 *  
## selfcont     -0.003132   0.289108  -0.011   0.9914    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.873 on 647 degrees of freedom
## Multiple R-squared:  0.05106,    Adjusted R-squared:  0.04519 
## F-statistic: 8.703 on 4 and 647 DF,  p-value: 7.599e-07
#notice here I used principal component scores as the DV
output_pc <- lm(PC1~peercomp + coop + assert + selfcont,data=example1)
summary(output_pc)
## 
## Call:
## lm(formula = PC1 ~ peercomp + coop + assert + selfcont, data = example1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6111 -0.6462  0.0561  0.6821  3.3204 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.94275    0.20257  -4.654 3.95e-06 ***
## peercomp    -0.07026    0.03744  -1.877  0.06099 .  
## coop         0.06874    0.01387   4.955 9.25e-07 ***
## assert       0.04670    0.01667   2.801  0.00525 ** 
## selfcont     0.01784    0.02851   0.626  0.53168    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9738 on 647 degrees of freedom
## Multiple R-squared:  0.05756,    Adjusted R-squared:  0.05174 
## F-statistic:  9.88 on 4 and 647 DF,  p-value: 9.188e-08

Like magic, I just turned the challenge of summarizing five regressions into one regression using a summary of the variables.

Example 2

The second example comes from multiple studies by Salthouse and includes six variables that assess cognitive abilities:

  • Knowledge of Synonyms
  • Knowledge of Antonyms
  • Performance on Raven’s Matrices
  • Spatial Relations
  • Letter Comparison
  • Pattern Comparison

The objective is to examine the relationship between these cognitive abilities and age.

#load the data
example2<-read_excel("PCA_example2.xlsx")

#examine a snapshot of the data
head(example2)
## # A tibble: 6 × 8
##     age   syn   ant raven sparel letcom patcom dummy
##   <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl>
## 1    74     8    10  0.06      7    5.5     10     0
## 2    31     5     5  0.44     15   13       20     0
## 3    42     3     3  0.06      2    6       10     0
## 4    69    10    10  0.5       8   11.5     15     0
## 5    31     5     2  0.5       9   11.5     18     0
## 6    66     6     5  0.13      3    8       15     0
#data summary
summary(example2)
##       age             syn              ant             raven       
##  Min.   :18.00   Min.   : 0.000   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.:36.00   1st Qu.: 6.000   1st Qu.: 5.000   1st Qu.:0.3100  
##  Median :49.00   Median : 9.000   Median : 8.000   Median :0.5000  
##  Mean   :49.04   Mean   : 7.583   Mean   : 7.244   Mean   :0.4826  
##  3rd Qu.:62.00   3rd Qu.:10.000   3rd Qu.:10.000   3rd Qu.:0.6300  
##  Max.   :91.00   Max.   :10.000   Max.   :10.000   Max.   :1.0000  
##      sparel           letcom          patcom          dummy  
##  Min.   : 0.000   Min.   :-6.50   Min.   : 0.00   Min.   :0  
##  1st Qu.: 5.000   1st Qu.: 9.00   1st Qu.:15.00   1st Qu.:0  
##  Median : 9.000   Median :10.50   Median :18.00   Median :0  
##  Mean   : 9.365   Mean   :10.59   Mean   :17.99   Mean   :0  
##  3rd Qu.:13.000   3rd Qu.:12.50   3rd Qu.:21.00   3rd Qu.:0  
##  Max.   :20.000   Max.   :20.50   Max.   :30.00   Max.   :0

Now let’s examine the relationship between age and each of these cognitive abilities using linear regression.

output_syn <- lm(syn ~ age, data=example2)
summary(output_syn) #positive coefficient
## 
## Call:
## lm(formula = syn ~ age, data = example2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.610 -1.424  0.832  1.885  4.075 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.963093   0.279684  17.745   <2e-16 ***
## age         0.053415   0.005398   9.896   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.592 on 822 degrees of freedom
## Multiple R-squared:  0.1064, Adjusted R-squared:  0.1054 
## F-statistic: 97.92 on 1 and 822 DF,  p-value: < 2.2e-16
output_ant <- lm(ant ~ age, data=example2)
summary(output_ant) #positive coefficient
## 
## Call:
## lm(formula = ant ~ age, data = example2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4902 -1.6726  0.8438  2.1450  4.0487 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.133690   0.297512  17.255  < 2e-16 ***
## age         0.043032   0.005742   7.494 1.72e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.757 on 822 degrees of freedom
## Multiple R-squared:  0.06396,    Adjusted R-squared:  0.06282 
## F-statistic: 56.16 on 1 and 822 DF,  p-value: 1.725e-13
output_raven <- lm(raven ~ age, data=example2)
summary(output_raven) #negative coefficient
## 
## Call:
## lm(formula = raven ~ age, data = example2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6413 -0.1127  0.0167  0.1299  0.4317 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.7814054  0.0192848   40.52   <2e-16 ***
## age         -0.0060933  0.0003722  -16.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1787 on 822 degrees of freedom
## Multiple R-squared:  0.2459, Adjusted R-squared:  0.245 
## F-statistic:   268 on 1 and 822 DF,  p-value: < 2.2e-16
output_sparel <- lm(sparel ~ age, data=example2)
summary(output_sparel) #negative coefficient
## 
## Call:
## lm(formula = sparel ~ age, data = example2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.2801  -3.8185  -0.4324   3.6027  12.6686 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.12371    0.52415  26.946   <2e-16 ***
## age         -0.09703    0.01012  -9.592   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.858 on 822 degrees of freedom
## Multiple R-squared:  0.1007, Adjusted R-squared:  0.09957 
## F-statistic: 92.01 on 1 and 822 DF,  p-value: < 2.2e-16
output_letcom <- lm(letcom ~ age, data=example2)
summary(output_letcom) #negative coefficient
## 
## Call:
## lm(formula = letcom ~ age, data = example2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9039  -1.4701   0.1423   1.5315   8.4702 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.841266   0.281272   49.21   <2e-16 ***
## age         -0.066230   0.005429  -12.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.607 on 822 degrees of freedom
## Multiple R-squared:  0.1533, Adjusted R-squared:  0.1523 
## F-statistic: 148.8 on 1 and 822 DF,  p-value: < 2.2e-16
output_patcom <- lm(patcom ~ age, data=example2)
summary(output_patcom) #negative coefficient
## 
## Call:
## lm(formula = patcom ~ age, data = example2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.9815  -2.0445   0.0058   2.1287  10.7879 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.485922   0.379427   64.53   <2e-16 ***
## age         -0.132484   0.007323  -18.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.516 on 822 degrees of freedom
## Multiple R-squared:  0.2848, Adjusted R-squared:  0.2839 
## F-statistic: 327.3 on 1 and 822 DF,  p-value: < 2.2e-16

The results from these regression analyses seem a lot harder to summarize.

Let’s see what PCA says

fa.parallel(example2[,2:7],fa="pc")

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  2

Ah, we need two components to summarize these variables.

output2 <- principal(example2[,2:7],nfactors=2,scores = TRUE,rotate="none")
output2
## Principal Components Analysis
## Call: principal(r = example2[, 2:7], nfactors = 2, rotate = "none", 
##     scores = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
##         PC1   PC2   h2   u2 com
## syn    0.48  0.80 0.87 0.13 1.6
## ant    0.56  0.73 0.85 0.15 1.9
## raven  0.81 -0.20 0.70 0.30 1.1
## sparel 0.78 -0.06 0.62 0.38 1.0
## letcom 0.68 -0.38 0.60 0.40 1.6
## patcom 0.70 -0.47 0.71 0.29 1.7
## 
##                        PC1  PC2
## SS loadings           2.77 1.58
## Proportion Var        0.46 0.26
## Cumulative Var        0.46 0.73
## Proportion Explained  0.64 0.36
## Cumulative Proportion 0.64 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.1 
##  with the empirical chi square  249.25  with prob <  9.5e-53 
## 
## Fit based upon off diagonal values = 0.94

The first component accounts for 46% of the variance. Remember that PCA extracts components in decreasing order of variance accounted for: this is the largest amount of variance that any single linear combination of these variables can yield.

The second component accounts for 26% of the variance.

Cumulative: together they account for 73% of the variance in the six variables.

The loadings for the first principal component are all positive and moderate to large, so the first component is a summary of all the variables.

The loadings for the second principal component are a mixture of positive and negative, which is seriously hard to interpret.

This pattern is quite common (i.e., all positive loadings on the first component, a mixture of positive and negative on the other components).

This is where rotation may be helpful.

Orthogonal rotation: varimax

output3 <- principal(example2[,2:7],nfactors=2,scores = TRUE,rotate="varimax")
output3
## Principal Components Analysis
## Call: principal(r = example2[, 2:7], nfactors = 2, rotate = "varimax", 
##     scores = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
##         RC1   RC2   h2   u2 com
## syn    0.04  0.93 0.87 0.13 1.0
## ant    0.14  0.91 0.85 0.15 1.0
## raven  0.81  0.22 0.70 0.30 1.1
## sparel 0.72  0.32 0.62 0.38 1.4
## letcom 0.78 -0.01 0.60 0.40 1.0
## patcom 0.84 -0.08 0.71 0.29 1.0
## 
##                        RC1  RC2
## SS loadings           2.49 1.86
## Proportion Var        0.42 0.31
## Cumulative Var        0.42 0.73
## Proportion Explained  0.57 0.43
## Cumulative Proportion 0.57 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.1 
##  with the empirical chi square  249.25  with prob <  9.5e-53 
## 
## Fit based upon off diagonal values = 0.94

That’s better. Now the interpretation is much clearer.

The first component is raven, sparel, letcom, and patcom.

The second component is syn and ant.

This is consistent with theory, which says syn and ant are crystallized skills and the others are fluid skills.

  • First component now accounts for 42% of variance
  • Second 31%
  • Sum is still 73%

The rotation has just redistributed the variance to improve interpretability, as the check below confirms.
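
We can confirm the redistribution numerically: rotation changes the column sums of squared loadings but not their total. A quick sketch using the unrotated (output2) and varimax (output3) solutions from above:

#variance redistributed, total unchanged
colSums(output2$loadings^2) #unrotated: about 2.77 and 1.58
colSums(output3$loadings^2) #varimax: about 2.49 and 1.86
sum(output2$loadings^2)     #total is the same...
sum(output3$loadings^2)     #...about 4.35 either way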

Oblique rotation: promax

output4 <- principal(example2[,2:7],nfactors=2,scores = TRUE,rotate="promax")
output4
## Principal Components Analysis
## Call: principal(r = example2[, 2:7], nfactors = 2, rotate = "promax", 
##     scores = TRUE)
## Standardized loadings (pattern matrix) based upon correlation matrix
##          RC1   RC2   h2   u2 com
## syn    -0.08  0.95 0.87 0.13 1.0
## ant     0.03  0.92 0.85 0.15 1.0
## raven   0.80  0.12 0.70 0.30 1.0
## sparel  0.70  0.24 0.62 0.38 1.2
## letcom  0.80 -0.11 0.60 0.40 1.0
## patcom  0.87 -0.19 0.71 0.29 1.1
## 
##                        RC1  RC2
## SS loadings           2.51 1.85
## Proportion Var        0.42 0.31
## Cumulative Var        0.42 0.73
## Proportion Explained  0.58 0.42
## Cumulative Proportion 0.58 1.00
## 
##  With component correlations of 
##      RC1  RC2
## RC1 1.00 0.25
## RC2 0.25 1.00
## 
## Mean item complexity =  1.1
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.1 
##  with the empirical chi square  249.25  with prob <  9.5e-53 
## 
## Fit based upon off diagonal values = 0.94

The total variance explained is the same, although each component accounts for a different amount, and the components overlap because the component correlation is .25.

Square that to get .0625, so the components share only 6.25% of their variance. The overlap is small enough that you can only see it in the SS loadings.

If the correlation were larger, Proportion Var could look quite different, but the components together would still account for the same total amount of variance.

Here we can see that rotation changes the way variance is shared, but not the total amount of shared variance

After recording the lecture, I realized…

That the psych package’s principal function outputs only the pattern matrix, not the structure matrix. They are mathematically related, but the structure matrix would be more appropriate for interpreting an oblique solution.

Revelle, the author of psych, comes from a factor analytic tradition. Kudos to him for having a separate function for PCA, but I wish there were some way to output the structure matrix

The component scores are created (correctly) from the structure matrix, but the interpretation of the components is less precise without the structure matrix

The two matrices will generally be similar

As we’ll see in factor analysis, the variables are linear combinations of the factors, and the pattern matrix contains the coefficients for those linear combinations.

For PCA, the pattern matrix points in the wrong direction: we need the coefficients for the linear combinations of variables that create the components, which is the opposite relationship.

There is a way to calculate the structure matrix with a short R script, but it’s usually not worth the effort; a sketch follows below.

We’ll just live with the wrong matrix.
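
For the curious, the script really is short: with an oblique rotation, the structure matrix is the pattern matrix post-multiplied by the component correlation matrix (Phi). A sketch assuming output4 from above:

#structure matrix = pattern matrix %*% component correlations
structure_mat <- output4$loadings %*% output4$Phi
round(structure_mat,2) #correlations between the variables and the components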

#combine components with original data
example2$PC1 <- output4$scores[,1]
example2$PC2 <- output4$scores[,2]

#check correlation
cor(example2$PC1,example2$PC2)
## [1] 0.2461769

This matches the component correlation reported by the analysis.

This happens with PCA, but not with factor analysis (factor scores there are only estimates, so their correlations need not match the factor correlations).

output_PC1 <- lm(PC1 ~ age, data=example2)
summary(output_PC1)
## 
## Call:
## lm(formula = PC1 ~ age, data = example2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2138 -0.5493  0.0215  0.5684  2.5042 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.628212   0.089760   18.14   <2e-16 ***
## age         -0.033202   0.001732  -19.17   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8319 on 822 degrees of freedom
## Multiple R-squared:  0.3089, Adjusted R-squared:  0.308 
## F-statistic: 367.3 on 1 and 822 DF,  p-value: < 2.2e-16
output_PC2 <- lm(PC2 ~ age, data=example2)
summary(output_PC2)
## 
## Call:
## lm(formula = PC2 ~ age, data = example2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1209 -0.5769  0.2673  0.7216  1.5826 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.833832   0.103504  -8.056 2.76e-15 ***
## age          0.017003   0.001998   8.512  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9592 on 822 degrees of freedom
## Multiple R-squared:  0.081,  Adjusted R-squared:  0.07988 
## F-statistic: 72.45 on 1 and 822 DF,  p-value: < 2.2e-16

These results are easier to understand:

  • PC1 (fluid skills) is negatively related to age (a difference of 1 year is associated with .03 SD lower scores on the PC)
  • PC2 (crystallized skills) is positively related to age (a difference of 1 year is associated with .017 SD higher scores on the PC)

Yes, PCs are a little harder to interpret than the original variables, but at least it’s only two regressions instead of six.