STAT 56 - Canonical Correlation Analysis

require(ggplot2)

Loading required package: ggplot2

require(GGally)

Loading required package: GGally

Warning: package 'GGally' was built under R version 4.2.3

Registered S3 method overwritten by 'GGally':
  method from   
  +.gg   ggplot2

require(CCA)

Loading required package: CCA

Warning: package 'CCA' was built under R version 4.2.3

Loading required package: fda

Warning: package 'fda' was built under R version 4.2.3

Loading required package: splines

Loading required package: fds

Warning: package 'fds' was built under R version 4.2.3

Loading required package: rainbow

Warning: package 'rainbow' was built under R version 4.2.3

Loading required package: MASS

Loading required package: pcaPP

Warning: package 'pcaPP' was built under R version 4.2.3

Loading required package: RCurl

Loading required package: deSolve

Warning: package 'deSolve' was built under R version 4.2.3


Attaching package: 'fda'

The following object is masked from 'package:graphics':

    matplot

Loading required package: fields

Warning: package 'fields' was built under R version 4.2.3

Loading required package: spam

Warning: package 'spam' was built under R version 4.2.3

Spam version 2.9-1 (2022-08-07) is loaded.
Type 'help( Spam)' or 'demo( spam)' for a short introduction 
and overview of this package.
Help for individual functions is also obtained by adding the
suffix '.spam' to the function name, e.g. 'help( chol.spam)'.


Attaching package: 'spam'

The following objects are masked from 'package:base':

    backsolve, forwardsolve

Loading required package: viridis

Loading required package: viridisLite


Try help(fields) to get started.

Description of the data

mm <- read.csv("https://stats.idre.ucla.edu/stat/data/mmreg.csv")
colnames(mm) <- c("Control", "Concept", "Motivation", "Read", "Write", "Math", 
    "Science", "Sex")
summary(mm)

    Control            Concept            Motivation          Read     
 Min.   :-2.23000   Min.   :-2.620000   Min.   :0.0000   Min.   :28.3  
 1st Qu.:-0.37250   1st Qu.:-0.300000   1st Qu.:0.3300   1st Qu.:44.2  
 Median : 0.21000   Median : 0.030000   Median :0.6700   Median :52.1  
 Mean   : 0.09653   Mean   : 0.004917   Mean   :0.6608   Mean   :51.9  
 3rd Qu.: 0.51000   3rd Qu.: 0.440000   3rd Qu.:1.0000   3rd Qu.:60.1  
 Max.   : 1.36000   Max.   : 1.190000   Max.   :1.0000   Max.   :76.0  
     Write            Math          Science           Sex       
 Min.   :25.50   Min.   :31.80   Min.   :26.00   Min.   :0.000  
 1st Qu.:44.30   1st Qu.:44.50   1st Qu.:44.40   1st Qu.:0.000  
 Median :54.10   Median :51.30   Median :52.60   Median :1.000  
 Mean   :52.38   Mean   :51.85   Mean   :51.76   Mean   :0.545  
 3rd Qu.:59.90   3rd Qu.:58.38   3rd Qu.:58.65   3rd Qu.:1.000  
 Max.   :67.10   Max.   :75.50   Max.   :74.20   Max.   :1.000

Analysis methods you might consider

Below is a list of some analysis methods you may have encountered. Some of the methods listed are quite reasonable while others have either fallen out of favor or have limitations.

Canonical correlation analysis, the focus of this page.
Separate OLS Regressions – You could analyze these data using separate OLS regression analyses for each variable in one set. The OLS regressions will not produce multivariate results and does not report information concerning dimensionality.
Multivariate multiple regression is a reasonable option if you have no interest in dimensionality.

Canonical correlation analysis

xtabs(~Sex, data = mm)

Sex
  0   1 
273 327

psych <- mm[, 1:3]
acad <- mm[, 4:8]

ggpairs(psych)

ggpairs(acad)

# correlations
matcor(psych, acad)

$Xcor
             Control   Concept Motivation
Control    1.0000000 0.1711878  0.2451323
Concept    0.1711878 1.0000000  0.2885707
Motivation 0.2451323 0.2885707  1.0000000

$Ycor
               Read     Write       Math    Science         Sex
Read     1.00000000 0.6285909  0.6792757  0.6906929 -0.04174278
Write    0.62859089 1.0000000  0.6326664  0.5691498  0.24433183
Math     0.67927568 0.6326664  1.0000000  0.6495261 -0.04821830
Science  0.69069291 0.5691498  0.6495261  1.0000000 -0.13818587
Sex     -0.04174278 0.2443318 -0.0482183 -0.1381859  1.00000000

$XYcor
             Control     Concept Motivation        Read      Write       Math
Control    1.0000000  0.17118778 0.24513227  0.37356505 0.35887684  0.3372690
Concept    0.1711878  1.00000000 0.28857075  0.06065584 0.01944856  0.0535977
Motivation 0.2451323  0.28857075 1.00000000  0.21060992 0.25424818  0.1950135
Read       0.3735650  0.06065584 0.21060992  1.00000000 0.62859089  0.6792757
Write      0.3588768  0.01944856 0.25424818  0.62859089 1.00000000  0.6326664
Math       0.3372690  0.05359770 0.19501347  0.67927568 0.63266640  1.0000000
Science    0.3246269  0.06982633 0.11566948  0.69069291 0.56914983  0.6495261
Sex        0.1134108 -0.12595132 0.09810277 -0.04174278 0.24433183 -0.0482183
               Science         Sex
Control     0.32462694  0.11341075
Concept     0.06982633 -0.12595132
Motivation  0.11566948  0.09810277
Read        0.69069291 -0.04174278
Write       0.56914983  0.24433183
Math        0.64952612 -0.04821830
Science     1.00000000 -0.13818587
Sex        -0.13818587  1.00000000

Some Strategies You Might Be Tempted To Try

Before we show how you can analyze this with a canonical correlation analysis, let’s consider some other methods that you might use.

Separate OLS Regressions – You could analyze these data using separate OLS regression analyses for each variable in one set. The OLS regressions will not produce multivariate results and does not report information concerning dimensionality.
Multivariate multiple regression is a reasonable option if you have no interest in dimensionality.

R Canonical Correlation Analysis

cc1 <- cc(psych, acad)

# display the canonical correlations
cc1$cor

[1] 0.4640861 0.1675092 0.1039911

# raw canonical coefficients
cc1[3:4]

$xcoef
                 [,1]       [,2]       [,3]
Control    -1.2538339 -0.6214776 -0.6616896
Concept     0.3513499 -1.1876866  0.8267210
Motivation -1.2624204  2.0272641  2.0002283

$ycoef
                [,1]         [,2]         [,3]
Read    -0.044620600 -0.004910024  0.021380576
Write   -0.035877112  0.042071478  0.091307329
Math    -0.023417185  0.004229478  0.009398182
Science -0.005025152 -0.085162184 -0.109835014
Sex     -0.632119234  1.084642326 -1.794647036

Next, we’ll use comput to compute the loadings of the variables on the canonical dimensions (variates). These loadings are correlations between variables and the canonical variates.

# compute canonical loadings
cc2 <- comput(psych, acad, cc1)

# display canonical loadings
cc2[3:6]

$corr.X.xscores
                  [,1]       [,2]       [,3]
Control    -0.90404631 -0.3896883 -0.1756227
Concept    -0.02084327 -0.7087386  0.7051632
Motivation -0.56715106  0.3508882  0.7451289

$corr.Y.xscores
              [,1]        [,2]        [,3]
Read    -0.3900402 -0.06010654  0.01407661
Write   -0.4067914  0.01086075  0.02647207
Math    -0.3545378 -0.04990916  0.01536585
Science -0.3055607 -0.11336980 -0.02395489
Sex     -0.1689796  0.12645737 -0.05650916

$corr.X.yscores
                   [,1]        [,2]        [,3]
Control    -0.419555307 -0.06527635 -0.01826320
Concept    -0.009673069 -0.11872021  0.07333073
Motivation -0.263206910  0.05877699  0.07748681

$corr.Y.yscores
              [,1]        [,2]       [,3]
Read    -0.8404480 -0.35882541  0.1353635
Write   -0.8765429  0.06483674  0.2545608
Math    -0.7639483 -0.29794884  0.1477611
Science -0.6584139 -0.67679761 -0.2303551
Sex     -0.3641127  0.75492811 -0.5434036

For this particular model there are three canonical dimensions of which only the first two are statistically significant. For statistical test we use R package “CCP”.

# tests of canonical dimensions
rho <- cc1$cor
## Define number of observations, number of variables in first set, and number of variables in the second set.
n <- dim(psych)[1]
p <- length(psych)
q <- length(acad)

## Calculate p-values using the F-approximations of different test statistics:
p.asym(rho, n, p, q, tstat = "Wilks")

Wilks' Lambda, using F-approximation (Rao's F):
              stat    approx df1      df2     p.value
1 to 3:  0.7543611 11.715733  15 1634.653 0.000000000
2 to 3:  0.9614300  2.944459   8 1186.000 0.002905057
3 to 3:  0.9891858  2.164612   3  594.000 0.091092180

p.asym(rho, n, p, q, tstat = "Hotelling")

 Hotelling-Lawley Trace, using F-approximation:
               stat    approx df1  df2     p.value
1 to 3:  0.31429738 12.376333  15 1772 0.000000000
2 to 3:  0.03980175  2.948647   8 1778 0.002806614
3 to 3:  0.01093238  2.167041   3 1784 0.090013176

p.asym(rho, n, p, q, tstat = "Pillai")

 Pillai-Bartlett Trace, using F-approximation:
               stat    approx df1  df2     p.value
1 to 3:  0.25424936 11.000571  15 1782 0.000000000
2 to 3:  0.03887348  2.934093   8 1788 0.002932565
3 to 3:  0.01081416  2.163421   3 1794 0.090440474

p.asym(rho, n, p, q, tstat = "Roy")

 Roy's Largest Root, using F-approximation:
              stat   approx df1 df2 p.value
1 to 1:  0.2153759 32.61008   5 594       0

 F statistic for Roy's Greatest Root is an upper bound.

When the variables in the model have very different standard deviations, the standardized coefficients allow for easier comparisons among the variables. Next, we’ll compute the standardized canonical coefficients.

# standardized psych canonical coefficients diagonal matrix of psych sd's
s1 <- diag(sqrt(diag(cov(psych))))
s1 %*% cc1$xcoef

           [,1]       [,2]       [,3]
[1,] -0.8404196 -0.4165639 -0.4435172
[2,]  0.2478818 -0.8379278  0.5832620
[3,] -0.4326685  0.6948029  0.6855370

# standardized acad canonical coefficients diagonal matrix of acad sd's
s2 <- diag(sqrt(diag(cov(acad))))
s2 %*% cc1$ycoef

            [,1]        [,2]        [,3]
[1,] -0.45080116 -0.04960589  0.21600760
[2,] -0.34895712  0.40920634  0.88809662
[3,] -0.22046662  0.03981942  0.08848141
[4,] -0.04877502 -0.82659938 -1.06607828
[5,] -0.31503962  0.54057096 -0.89442764

Sample Write-Up of the Analysis

There is a lot of variation in the write-ups of canonical correlation analyses. The write-up below is fairly minimal, including only the tests of dimensionality and the standardized coefficients.