Load required packages

library(ggplot2)
library(GGally)
library(CCA)

Load the given data:

mm <- read.csv("https://stats.idre.ucla.edu/stat/data/mmreg.csv")
colnames(mm) <- c("Control", "Concept", "Motivation", "Read", "Write", "Math", 
    "Science", "Sex")
summary(mm)
##     Control            Concept            Motivation          Read     
##  Min.   :-2.23000   Min.   :-2.620000   Min.   :0.0000   Min.   :28.3  
##  1st Qu.:-0.37250   1st Qu.:-0.300000   1st Qu.:0.3300   1st Qu.:44.2  
##  Median : 0.21000   Median : 0.030000   Median :0.6700   Median :52.1  
##  Mean   : 0.09653   Mean   : 0.004917   Mean   :0.6608   Mean   :51.9  
##  3rd Qu.: 0.51000   3rd Qu.: 0.440000   3rd Qu.:1.0000   3rd Qu.:60.1  
##  Max.   : 1.36000   Max.   : 1.190000   Max.   :1.0000   Max.   :76.0  
##      Write            Math          Science           Sex       
##  Min.   :25.50   Min.   :31.80   Min.   :26.00   Min.   :0.000  
##  1st Qu.:44.30   1st Qu.:44.50   1st Qu.:44.40   1st Qu.:0.000  
##  Median :54.10   Median :51.30   Median :52.60   Median :1.000  
##  Mean   :52.38   Mean   :51.85   Mean   :51.76   Mean   :0.545  
##  3rd Qu.:59.90   3rd Qu.:58.38   3rd Qu.:58.65   3rd Qu.:1.000  
##  Max.   :67.10   Max.   :75.50   Max.   :74.20   Max.   :1.000

Canonical Correlation Analysis

xtabs(~Sex, data = mm)
## Sex
##   0   1 
## 273 327
psych <- mm[, 1:3]
acad <- mm[, 4:8]

ggpairs(psych)

ggpairs(acad)

Now, we will look at the correlations within and between the two sets of variables using the matcor function from the CCA package.

# correlations
matcor(psych, acad)
## $Xcor
##              Control   Concept Motivation
## Control    1.0000000 0.1711878  0.2451323
## Concept    0.1711878 1.0000000  0.2885707
## Motivation 0.2451323 0.2885707  1.0000000
## 
## $Ycor
##                Read     Write       Math    Science         Sex
## Read     1.00000000 0.6285909  0.6792757  0.6906929 -0.04174278
## Write    0.62859089 1.0000000  0.6326664  0.5691498  0.24433183
## Math     0.67927568 0.6326664  1.0000000  0.6495261 -0.04821830
## Science  0.69069291 0.5691498  0.6495261  1.0000000 -0.13818587
## Sex     -0.04174278 0.2443318 -0.0482183 -0.1381859  1.00000000
## 
## $XYcor
##              Control     Concept Motivation        Read      Write       Math
## Control    1.0000000  0.17118778 0.24513227  0.37356505 0.35887684  0.3372690
## Concept    0.1711878  1.00000000 0.28857075  0.06065584 0.01944856  0.0535977
## Motivation 0.2451323  0.28857075 1.00000000  0.21060992 0.25424818  0.1950135
## Read       0.3735650  0.06065584 0.21060992  1.00000000 0.62859089  0.6792757
## Write      0.3588768  0.01944856 0.25424818  0.62859089 1.00000000  0.6326664
## Math       0.3372690  0.05359770 0.19501347  0.67927568 0.63266640  1.0000000
## Science    0.3246269  0.06982633 0.11566948  0.69069291 0.56914983  0.6495261
## Sex        0.1134108 -0.12595132 0.09810277 -0.04174278 0.24433183 -0.0482183
##                Science         Sex
## Control     0.32462694  0.11341075
## Concept     0.06982633 -0.12595132
## Motivation  0.11566948  0.09810277
## Read        0.69069291 -0.04174278
## Write       0.56914983  0.24433183
## Math        0.64952612 -0.04821830
## Science     1.00000000 -0.13818587
## Sex        -0.13818587  1.00000000
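
The three blocks returned by matcor are, up to how missing values are handled, ordinary correlation matrices, so they can be cross-checked with base R (a quick sanity check, not part of the original analysis):

# within-set and between-set correlations reproduced with cor()
cor(psych)               # compare with $Xcor
cor(acad)                # compare with $Ycor
cor(cbind(psych, acad))  # compare with $XYcor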

R Canonical Correlation Analysis

Due to the length of the output, we will comment at several points along the way.

cc1 <- cc(psych, acad)

# display the canonical correlations
cc1$cor
## [1] 0.4640861 0.1675092 0.1039911
# raw canonical coefficients
cc1[3:4]
## $xcoef
##                  [,1]       [,2]       [,3]
## Control    -1.2538339 -0.6214776 -0.6616896
## Concept     0.3513499 -1.1876866  0.8267210
## Motivation -1.2624204  2.0272641  2.0002283
## 
## $ycoef
##                 [,1]         [,2]         [,3]
## Read    -0.044620600 -0.004910024  0.021380576
## Write   -0.035877112  0.042071478  0.091307329
## Math    -0.023417185  0.004229478  0.009398182
## Science -0.005025152 -0.085162184 -0.109835014
## Sex     -0.632119234  1.084642326 -1.794647036

The raw canonical coefficients are interpreted in a manner analogous to regression coefficients: for the variable Read, a one-unit increase in reading leads to a 0.0446 decrease in the first canonical variate of set 2 when all of the other variables are held constant. Here is another example: being female leads to a 0.6321 decrease in dimension 1 for the academic set, with the other predictors held constant.
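
To make "canonical variate" concrete: the variate scores are linear combinations of the centered observed variables with these raw coefficients. A minimal sketch, assuming the cc1 object above (cc() stores its own copy of these scores, which this should reproduce):

# canonical variate scores for the academic set:
# centered data multiplied by the raw canonical coefficients
acad_centered <- scale(as.matrix(acad), center = TRUE, scale = FALSE)
acad_scores   <- acad_centered %*% cc1$ycoef
head(acad_scores)  # should match head(cc1$scores$yscores)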

Next, we’ll use comput to compute the loadings of the variables on the canonical dimensions (variates). These loadings are correlations between variables and the canonical variates.

# compute canonical loadings
cc2 <- comput(psych, acad, cc1)

# display canonical loadings
cc2[3:6]
## $corr.X.xscores
##                   [,1]       [,2]       [,3]
## Control    -0.90404631 -0.3896883 -0.1756227
## Concept    -0.02084327 -0.7087386  0.7051632
## Motivation -0.56715106  0.3508882  0.7451289
## 
## $corr.Y.xscores
##               [,1]        [,2]        [,3]
## Read    -0.3900402 -0.06010654  0.01407661
## Write   -0.4067914  0.01086075  0.02647207
## Math    -0.3545378 -0.04990916  0.01536585
## Science -0.3055607 -0.11336980 -0.02395489
## Sex     -0.1689796  0.12645737 -0.05650916
## 
## $corr.X.yscores
##                    [,1]        [,2]        [,3]
## Control    -0.419555307 -0.06527635 -0.01826320
## Concept    -0.009673069 -0.11872021  0.07333073
## Motivation -0.263206910  0.05877699  0.07748681
## 
## $corr.Y.yscores
##               [,1]        [,2]       [,3]
## Read    -0.8404480 -0.35882541  0.1353635
## Write   -0.8765429  0.06483674  0.2545608
## Math    -0.7639483 -0.29794884  0.1477611
## Science -0.6584139 -0.67679761 -0.2303551
## Sex     -0.3641127  0.75492811 -0.5434036

The correlations above, between the observed variables and the canonical variates, are known as canonical loadings. The canonical variates themselves are a type of latent variable.
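
Because the loadings are plain correlations, they can be reproduced directly from the scores returned by comput (a quick cross-check, assuming the cc2 object above):

# loadings of the psych variables on their own canonical variates
cor(psych, cc2$xscores)  # compare with cc2$corr.X.xscores
# cross-loadings of the academic variables on the psych canonical variates
cor(acad, cc2$xscores)   # compare with cc2$corr.Y.xscores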

In general, the number of canonical dimensions equals the number of variables in the smaller set; however, the number of significant dimensions may be smaller still. Canonical dimensions, also known as canonical variates, are latent variables analogous to the factors obtained in factor analysis. For this particular model there are three canonical dimensions, of which only the first two are statistically significant. For the statistical tests we use the R package CCP.

library(CCP)
# tests of canonical dimensions
rho <- cc1$cor
# define the number of observations and the number of variables in each set
n <- nrow(psych)
p <- ncol(psych)
q <- ncol(acad)

# calculate p-values using the F-approximations of different test statistics
p.asym(rho, n, p, q, tstat = "Wilks")
## Wilks' Lambda, using F-approximation (Rao's F):
##               stat    approx df1      df2     p.value
## 1 to 3:  0.7543611 11.715733  15 1634.653 0.000000000
## 2 to 3:  0.9614300  2.944459   8 1186.000 0.002905057
## 3 to 3:  0.9891858  2.164612   3  594.000 0.091092180
p.asym(rho, n, p, q, tstat = "Hotelling")
##  Hotelling-Lawley Trace, using F-approximation:
##                stat    approx df1  df2     p.value
## 1 to 3:  0.31429738 12.376333  15 1772 0.000000000
## 2 to 3:  0.03980175  2.948647   8 1778 0.002806614
## 3 to 3:  0.01093238  2.167041   3 1784 0.090013176
p.asym(rho, n, p, q, tstat = "Pillai")
##  Pillai-Bartlett Trace, using F-approximation:
##                stat    approx df1  df2     p.value
## 1 to 3:  0.25424936 11.000571  15 1782 0.000000000
## 2 to 3:  0.03887348  2.934093   8 1788 0.002932565
## 3 to 3:  0.01081416  2.163421   3 1794 0.090440474
p.asym(rho, n, p, q, tstat = "Roy")
##  Roy's Largest Root, using F-approximation:
##               stat   approx df1 df2 p.value
## 1 to 1:  0.2153759 32.61008   5 594       0
## 
##  F statistic for Roy's Greatest Root is an upper bound.

As shown in the tables above, the first test of the canonical dimensions tests whether all three dimensions together are significant (they are, F = 11.72); the next test tests whether dimensions 2 and 3 combined are significant (they are, F = 2.94); and the last test tests whether dimension 3, by itself, is significant (it is not). Therefore dimensions 1 and 2 are each significant, while dimension 3 is not.
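
The Wilks' Lambda statistics follow directly from the canonical correlations: the statistic for dimensions k through 3 is the product of (1 - rho_i^2) over i = k, ..., 3. A quick check against the 0.754, 0.961, and 0.989 values in the first table:

# Wilks' Lambda for dimensions k..3 computed from the canonical correlations
rev(cumprod(rev(1 - rho^2)))  # compare with the 'stat' column of the Wilks table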

When the variables in the model have very different standard deviations, the standardized coefficients allow for easier comparisons among the variables. Next, we’ll compute the standardized canonical coefficients.

# standardized psych canonical coefficients:
# diagonal matrix of psych standard deviations times the raw coefficients
s1 <- diag(sqrt(diag(cov(psych))))
s1 %*% cc1$xcoef
##            [,1]       [,2]       [,3]
## [1,] -0.8404196 -0.4165639 -0.4435172
## [2,]  0.2478818 -0.8379278  0.5832620
## [3,] -0.4326685  0.6948029  0.6855370
# standardized acad canonical coefficients:
# diagonal matrix of acad standard deviations times the raw coefficients
s2 <- diag(sqrt(diag(cov(acad))))
s2 %*% cc1$ycoef
##             [,1]        [,2]        [,3]
## [1,] -0.45080116 -0.04960589  0.21600760
## [2,] -0.34895712  0.40920634  0.88809662
## [3,] -0.22046662  0.03981942  0.08848141
## [4,] -0.04877502 -0.82659938 -1.06607828
## [5,] -0.31503962  0.54057096 -0.89442764

The standardized canonical coefficients are interpreted in a manner analogous to standardized regression coefficients. For example, for the variable Read, a one standard deviation increase in reading leads to a 0.45 standard deviation decrease in the score on the first canonical variate for set 2, when the other variables in the model are held constant.
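
An equivalent cross-check is to standardize both sets of variables before running the analysis; the resulting raw coefficients should then match the standardized coefficients above, possibly up to sign flips of whole columns (which do not affect the interpretation). A minimal sketch:

# cc() on standardized data should reproduce the standardized coefficients
cc_std <- cc(as.data.frame(scale(psych)), as.data.frame(scale(acad)))
cc_std$xcoef  # compare with s1 %*% cc1$xcoef
cc_std$ycoef  # compare with s2 %*% cc1$ycoef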