Canonical Correlation Analysis

Canonical Correlation Analysis (CCA) is a multivariate statistical technique used to analyze multiple measurements taken on the same objects simultaneously.

CCA differs from Principal Component Analysis (PCA) in what it relates:

CCA simultaneously correlates several metric dependent variables (DVs) with several metric independent variables (IVs) measured on the same experimental unit.
PCA, on the other hand, reduces the dimensionality of a single dataset by replacing the initial variables with a few linear combinations of them.
CCA is an extension of multiple correlation analysis and is often applicable in the same situations in which multivariate regression methods would be applicable.
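
As a hedged sketch of what "correlating two sets" means, the first canonical correlation can be written as the largest correlation attainable between a linear combination of one set and a linear combination of the other. The symbols below (X and Y for the two centred data matrices, a and b for the weight vectors, and the Sigma matrices for the within-set and between-set covariances) are notation introduced here for illustration and do not appear in the original text.

```latex
\rho_1 \;=\; \max_{a,\,b}\ \operatorname{corr}(Xa,\ Yb)
       \;=\; \max_{a,\,b}\ \frac{a^{\top}\Sigma_{XY}\,b}
                                {\sqrt{a^{\top}\Sigma_{XX}\,a}\;\sqrt{b^{\top}\Sigma_{YY}\,b}}
```

Later canonical correlations solve the same problem, with the added requirement that each new pair of variates is uncorrelated with the pairs already found.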

The purposes of CCA are:

Reducing data dimension by combining the variables in each set into linear combinations.
Finding the correlation between the two sets of variables through those combinations (the canonical variates).

CCA is also invariant to changes in the scale of the variables (the canonical correlations do not change when a variable is rescaled), and it explains more than simple pairwise correlation in a multivariate setting.
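
The scale property can be checked with a small simulation. This is a minimal sketch on made-up data, not the wine data, and it uses base R's stats::cancor() rather than the CCA package so that it runs without extra installs; the variable names and rescaling constants are arbitrary assumptions.

```r
# Simulated data (hypothetical): two variable sets sharing one latent signal z
set.seed(1)
n <- 200
z <- rnorm(n)
X <- cbind(x1 = z + rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
Y <- cbind(y1 = z + rnorm(n), y2 = rnorm(n))

# Canonical correlations on the original scale
rho_raw <- cancor(X, Y)$cor

# Rescale every X column by a different constant (e.g. a change of units)
X_rescaled <- sweep(X, 2, c(1000, 0.01, 5), `*`)
rho_rescaled <- cancor(X_rescaled, Y)$cor

# The two vectors agree up to floating-point error
all.equal(rho_raw, rho_rescaled)
```

Only the canonical correlations are unchanged; the raw coefficients do change with the units, which is why standardized coefficients are useful for comparing variables.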

For the data set, we will use the red wine quality dataset available from the UCI Machine Learning Repository. The data have N = 1599 observations and 12 variables.

We are interested in determining the number of dimensions (canonical variables) that are significant in explaining the association between the two sets of variables.

Five variables were dropped from the data, and the remaining seven were separated into two sets:

Set 1 (ide) consists of 3 variables: chlorides, free.sulfur.dioxide, and total.sulfur.dioxide.
Set 2 (acidity) consists of 4 variables: fixed.acidity, volatile.acidity, citric.acid, and density.
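
The cleaning step itself is not shown, so the following is only a sketch of how the seven columns could be selected by name. It runs on a mock data frame of random numbers; the twelve column names are an assumption about how the UCI file looks after being read into R, and the objects wine, ide_vars, acidity_vars, and trial are hypothetical.

```r
# Mock stand-in for the 12-column red wine data (random values, assumed column names)
set.seed(6)
cols <- c("fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar",
          "chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide", "density",
          "pH", "sulphates", "alcohol", "quality")
wine <- as.data.frame(matrix(runif(10 * length(cols)), ncol = length(cols),
                             dimnames = list(NULL, cols)))

# The two variable sets, selected by name rather than by position
ide_vars <- c("chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide")
acidity_vars <- c("fixed.acidity", "volatile.acidity", "citric.acid", "density")

trial <- wine[, c(ide_vars, acidity_vars)]   # the 7 retained columns
ide <- trial[, ide_vars]
acidity <- trial[, acidity_vars]
```

Selecting by name avoids depending on column order, which positional indexing such as [, 1:3] silently assumes.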

setwd("D:/Class Materials & Work/Summer 2020 practice/Canonical Correlation")

library(lme4)
library(CCA) #facilitates canonical correlation analysis
library(CCP) #facilitates checking the significance of the canonical variates

#Load the data set
pole <- read.csv("trial_data.csv", header = TRUE)

str(pole)

## 'data.frame':    1599 obs. of  7 variables:
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...

#Assign the two variable sets
ide <- pole[, 1:3]

acidity <- pole[, 4:7]

Next, we will use CCA::matcor() to examine associations between and within the variable sets. We can also use CCA::img.matcor() to visualize the matrix.

cormat <- matcor(ide,acidity)

img.matcor(cormat, type = 2)

Next, we extract the correlation coefficients within each variable set.

round(cormat$Ycor, 4)

##                  fixed.acidity volatile.acidity citric.acid density
## fixed.acidity           1.0000          -0.2561      0.6717  0.6680
## volatile.acidity       -0.2561           1.0000     -0.5525  0.0220
## citric.acid             0.6717          -0.5525      1.0000  0.3649
## density                 0.6680           0.0220      0.3649  1.0000

round(cormat$Xcor, 4)

##                      chlorides free.sulfur.dioxide total.sulfur.dioxide
## chlorides               1.0000              0.0056               0.0474
## free.sulfur.dioxide     0.0056              1.0000               0.6677
## total.sulfur.dioxide    0.0474              0.6677               1.0000

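The associations between the two sets can be read from XYcor, which holds the full correlation matrix of all seven variables; the between-set correlations are its off-diagonal blocks (for example, the chlorides row against the four acidity columns):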

round(cormat$XYcor, 4)

##                      chlorides free.sulfur.dioxide total.sulfur.dioxide
## chlorides               1.0000              0.0056               0.0474
## free.sulfur.dioxide     0.0056              1.0000               0.6677
## total.sulfur.dioxide    0.0474              0.6677               1.0000
## fixed.acidity           0.0937             -0.1538              -0.1132
## volatile.acidity        0.0613             -0.0105               0.0765
## citric.acid             0.2038             -0.0610               0.0355
## density                 0.2006             -0.0219               0.0713
##                      fixed.acidity volatile.acidity citric.acid density
## chlorides                   0.0937           0.0613      0.2038  0.2006
## free.sulfur.dioxide        -0.1538          -0.0105     -0.0610 -0.0219
## total.sulfur.dioxide       -0.1132           0.0765      0.0355  0.0713
## fixed.acidity               1.0000          -0.2561      0.6717  0.6680
## volatile.acidity           -0.2561           1.0000     -0.5525  0.0220
## citric.acid                 0.6717          -0.5525      1.0000  0.3649
## density                     0.6680           0.0220      0.3649  1.0000
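
Because the printed matrix mixes within-set and between-set correlations, it can help to pull out only the between-set block. The sketch below reproduces the layout with base R's cor() on simulated data; the data frames and column names a to g are made up for illustration.

```r
# Simulated stand-ins for the two sets (hypothetical data and names)
set.seed(2)
n <- 100
X <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
Y <- data.frame(d = rnorm(n), e = rnorm(n), f = rnorm(n), g = rnorm(n))

full <- cor(cbind(X, Y))              # same layout as the 7 x 7 matrix printed above
p <- ncol(X)
q <- ncol(Y)
cross <- full[1:p, (p + 1):(p + q)]   # rows: X variables, columns: Y variables

# The block is identical to correlating the two sets directly
all.equal(cross, cor(X, Y))
```

Assuming the layout shown in the output above, the same block for the wine data would be cormat$XYcor[1:3, 4:7].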

Next, we are going to obtain the canonical correlations from which we will then extract the raw canonical coefficients.

#obtaining the canonical correlations
can_cor1 <- cc(ide,acidity)

can_cor1$cor

## [1] 0.45361635 0.20703957 0.06092621

#Extracting the raw canonical coefficients
can_cor1[3:4]

## $xcoef
##                              [,1]         [,2]         [,3]
## chlorides            -15.40227312  6.022271042 -13.39830038
## free.sulfur.dioxide    0.04039044 -0.081896302  -0.09040281
## total.sulfur.dioxide  -0.02578993 -0.004526871   0.03142581
## 
## $ycoef
##                          [,1]         [,2]          [,3]
## fixed.acidity       0.6300078    0.7382125    0.04156297
## volatile.acidity   -3.7269496    2.7250334    5.25358495
## citric.acid        -6.6286891    0.6789206    0.75975488
## density          -382.1226185 -319.7329865 -344.27947229
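
For readers who want to see where these numbers come from, the canonical correlations can also be obtained from the covariance matrices: their squares are the eigenvalues of a product of the within-set and between-set covariance blocks. This is a sketch on simulated data using base R only (stats::cancor() stands in for the CCA package as a reference); it is not a description of how cc() is implemented internally.

```r
# Simulated data (hypothetical): 3 variables in X, 4 in Y, one shared signal
set.seed(3)
n <- 300
z <- rnorm(n)
X <- cbind(z + rnorm(n), rnorm(n), rnorm(n))
Y <- cbind(z + rnorm(n), 0.5 * z + rnorm(n), rnorm(n), rnorm(n))

Sxx <- cov(X)
Syy <- cov(Y)
Sxy <- cov(X, Y)

# Squared canonical correlations are the eigenvalues of Sxx^-1 Sxy Syy^-1 Syx
M <- solve(Sxx, Sxy) %*% solve(Syy, t(Sxy))
rho_eigen <- sqrt(Re(eigen(M)$values))

# Reference values from base R's canonical correlation routine
rho_cancor <- cancor(X, Y)$cor
all.equal(rho_eigen, rho_cancor, tolerance = 1e-6)
```

The eigenvectors of the same matrix give the X-side coefficients up to a scaling constant, which is why different packages can report coefficients on different scales while agreeing on the correlations.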

The interpretation of the raw canonical coefficients is similar to that of linear regression coefficients. For instance, consider the acidity set. The influence of fixed.acidity on the first canonical variate would be interpreted as follows:

A one-unit increase in fixed.acidity results in an increase of 0.63 units in the first canonical variate for the acidity set, when the other variables are held constant.
Similarly, a one-unit increase in volatile.acidity results in an increase of about 2.73 units in the second canonical variate for the acidity set, when the other variables are held constant.
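
The "one-unit increase" reading follows from the canonical variate being a linear combination of the variables. The sketch below illustrates this with the first column of the ycoef output and the first observation shown by str(pole); the density value is the rounded one printed there, and the helper score() is a hypothetical name. Any centring applied to the data cancels in the difference.

```r
# Raw y-coefficients for the first canonical variate, copied from the output above
b1 <- c(fixed.acidity = 0.6300078, volatile.acidity = -3.7269496,
        citric.acid = -6.6286891, density = -382.1226185)

# First observation as displayed by str(pole) (density is rounded there)
wine1 <- c(fixed.acidity = 7.4, volatile.acidity = 0.7,
           citric.acid = 0, density = 0.998)

# Hypothetical helper: uncentred linear combination of the four variables
score <- function(x) sum(b1 * x)

# Raise fixed.acidity by one unit, holding the other variables constant
bumped <- wine1
bumped["fixed.acidity"] <- bumped["fixed.acidity"] + 1

delta <- score(bumped) - score(wine1)
delta   # equals the fixed.acidity coefficient, up to floating-point error
```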

We will use the comput() function from the CCA package to compute the canonical loadings, that is, the correlations between the variables and the canonical variates.

Usually, the number of canonical dimensions is the same as the number of variables in the smaller set. The number of canonical dimensions that are significant in explaining the relationship between the two sets of variables may, however, be smaller than the number of variables in the smaller set. For this example, we have 3 dimensions.

#computes the canonical loadings
can_cor2 <- comput(ide,acidity,can_cor1)

can_cor2[3:6] #displays the canonical loadings

## $corr.X.xscores
##                            [,1]       [,2]       [,3]
## chlorides            -0.7627757  0.2716167 -0.5868540
## free.sulfur.dioxide  -0.1479687 -0.9544958 -0.2589267
## total.sulfur.dioxide -0.6006467 -0.7074330  0.3725079
## 
## $corr.Y.xscores
##                        [,1]       [,2]        [,3]
## fixed.acidity    -0.0368851 0.17516149 -0.03066069
## volatile.acidity -0.1137480 0.01498495  0.05033043
## citric.acid      -0.2036616 0.10471705 -0.03413442
## density          -0.2151756 0.06505414 -0.03208948
## 
## $corr.X.yscores
##                             [,1]       [,2]        [,3]
## chlorides            -0.34600754  0.0562354 -0.03575479
## free.sulfur.dioxide  -0.06712102 -0.1976184 -0.01577542
## total.sulfur.dioxide -0.27246318 -0.1464666  0.02269549
## 
## $corr.Y.yscores
##                         [,1]       [,2]       [,3]
## fixed.acidity    -0.08131343 0.84602907 -0.5032430
## volatile.acidity -0.25075819 0.07237725  0.8260885
## citric.acid      -0.44897316 0.50578277 -0.5602585
## density          -0.47435584 0.31421115 -0.5266942
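
The four loading matrices above are plain correlations between the original variables and the canonical variates (scores). A minimal sketch of that computation, on simulated data and with base R's stats::cancor() in place of cc() and comput(), might look like this; the object names mirror the output labels but are otherwise assumptions.

```r
# Simulated data (hypothetical): 3 variables in X, 4 in Y
set.seed(4)
n <- 250
z <- rnorm(n)
X <- cbind(x1 = z + rnorm(n), x2 = rnorm(n), x3 = z + rnorm(n))
Y <- cbind(y1 = z + rnorm(n), y2 = rnorm(n), y3 = rnorm(n), y4 = rnorm(n))

fit <- cancor(X, Y)
d <- length(fit$cor)   # number of canonical dimensions, here min(3, 4) = 3

# Canonical variates: centred data times the coefficient matrices
x_scores <- scale(X, center = fit$xcenter, scale = FALSE) %*% fit$xcoef[, 1:d]
y_scores <- scale(Y, center = fit$ycenter, scale = FALSE) %*% fit$ycoef[, 1:d]

# Loadings and cross-loadings
corr_X_xscores <- cor(X, x_scores)
corr_Y_xscores <- cor(Y, x_scores)
corr_X_yscores <- cor(X, y_scores)
corr_Y_yscores <- cor(Y, y_scores)

# Sanity check: paired variates correlate at the canonical correlations
all.equal(diag(cor(x_scores, y_scores)), fit$cor, check.attributes = FALSE)
```

A useful consequence is that each cross-loading column is the matching loading column multiplied by that dimension's canonical correlation; in the output above, -0.7628 times 0.4536 gives the -0.3460 reported for chlorides.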

To obtain the statistical significance of the dimensions, we will use the CCP package.

#test of canonical dimensions
rho <- can_cor1$cor

#defining the number of observations, the number of variables in the first set, and the number of variables in the second set
n <- nrow(ide)
p <- ncol(ide)
q <- ncol(acidity)

#Calculating the F approximations using Wilks' statistic
p.asym(rho, n, p, q, tstat="Wilks")

## Wilks' Lambda, using F-approximation (Rao's F):
##               stat    approx df1      df2      p.value
## 1 to 3:  0.7573653 38.878017  12 4212.328 0.000000e+00
## 2 to 3:  0.9535817 12.770396   6 3186.000 2.675637e-14
## 3 to 3:  0.9962880  2.969489   2 1594.000 5.161358e-02

In the above output, the first test determines whether dimensions 1 to 3 combined are significant. Since the p-value is less than the alpha = 0.05 level of significance, the three dimensions taken together are statistically significant (F(12, 4212.33) = 38.88, p < .001), which means that at least the first canonical correlation is nonzero.

Similarly, the second test determines the significance of dimensions 2 and 3 combined. Since p < 0.05 (F(6, 3186) = 12.77, p < .001), these dimensions are jointly statistically significant.

The last test determines the significance of the third dimension alone, which is not statistically significant because p > 0.05 (F(2, 1594) = 2.97, p = .052). Only the first two canonical dimensions are therefore significant.
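
The table above can be reproduced by hand from the three canonical correlations, which makes the sequential logic of the tests easier to follow. The sketch below uses the standard Wilks' Lambda and Rao's F approximation formulas in base R; it is a reconstruction for illustration rather than the CCP source code, and the object res is a hypothetical name.

```r
rho <- c(0.45361635, 0.20703957, 0.06092621)  # canonical correlations from above
n <- 1599  # observations
p <- 3     # variables in the first set
q <- 4     # variables in the second set
m <- min(p, q)

res <- t(sapply(seq_len(m), function(i) {
  k <- i - 1                                  # dimensions already removed
  lambda <- prod(1 - rho[i:m]^2)              # Wilks' Lambda for dimensions i to m
  pk <- p - k
  qk <- q - k
  df1 <- pk * qk
  nu <- if (pk == 1 || qk == 1) 1 else sqrt((df1^2 - 4) / (pk^2 + qk^2 - 5))
  df2 <- (n - 1.5 - (p + q) / 2) * nu - df1 / 2 + 1
  w <- lambda^(1 / nu)
  f <- (1 - w) / w * df2 / df1                # Rao's F approximation
  c(stat = lambda, approx = f, df1 = df1, df2 = df2,
    p.value = pf(f, df1, df2, lower.tail = FALSE))
}))
round(res, 4)
```

Each row drops the largest remaining canonical correlation, so row i asks whether anything is left to explain after the first i - 1 dimensions have been accounted for.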

Calculating standardized canonical coefficients using R

When the variables have very different standard deviations, the best practice is often to standardize the canonical coefficients, which eases comparisons among variables.

The first set of canonical coefficients (ide) can be standardized as follows; the rows of the output follow the order chlorides, free.sulfur.dioxide, total.sulfur.dioxide:

std_coef1 <- diag(sqrt(diag(cov(ide))))

std_coef1 %*% can_cor1$xcoef

##            [,1]       [,2]       [,3]
## [1,] -0.7249126  0.2834400 -0.6305951
## [2,]  0.4224903 -0.8566482 -0.9456276
## [3,] -0.8483682 -0.1489129  1.0337622

The second set of canonical coefficients (acidity) can be standardized as follows; the rows of the output follow the order fixed.acidity, volatile.acidity, citric.acid, density:

std_coef2 <- diag(sqrt(diag(cov(acidity))))

std_coef2 %*% can_cor1$ycoef

##            [,1]       [,2]        [,3]
## [1,]  1.0969042  1.2852991  0.07236513
## [2,] -0.6673465  0.4879437  0.94070537
## [3,] -1.2912762  0.1322545  0.14800111
## [4,] -0.7211930 -0.6034429 -0.64977034
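
Multiplying the raw coefficients by the standard deviations is equivalent to running the analysis on z-scored variables. The sketch below checks this on simulated data with base R's stats::cancor(); note that cancor() scales its coefficients differently from cc(), so only the equivalence, not the absolute values, carries over. All names and constants are assumptions for illustration.

```r
# Simulated data (hypothetical) with deliberately different scales
set.seed(5)
n <- 200
z <- rnorm(n)
X <- cbind(x1 = 100 * (z + rnorm(n)), x2 = 0.01 * rnorm(n), x3 = rnorm(n))
Y <- cbind(y1 = z + rnorm(n), y2 = 50 * rnorm(n))

raw <- cancor(X, Y)$xcoef

# Same recipe as above: diagonal matrix of standard deviations times raw coefficients
std_by_hand <- diag(apply(X, 2, sd)) %*% raw

# Alternative: canonical correlation analysis on z-scored X
std_by_scaling <- cancor(scale(X), Y)$xcoef

# Compare magnitudes (signs of canonical coefficients are arbitrary in general)
all.equal(abs(std_by_hand), abs(std_by_scaling), check.attributes = FALSE)
```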

Interpreting the standardized canonical coefficients

The interpretation of standardized canonical coefficients follows that of standardized regression coefficients.

For instance, in the acidity set of variables, a one standard deviation increase in fixed.acidity would result in a 1.10 standard deviation increase in the first canonical variate when all the other variables in the model are held constant.

Note that when using canonical correlation, the sample size should be large, and the variables should follow a multivariate normal distribution, as the method assumes. Lastly, as in other research, the sample should be representative of the population.

Reference: https://medium.com/analytics-vidhya/canonical-correlation-analysis-cca-in-r-a-non-technical-primer-b67d9bdeb9dd

Tarid Wongvorachan

November 24th, 2020
