Canonical Correlation Analysis (CCA) is a multivariate statistical technique used to analyze the association between two sets of variables measured on the same objects.

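CCA differs from Principal Component Analysis (PCA) in what it summarizes: PCA finds linear combinations that capture the variance within a single set of variables, whereas CCA finds pairs of linear combinations, one from each of two sets, that are maximally correlated with each other.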

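CCA serves two main purposes: data reduction, by describing the covariation between two sets of variables with a small number of linear combinations, and data interpretation, by identifying the canonical variates that best explain that covariation.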

The canonical correlations are also invariant to changes in the scale of the variables, and CCA explains more than simple pairwise correlations can in a multivariate setting.
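The scale invariance can be checked with a minimal sketch. It uses base R's cancor() on simulated data (both are assumptions made for illustration, not the wine data or the CCA package used later) and rescales one column in each set:

```r
# Base R only; simulated data stands in for real measurements.
set.seed(1)
n <- 200
x <- matrix(rnorm(n * 2), n, 2)
y <- cbind(x[, 1] + rnorm(n), x[, 2] - x[, 1] + rnorm(n), rnorm(n))

rho_original <- cancor(x, y)$cor

# Rescale one column of each set, as a change of measurement units would.
x_rescaled <- x
x_rescaled[, 1] <- x_rescaled[, 1] * 1000
y_rescaled <- y
y_rescaled[, 2] <- y_rescaled[, 2] / 50

rho_rescaled <- cancor(x_rescaled, y_rescaled)$cor

# The canonical correlations should agree up to floating-point error.
print(all.equal(rho_original, rho_rescaled))
```

The coefficients themselves do change with the units, which is why standardized coefficients are often reported alongside the raw ones.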

For the data set, we use the red wine quality data set available from the UCI Machine Learning Repository. It has N = 1599 observations and 12 variables.

We are interested in determining the number of dimensions (canonical variables) that are significant in explaining the association between the two sets of variables.

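Five variables were dropped from the data, and the remaining seven were split into two sets: ide (chlorides, free.sulfur.dioxide, total.sulfur.dioxide) and acidity (fixed.acidity, volatile.acidity, citric.acid, density).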
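The cleaning step is not shown in this tutorial. As a rough sketch, a file like trial_data.csv could be produced as follows; the helper name select_cca_columns and the input file name winequality-red.csv are assumptions made for illustration:

```r
# Hypothetical helper: keep the seven variables used in this tutorial,
# in the same column order as the str() output of the loaded data.
select_cca_columns <- function(wine) {
  keep <- c("chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide",
            "fixed.acidity", "volatile.acidity", "citric.acid", "density")
  missing_cols <- setdiff(keep, names(wine))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  wine[, keep]
}

# Assumed usage with the semicolon-separated UCI file (file name is an assumption):
# wine <- read.csv("winequality-red.csv", sep = ";")
# write.csv(select_cca_columns(wine), "trial_data.csv", row.names = FALSE)
```

Selecting columns by name rather than by position keeps the later pole[, 1:3] and pole[, 4:7] indexing valid even if the source file orders its columns differently.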

setwd("D:/Class Materials & Work/Summer 2020 practice/Canonical Correlation")

library(lme4)
library(CCA) #facilitates canonical correlation analysis
library(CCP) #facilitates checking the significance of the canonical variates

#Load the data set
pole <- read.csv("trial_data.csv", header = TRUE)

str(pole)
## 'data.frame':    1599 obs. of  7 variables:
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
#Assigning set
ide <- pole[, 1:3]

acidity <- pole[, 4:7]

Next, we will use CCA::matcor() to examine associations between and within the variable sets. We can also use CCA::img.matcor() to visualize the matrix.

cormat <- matcor(ide,acidity)

img.matcor(cormat, type = 2)

Next, we extract the correlation coefficients within each variable set.

round(cormat$Ycor, 4)
##                  fixed.acidity volatile.acidity citric.acid density
## fixed.acidity           1.0000          -0.2561      0.6717  0.6680
## volatile.acidity       -0.2561           1.0000     -0.5525  0.0220
## citric.acid             0.6717          -0.5525      1.0000  0.3649
## density                 0.6680           0.0220      0.3649  1.0000
round(cormat$Xcor, 4)
##                      chlorides free.sulfur.dioxide total.sulfur.dioxide
## chlorides               1.0000              0.0056               0.0474
## free.sulfur.dioxide     0.0056              1.0000               0.6677
## total.sulfur.dioxide    0.0474              0.6677               1.0000

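The associations between the two sets can be read from cormat$XYcor. It holds the correlation matrix of all seven variables, so the between-set correlations are its off-diagonal blocks: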

round(cormat$XYcor, 4)
##                      chlorides free.sulfur.dioxide total.sulfur.dioxide
## chlorides               1.0000              0.0056               0.0474
## free.sulfur.dioxide     0.0056              1.0000               0.6677
## total.sulfur.dioxide    0.0474              0.6677               1.0000
## fixed.acidity           0.0937             -0.1538              -0.1132
## volatile.acidity        0.0613             -0.0105               0.0765
## citric.acid             0.2038             -0.0610               0.0355
## density                 0.2006             -0.0219               0.0713
##                      fixed.acidity volatile.acidity citric.acid density
## chlorides                   0.0937           0.0613      0.2038  0.2006
## free.sulfur.dioxide        -0.1538          -0.0105     -0.0610 -0.0219
## total.sulfur.dioxide       -0.1132           0.0765      0.0355  0.0713
## fixed.acidity               1.0000          -0.2561      0.6717  0.6680
## volatile.acidity           -0.2561           1.0000     -0.5525  0.0220
## citric.acid                 0.6717          -0.5525      1.0000  0.3649
## density                     0.6680           0.0220      0.3649  1.0000
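Because the full matrix repeats the within-set correlations, it can help to pull out only the between-set block. The sketch below uses base R and simulated data (an assumption, so that it runs without the wine file); with the objects from this tutorial the equivalent would presumably be cormat$XYcor[1:3, 4:7]:

```r
set.seed(2)
n <- 100
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
Y <- data.frame(y1 = X$x1 + rnorm(n), y2 = rnorm(n),
                y3 = X$x3 + rnorm(n), y4 = rnorm(n))

# Correlation matrix of all variables, first set followed by second set.
full_cor <- cor(cbind(X, Y))
p <- ncol(X)
q <- ncol(Y)

# Rows = first set, columns = second set.
cross_block <- full_cor[1:p, (p + 1):(p + q)]
print(round(cross_block, 4))

# cor(X, Y) computes the same block directly.
print(all.equal(cross_block, cor(X, Y)))
```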

Next, we are going to obtain the canonical correlations from which we will then extract the raw canonical coefficients.

#obtaining the canonical correlations
can_cor1 <- cc(ide,acidity)

can_cor1$cor
## [1] 0.45361635 0.20703957 0.06092621
#Extracting raw canonical coefficients
can_cor1[3:4]
## $xcoef
##                              [,1]         [,2]         [,3]
## chlorides            -15.40227312  6.022271042 -13.39830038
## free.sulfur.dioxide    0.04039044 -0.081896302  -0.09040281
## total.sulfur.dioxide  -0.02578993 -0.004526871   0.03142581
## 
## $ycoef
##                          [,1]         [,2]          [,3]
## fixed.acidity       0.6300078    0.7382125    0.04156297
## volatile.acidity   -3.7269496    2.7250334    5.25358495
## citric.acid        -6.6286891    0.6789206    0.75975488
## density          -382.1226185 -319.7329865 -344.27947229
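To make the role of the raw coefficients concrete: each canonical variate is the centred data multiplied by one column of coefficients, and the correlation between the i-th pair of variates is the i-th canonical correlation. The sketch below illustrates this with base R's cancor() on simulated data (both are assumptions; cancor() may scale its coefficients differently from CCA::cc(), but the correlations between variates are unaffected by that scaling):

```r
set.seed(3)
n <- 300
X <- matrix(rnorm(n * 3), n, 3)
Y <- cbind(X[, 1] + rnorm(n), X[, 2] + rnorm(n), rnorm(n), rnorm(n))

fit <- cancor(X, Y)   # base R analogue of CCA::cc()

# Canonical variates = centred data times the raw coefficients.
U <- scale(X, center = fit$xcenter, scale = FALSE) %*% fit$xcoef
V <- scale(Y, center = fit$ycenter, scale = FALSE) %*% fit$ycoef

# One canonical correlation per dimension; here min(3, 4) = 3 of them.
k <- length(fit$cor)
pair_cor <- sapply(seq_len(k), function(i) cor(U[, i], V[, i]))
print(round(cbind(canonical = fit$cor, from_scores = pair_cor), 4))
```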

The raw canonical coefficients are interpreted much like linear regression coefficients. For instance, in the acidity set, a one-unit increase in fixed.acidity is associated with a 0.63 increase in the first canonical variate of that set when the other variables are held constant.

We will use CCA::comput() to compute the correlations between the variables and the canonical variates, that is, the loadings of the variables on the canonical dimensions.

The number of canonical dimensions equals the number of variables in the smaller set. The number of dimensions that are significant in explaining the relationship between the two sets may, however, be smaller than that. In this example, we have 3 dimensions.

#computes the canonical loadings
can_cor2 <- comput(ide,acidity,can_cor1)

can_cor2[3:6] #displays the canonical loadings
## $corr.X.xscores
##                            [,1]       [,2]       [,3]
## chlorides            -0.7627757  0.2716167 -0.5868540
## free.sulfur.dioxide  -0.1479687 -0.9544958 -0.2589267
## total.sulfur.dioxide -0.6006467 -0.7074330  0.3725079
## 
## $corr.Y.xscores
##                        [,1]       [,2]        [,3]
## fixed.acidity    -0.0368851 0.17516149 -0.03066069
## volatile.acidity -0.1137480 0.01498495  0.05033043
## citric.acid      -0.2036616 0.10471705 -0.03413442
## density          -0.2151756 0.06505414 -0.03208948
## 
## $corr.X.yscores
##                             [,1]       [,2]        [,3]
## chlorides            -0.34600754  0.0562354 -0.03575479
## free.sulfur.dioxide  -0.06712102 -0.1976184 -0.01577542
## total.sulfur.dioxide -0.27246318 -0.1464666  0.02269549
## 
## $corr.Y.yscores
##                         [,1]       [,2]       [,3]
## fixed.acidity    -0.08131343 0.84602907 -0.5032430
## volatile.acidity -0.25075819 0.07237725  0.8260885
## citric.acid      -0.44897316 0.50578277 -0.5602585
## density          -0.47435584 0.31421115 -0.5266942
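The four blocks above are related: a loading is the correlation between a variable and a canonical variate of its own set, and each column of cross-loadings equals the matching column of loadings multiplied by that dimension's canonical correlation (for example, -0.7628 x 0.4536 gives the -0.3460 shown for chlorides). The sketch below checks this relationship with base R's cancor() on simulated data, both of which are assumptions made for illustration:

```r
set.seed(4)
n <- 300
X <- matrix(rnorm(n * 3), n, 3, dimnames = list(NULL, c("x1", "x2", "x3")))
Y <- cbind(y1 = X[, 1] + rnorm(n), y2 = X[, 2] + rnorm(n),
           y3 = rnorm(n), y4 = rnorm(n))

fit <- cancor(X, Y)
k <- length(fit$cor)

# Canonical variates for the first k dimensions of each set.
U <- scale(X, center = fit$xcenter, scale = FALSE) %*% fit$xcoef[, 1:k]
V <- scale(Y, center = fit$ycenter, scale = FALSE) %*% fit$ycoef[, 1:k]

loadings_x       <- cor(X, U)  # analogue of corr.X.xscores
cross_loadings_x <- cor(X, V)  # analogue of corr.X.yscores

# Each cross-loading column is the loading column times its canonical correlation.
expected <- sweep(loadings_x, 2, fit$cor, `*`)
print(all.equal(cross_loadings_x, expected, check.attributes = FALSE))
```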

To test the statistical significance of the dimensions, we use the CCP package.

#test of canonical dimensions
rho <- can_cor1$cor

#defining the number of observations, no of variables in first set and number of variables in second set
n <- dim(ide)[1]
p <- length(ide)
q <- length(acidity)

#Calculating the F approximations using Wilks' statistic
p.asym(rho, n, p, q, tstat="Wilks")
## Wilks' Lambda, using F-approximation (Rao's F):
##               stat    approx df1      df2      p.value
## 1 to 3:  0.7573653 38.878017  12 4212.328 0.000000e+00
## 2 to 3:  0.9535817 12.770396   6 3186.000 2.675637e-14
## 3 to 3:  0.9962880  2.969489   2 1594.000 5.161358e-02

In the output above, the first test asks whether dimensions 1 to 3 combined are significant. Since the p-value is below the alpha = 0.05 level of significance, the three dimensions taken together are statistically significant (F = 38.88, p < .001), which means that at least the first canonical correlation is non-zero.

Similarly, the second test assesses dimensions 2 and 3 combined. Since p < 0.05, they are jointly statistically significant (F = 12.77, p < .001).

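The last test assesses the third dimension on its own, which is not statistically significant at the 0.05 level (F = 2.97, p = 0.052). Two canonical dimensions are therefore significant in explaining the association between the two sets of variables.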
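The statistics in this table can be reconstructed by hand, which shows what each row tests. Wilks' Lambda for dimensions k to 3 is the product of (1 - rho^2) over those dimensions, and Rao's F approximation turns it into an F statistic. The function below is a sketch of that textbook formula in base R, not the CCP source code; the name wilks_rao is hypothetical, and the inputs are the canonical correlations and dimensions reported above:

```r
# Canonical correlations and dimensions as reported earlier in this tutorial.
rho <- c(0.45361635, 0.20703957, 0.06092621)
n <- 1599; p <- 3; q <- 4

# Hypothetical helper: Wilks' Lambda with Rao's F approximation per row.
wilks_rao <- function(rho, n, p, q) {
  m <- length(rho)
  w <- n - 3 / 2 - (p + q) / 2
  rows <- lapply(seq_len(m), function(k) {
    pk <- p - k + 1                      # variables left in set 1
    qk <- q - k + 1                      # variables left in set 2
    lambda <- prod(1 - rho[k:m]^2)       # Wilks' Lambda for dimensions k..m
    denom <- pk^2 + qk^2 - 5
    s <- if (denom > 0) sqrt((pk^2 * qk^2 - 4) / denom) else 1
    df1 <- pk * qk
    df2 <- w * s - df1 / 2 + 1
    f <- (1 - lambda^(1 / s)) / lambda^(1 / s) * df2 / df1
    data.frame(stat = lambda, approx = f, df1 = df1, df2 = df2,
               p.value = pf(f, df1, df2, lower.tail = FALSE))
  })
  do.call(rbind, rows)
}

res <- wilks_rao(rho, n, p, q)
print(res)
```

If the formula matches the one CCP uses, the printed values should agree with the p.asym() output up to rounding.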

Calculating standardized canonical coefficients using R

When the variables have very different standard deviations, it is usually best to standardize the canonical coefficients, which makes them easier to compare across variables.

The first set of canonical coefficients (ide) can be standardized as follows:

std_coef1 <- diag(sqrt(diag(cov(ide))))

std_coef1 %*% can_cor1$xcoef
##            [,1]       [,2]       [,3]
## [1,] -0.7249126  0.2834400 -0.6305951
## [2,]  0.4224903 -0.8566482 -0.9456276
## [3,] -0.8483682 -0.1489129  1.0337622

The second set of canonical coefficients (acidity) can be standardized as follows:

std_coef2 <- diag(sqrt(diag(cov(acidity))))

std_coef2 %*% can_cor1$ycoef
##            [,1]       [,2]        [,3]
## [1,]  1.0969042  1.2852991  0.07236513
## [2,] -0.6673465  0.4879437  0.94070537
## [3,] -1.2912762  0.1322545  0.14800111
## [4,] -0.7211930 -0.6034429 -0.64977034
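Multiplying by diag() drops the variable names from the rows, as the [1,], [2,] labels above show. The sketch below is one way to keep them, and it also checks that the standardized coefficients match (up to sign) those from a CCA run on z-scored data. It uses base R's cancor() and simulated data, both assumptions made for illustration:

```r
set.seed(5)
n <- 250
X <- data.frame(a = rnorm(n, sd = 0.05), b = rnorm(n, sd = 10),
                c = rnorm(n, sd = 30))
Y <- data.frame(u = 100 * X$a + rnorm(n), v = X$b / 10 + rnorm(n),
                w = rnorm(n), z = rnorm(n, sd = 0.002))

raw <- cancor(X, Y)

# Multiply row i of the raw coefficients by the standard deviation of variable i.
# Recycling a vector of length nrow does this and keeps the row names.
std_x <- raw$xcoef * sapply(X, sd)
print(round(std_x, 4))

# The same coefficients (up to sign) come from a CCA on z-scored data.
scaled <- cancor(scale(X), scale(Y))
print(all.equal(abs(std_x), abs(scaled$xcoef), check.attributes = FALSE))
```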

Interpreting the standardized canonical coefficients

Standardized canonical coefficients are interpreted in the same way as standardized regression coefficients.

For instance, in the acidity set, a one standard deviation increase in fixed.acidity is associated with a 1.097 standard deviation increase in the first canonical variate when the other variables in the set are held constant.

Note that canonical correlation analysis requires a large sample, and its significance tests assume that the variables follow a multivariate normal distribution. Finally, as in any research, the sample should be representative of the population.
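One common informal check of multivariate normality, not part of the original analysis, compares squared Mahalanobis distances with chi-square quantiles. The sketch below uses base R and simulated data as a stand-in; with the wine data, X would presumably be as.matrix(pole):

```r
set.seed(6)
n <- 500
X <- matrix(rnorm(n * 4), n, 4)   # stand-in for a numeric data set

# Squared Mahalanobis distance of each row from the sample mean.
d2 <- mahalanobis(X, colMeans(X), cov(X))

# Under multivariate normality, d2 is approximately chi-square with ncol(X) df.
theoretical <- qchisq(ppoints(n), df = ncol(X))

# Points far from the line y = x suggest departures from normality.
plot(theoretical, sort(d2), xlab = "Chi-square quantiles",
     ylab = "Squared Mahalanobis distance")
abline(0, 1)
```

This is a visual diagnostic only; it does not replace a formal test, and with large samples even small departures from the line may be visible.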

Reference: https://medium.com/analytics-vidhya/canonical-correlation-analysis-cca-in-r-a-non-technical-primer-b67d9bdeb9dd