Question: Canonical correlation analysis by utilizing suitable software

With \(X^{(1)}\) and \(X^{(2)}\) denoting the primary and secondary variables, respectively, the sample covariance matrix is:

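| X1(1) | X2(1) | X3(1) | X1(2) | X2(2) |
|:--------:|:--------:|:--------:|:------:|:-------:|
| 1106.000 | 396.700 | 108.400 | 0.787 | 26.230 |
| 396.700 | 2382.000 | 1143.000 | -0.214 | -23.960 |
| 108.400 | 1143.000 | 2136.000 | 2.189 | -20.840 |
| 0.787 | -0.214 | 2.189 | 0.016 | 0.216 |
| 26.230 | -23.960 | -20.840 | 0.216 | 70.560 |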

The canonical correlations for this matrix are:

## [1] 0.5173449 0.1255082 0.0000000
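The canonical correlations can be reproduced from the covariance matrix alone. The following is a minimal sketch in base R, not the code used for the output above: it works with the smaller q x q matrix \(S_{22}^{-1/2} S_{21} S_{11}^{-1} S_{12} S_{22}^{-1/2}\), and the helper name `inv_sqrt` is an assumption introduced here for illustration. Since at most min(p, q) = 2 canonical correlations can be nonzero, the third value reported above is zero by construction.

```r
# Sample covariance matrix as printed above (primary set first, then secondary set)
S <- matrix(c(
  1106.000,  396.700,  108.400,  0.787,  26.230,
   396.700, 2382.000, 1143.000, -0.214, -23.960,
   108.400, 1143.000, 2136.000,  2.189, -20.840,
     0.787,   -0.214,    2.189,  0.016,   0.216,
    26.230,  -23.960,  -20.840,  0.216,  70.560
), nrow = 5, byrow = TRUE)

p <- 3
q <- 2
S11 <- S[1:p, 1:p]
S22 <- S[(p + 1):(p + q), (p + 1):(p + q)]
S12 <- S[1:p, (p + 1):(p + q)]
S21 <- t(S12)

# Hypothetical helper: inverse symmetric square root via the spectral decomposition
inv_sqrt <- function(A) {
  e <- eigen(A, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
}

# q x q matrix whose eigenvalues are the squared canonical correlations
M <- inv_sqrt(S22) %*% S21 %*% solve(S11) %*% S12 %*% inv_sqrt(S22)
M <- (M + t(M)) / 2  # remove round-off asymmetry before the symmetric eigen solver

rho <- sqrt(eigen(M, symmetric = TRUE)$values)
print(rho)
```

Working with the q x q matrix avoids taking the square root of a numerically tiny (possibly negative) third eigenvalue.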

a) Test at the 5% level if there is any association between the groups of variables.

To test for any association between the two groups of variables, test the hypothesis

\(H_0: \Sigma_{12} = 0\) \((\rho_1^* = \rho_2^* = \cdots = \rho_p^* = 0)\)

using the large-sample statistic of formula (10-39) (Chapter 10, Int. Ed.). Reject \(H_0\) at level \(\alpha\) if

\(-\left(n-1-\tfrac{1}{2}(p+q+1)\right)\ln\prod_{i=1}^{p}\left(1-\rho_i^{*2}\right) \geq \chi^2_{pq}(\alpha)\)

With p = 3, q = 2 and n = 46, the result is:

## 13.7494849041102 >= 12.591587243744

The null hypothesis is rejected, so at least the first canonical correlation is nonzero. It is feasible to pursue a canonical analysis.
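The test in part a) can be checked directly from the reported canonical correlations. This is a small sketch in base R; the variable names `stat`, `crit` and `p_value` are chosen here for illustration, and the p-value is an addition that the original solution does not report.

```r
n <- 46
p <- 3
q <- 2
rho <- c(0.5173449, 0.1255082, 0)  # canonical correlations reported above

# Large-sample statistic for H0: Sigma12 = 0
stat <- -(n - 1 - 0.5 * (p + q + 1)) * sum(log(1 - rho^2))

# Upper 5% point of the chi-square distribution with p*q degrees of freedom
crit <- qchisq(0.95, df = p * q)

# Corresponding p-value
p_value <- pchisq(stat, df = p * q, lower.tail = FALSE)

cat(stat, ">=", crit, "\n")
cat("p-value:", p_value, "\n")
```

Because the statistic only just exceeds the critical value, the rejection at the 5% level is not a strong one.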

b) How many pairs of canonical variates are significant?

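To find out how many pairs of canonical variates are significant, test sequentially whether the canonical correlations beyond the first k are zero. For k = 1 the hypothesis is

\(H_0^{(1)}: \rho_1^* \neq 0,\ \rho_2^* = \cdots = \rho_p^* = 0\)

which is tested with formula (10-40) (Chapter 10, Int. Ed.). Reject \(H_0^{(k)}\) at level \(\alpha\) if

\(-\left(n-1-\tfrac{1}{2}(p+q+1)\right)\ln\prod_{i=k+1}^{p}\left(1-\rho_i^{*2}\right) \geq \chi^2_{(p-k)(q-k)}(\alpha)\)

With k = 1 the degrees of freedom are (p-1)(q-1) = 2, and the result is: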

## 0.666863241993919 >= 5.99146454710798

The null hypothesis cannot be rejected, so only the first pair of canonical variates is significant.
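The sequential procedure of parts a) and b) can be wrapped in one function. The sketch below is an illustration under the same large-sample approximation; the function name `bartlett_seq` and its return layout are assumptions made here, not part of the original solution. Row k = 0 is the overall test from part a), row k = 1 is the test used above.

```r
# Sequential large-sample tests: for each k, H0 says the canonical correlations
# beyond the first k are all zero.
bartlett_seq <- function(rho, n, p, q, alpha = 0.05) {
  m <- length(rho)  # number of canonical correlations supplied, min(p, q)
  k <- 0:(m - 1)
  stat <- sapply(k, function(j) {
    -(n - 1 - 0.5 * (p + q + 1)) * sum(log(1 - rho[(j + 1):m]^2))
  })
  df <- (p - k) * (q - k)
  crit <- qchisq(1 - alpha, df)
  data.frame(k = k, statistic = stat, df = df, critical = crit, reject = stat > crit)
}

res <- bartlett_seq(c(0.5173449, 0.1255082), n = 46, p = 3, q = 2)
print(res)

# Number of significant pairs: count rejections until the first non-rejection
n_significant <- sum(cumprod(res$reject))
print(n_significant)
```

Counting with `cumprod` stops at the first non-rejected hypothesis, which matches the sequential logic: later pairs are not examined once a test fails to reject.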

c) Interpret the “significant” squared canonical correlations.

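The canonical correlations here (0.5173449, 0.1255082, 0) are also the multiple correlation coefficients of \(U_k\) with \(X^{(2)}\), or of \(V_k\) with \(X^{(1)}\). The kth squared canonical correlation \(\rho_k^{*2}\) is therefore the proportion of the variance of the canonical variate \(U_k\) explained by the set \(X^{(2)}\), or of \(V_k\) explained by the set \(X^{(1)}\). Here only \(\rho_1^{*2}\) is significant; it equals \(0.5173449^2 \approx 0.2676\), so about 27% of the variance of \(U_1\) is explained by the secondary variables. For comparison, the second, non-significant squared canonical correlation is

## [1] 0.01575231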

It is also sometimes regarded as a measure of set overlap.
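As a quick arithmetic check, the squared canonical correlations follow directly from the values reported at the top. In this short sketch (the variable name `rho_sq` is chosen here), the first entry, about 0.2676, belongs to the significant pair, and the second entry is 0.01575231.

```r
rho <- c(0.5173449, 0.1255082)  # nonzero canonical correlations reported above

# Squared canonical correlations: share of variance of U_k explained by X(2)
# (equivalently, of V_k explained by X(1))
rho_sq <- rho^2
print(round(rho_sq, 4))
```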

d) Interpret the canonical variates by using the coefficients and suitable correlations.

The raw canonical coefficients are obtained via the following equations (p.545) (Chapter 10 Int. Ed.)

\(a_k = \Sigma_{11}^{-1/2} e_k\)

\(b_k = \Sigma_{22}^{-1/2} f_k\)

## raw coefficients for the canonical values of S11 
##  0.01310065 -0.01443825 0.02339972
## raw coefficients for the canonical values of S22 
##  -8.065575 0.01915905

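The values above are calculated from the eigenvectors \(e_k\) and \(f_k\) and the given sample covariance matrix. These are raw coefficients: they apply to the variables in their original units and are scaled so that \(U_1\) and \(V_1\) have unit sample variance, so their magnitudes are not directly comparable across variables. Eigenvector signs are arbitrary; with the signs printed above, \(a_1' S_{12} b_1 \approx -0.517\), so one of the two vectors should be multiplied by -1 for \(U_1\) and \(V_1\) to have the positive correlation 0.5173449. The first canonical variate \(U_1\) is dominated mostly by the insulin responses, whereas \(V_1\) is influenced mostly by the relative weight.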
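The coefficient vectors, their scaling and their signs can be checked as follows. This is a sketch in base R under stated assumptions: `inv_sqrt` is a hypothetical helper, and instead of a second eigen decomposition \(b_1\) is obtained from \(a_1\) through the standard relation \(b_1 \propto S_{22}^{-1} S_{21} a_1\), which makes the correlation between \(U_1\) and \(V_1\) positive by construction. The overall sign of the pair remains arbitrary. The standardized coefficients at the end (raw coefficient times standard deviation) are an addition that makes the weights comparable across variables.

```r
S <- matrix(c(
  1106.000,  396.700,  108.400,  0.787,  26.230,
   396.700, 2382.000, 1143.000, -0.214, -23.960,
   108.400, 1143.000, 2136.000,  2.189, -20.840,
     0.787,   -0.214,    2.189,  0.016,   0.216,
    26.230,  -23.960,  -20.840,  0.216,  70.560
), nrow = 5, byrow = TRUE)
p <- 3
q <- 2
S11 <- S[1:p, 1:p]
S22 <- S[(p + 1):(p + q), (p + 1):(p + q)]
S12 <- S[1:p, (p + 1):(p + q)]
S21 <- t(S12)

# Hypothetical helper: inverse symmetric square root via the spectral decomposition
inv_sqrt <- function(A) {
  e <- eigen(A, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
}

# a1 = S11^(-1/2) e1, with e1 the leading eigenvector of the p x p matrix
M <- inv_sqrt(S11) %*% S12 %*% solve(S22) %*% S21 %*% inv_sqrt(S11)
e1 <- eigen((M + t(M)) / 2, symmetric = TRUE)$vectors[, 1]
a1 <- drop(inv_sqrt(S11) %*% e1)

# b1 proportional to S22^(-1) S21 a1, rescaled so that V1 has unit variance
b1 <- drop(solve(S22, S21 %*% a1))
b1 <- b1 / sqrt(drop(t(b1) %*% S22 %*% b1))

var_U1 <- drop(t(a1) %*% S11 %*% a1)    # should be 1
var_V1 <- drop(t(b1) %*% S22 %*% b1)    # should be 1
cor_U1V1 <- drop(t(a1) %*% S12 %*% b1)  # first canonical correlation, positive

# Standardized coefficients: comparable across variables with different units
a1_std <- a1 * sqrt(diag(S11))
b1_std <- b1 * sqrt(diag(S22))

print(a1); print(b1); print(cor_U1V1)
print(a1_std); print(b1_std)
```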

e) Are the “significant” canonical variates good summary measures of the respective data sets?

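From the test in b), only the first pair of canonical variates is significant. We therefore compute the proportion of standardized sample variance of each set that is explained by its first canonical variate, i.e. the average squared correlation between the canonical variate and the variables of its own set.

No: \(U_1\) accounts for only 22.94% of the variance of the primary variables, and \(V_1\) accounts for 48.87% of the variance of the secondary variables (an average over q = 2 squared correlations).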
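The proportions of explained variance can be recomputed from the printed raw coefficients. The sketch below assumes the coefficient values from part d) and averages the squared correlations over the number of variables in each set, p = 3 for the primary set and q = 2 for the secondary set. For the secondary set this gives about 48.9%; dividing its sum of squared correlations by 3 instead of 2 yields 32.58%.

```r
# Diagonal blocks of the sample covariance matrix
S11 <- matrix(c(1106.0,  396.7,  108.4,
                 396.7, 2382.0, 1143.0,
                 108.4, 1143.0, 2136.0), nrow = 3, byrow = TRUE)
S22 <- matrix(c(0.016,  0.216,
                0.216, 70.560), nrow = 2, byrow = TRUE)

# Raw coefficients of the first canonical pair, as printed in part d)
a1 <- c(0.01310065, -0.01443825, 0.02339972)
b1 <- c(-8.065575, 0.01915905)

# Correlations between each canonical variate and the variables of its own set
# (the squared values do not depend on the sign of a1 or b1)
r_U1_x1 <- drop(S11 %*% a1) / sqrt(diag(S11))
r_V1_x2 <- drop(S22 %*% b1) / sqrt(diag(S22))

# Proportion of standardized sample variance explained: mean of squared correlations
prop_U1 <- mean(r_U1_x1^2)  # average over p = 3 variables
prop_V1 <- mean(r_V1_x2^2)  # average over q = 2 variables

cat(sprintf("U1: %.2f%%  V1: %.2f%%\n", 100 * prop_U1, 100 * prop_V1))
```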

f) Give your opinion on the success of this canonical correlation analysis.

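As seen in part e), the first canonical pair summarizes its own sets only partially: \(U_1\) captures 22.94% of the variance of the primary variables and \(V_1\) captures 48.87% of the variance of the secondary variables. Together with a modest squared canonical correlation (\(\rho_1^{*2} \approx 0.27\)) and only one significant pair, this means that the association detected in part a) describes a limited part of the data. Therefore, we do not consider the analysis a success.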

Appendix

knitr::opts_chunk$set(echo = TRUE)
library(expm)
library(kableExtra)
# Reading data
S <- as.matrix(read.table("P10-16.DAT")) #sample covariance matrix
colnames(S) <- c("X1(1)", "X2(1)","X3(1)", "X1(2)", "X2(2)")

n <- 46 # number of nondiabetic patients behind the covariance matrix
p <- 3
q <- 2

# Submatrices
S11 <- as.matrix(S[1:3, 1:3])
S22 <- as.matrix(S[4:5,4:5])
S12 <- as.matrix(S[1:3,4:5])
S21 <- as.matrix(S[4:5,1:3])
 
P11 <- cov2cor(S11)
P22 <- cov2cor(S22)

#Eigen values and vectors
eigen <- eigen(solve(sqrtm(S11)) %*% S12 %*% solve(S22) %*% S21 %*% solve(sqrtm(S11)))
kable(S, format = "markdown", align = "c")
c(sqrt(eigen$values[1]), sqrt(eigen$values[2]), 0)

a <- paste0(-(n-1-0.5*(p+q+1))*log((1-eigen$values[1])*(1-eigen$values[2]))) #13.74948
b <- paste0(qchisq(0.95, 6))
cat(a,">=",b)

a <- paste0(-(n-1-0.5*(p+q+1))*log((1-eigen$values[2]))) #0.6668632
b <- paste0(qchisq(0.95, 2))
cat(a,">=",b)
eigen$values[1] # squared first canonical correlation (the only significant one)
# eigen decomposition of the p x p matrix S11^(-1/2) S12 S22^(-1) S21 S11^(-1/2) (vectors e_k)
eigen = eigen(solve(sqrtm(S11)) %*% S12 %*% solve(S22) %*% S21 %*% solve(sqrtm(S11)))
# eigen decomposition of the q x q matrix S22^(-1/2) S21 S11^(-1) S12 S22^(-1/2) (vectors f_k)
eigen2 = eigen(solve(sqrtm(S22)) %*% S21 %*% solve(S11) %*% S12 %*% solve(sqrtm(S22))) 
# raw coefficients for the canonical variates

# Coefficient vectors a1 and b1 of the first canonical pair, stored as row vectors
# (a1 = con_coef1, b1 = con_coef2); as a column: solve(sqrtm(S11)) %*% eigen$vectors[,1]
# Eigenvector signs are arbitrary: if con_coef1 %*% S12 %*% t(con_coef2) is negative,
# flip the sign of one of the two vectors
con_coef1 = t(as.matrix(eigen$vectors[,1]))%*%solve(sqrtm(S11))
con_coef2 = t(as.matrix(eigen2$vectors[,1]))%*%solve(sqrtm(S22))
cat("raw coefficients for the canonical values of S11","\n",con_coef1)

cat("raw coefficients for the canonical values of S22","\n",con_coef2)

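# Correlations between canonical variates and their component variables:
# standardized coefficients (raw coefficient times standard deviation) times the correlation matrix

cor_can_compU <- (con_coef1 * sqrt(diag(S11))) %*% P11
cor_can_compV <- (con_coef2 * sqrt(diag(S22))) %*% P22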

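# Sample correlations between each canonical variate and the variables of its own set

r_ux1 <- con_coef1%*%S11 %*%solve(sqrtm(diag(diag(S11))))
r_vx1 <- con_coef2%*%S22 %*%solve(sqrtm(diag(diag(S22))))

# Proportion of standardized sample variance explained: average over p (or q) variables
prop_sample_var_explained_by_con_var_1 <- 1/p * sum(r_ux1^2)
prop_sample_var_explained_by_con_var_2 <- 1/q * sum(r_vx1^2)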