Question: Canonical correlation analysis by utilizing suitable software

Let X(1) and X(2) denote the primary and secondary sets of variables, respectively:

X1(1): glucose intolerance
X2(1): insulin response to oral glucose
X3(1): insulin resistance
X1(2): relative weight
X2(2): fasting plasma glucose

The sample covariance matrix is:

X1(1)	X2(1)	X3(1)	X1(2)	X2(2)
1106.000	396.700	108.400	0.787	26.230
396.700	2382.000	1143.000	-0.214	-23.960
108.400	1143.000	2136.000	2.189	-20.840
0.787	-0.214	2.189	0.016	0.216
26.230	-23.960	-20.840	0.216	70.560

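The canonical correlations for this matrix are (the second set has only two variables, so at most two can be nonzero; the third entry is set to 0):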

## [1] 0.5173449 0.1255082 0.0000000
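The values above can be reproduced directly from the covariance matrix. The following is a minimal base R sketch, assuming the matrix entries exactly as printed in the table; it uses the eigenvalues of S22^-1 S21 S11^-1 S12, which are the squared canonical correlations, rather than the symmetric form used in the appendix.

```r
# Sample covariance matrix as printed above
# (rows/columns: X1(1), X2(1), X3(1), X1(2), X2(2))
S <- matrix(c(
  1106.000,  396.700,  108.400,  0.787,  26.230,
   396.700, 2382.000, 1143.000, -0.214, -23.960,
   108.400, 1143.000, 2136.000,  2.189, -20.840,
     0.787,   -0.214,    2.189,  0.016,   0.216,
    26.230,  -23.960,  -20.840,  0.216,  70.560
), nrow = 5, byrow = TRUE)

S11 <- S[1:3, 1:3]; S12 <- S[1:3, 4:5]
S21 <- S[4:5, 1:3]; S22 <- S[4:5, 4:5]

# The eigenvalues of S22^{-1} S21 S11^{-1} S12 (a 2 x 2 matrix) are the
# squared canonical correlations, so there are min(p, q) = 2 of them.
M   <- solve(S22) %*% S21 %*% solve(S11) %*% S12
rho <- sqrt(sort(Re(eigen(M)$values), decreasing = TRUE))
print(rho)
```

Up to rounding, `rho` should agree with the first two values reported above.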

a) Test at the 5% level if there is any association between the groups of variables.

To find out whether the two sets of variables are associated, we test the hypothesis

\(H_0: \Sigma_{12} = 0\) (equivalently, \(\rho_1 = \rho_2 = \dots = \rho_p = 0\))

Following formula (10-39) (Chapter 10, Int. Ed.), \(H_0\) is rejected at level \(\alpha\) if

\(-\left(n - 1 - \tfrac{1}{2}(p + q + 1)\right) \ln \prod_{i=1}^{p} \left(1 - \rho_i^2\right) \geq \chi^2_{pq}(\alpha)\)

With p = 3, q = 2 and n = 46, the result is:

## 13.7494849041102 >= 12.591587243744

The null hypothesis is rejected, so at least one canonical correlation differs from 0. It is therefore feasible to pursue a canonical analysis.
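As a check on the arithmetic, the test statistic can be worked out by hand from the reported canonical correlations (values rounded to four decimals, so the last digit is approximate):

```latex
\begin{aligned}
n - 1 - \tfrac{1}{2}(p + q + 1) &= 46 - 1 - \tfrac{1}{2}(6) = 42 \\
\prod_{i=1}^{p} \left(1 - \rho_i^2\right)
  &= (1 - 0.5173449^2)(1 - 0.1255082^2)(1 - 0^2)
  \approx 0.7324 \times 0.9842 \approx 0.7208 \\
-42 \ln(0.7208) &\approx 13.75 \;\geq\; \chi^2_{6}(0.05) \approx 12.59
\end{aligned}
```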

b) How many pairs of canonical variates are significant?

To find out how many canonical variates are significant, we test the hypothesis

\(H_0^{(1)}: \rho_1 \neq 0,\ \rho_2 = \dots = \rho_p = 0\)

Following formula (10-40) (Chapter 10, Int. Ed.), \(H_0^{(k)}\) is rejected at level \(\alpha\) if

\(-\left(n - 1 - \tfrac{1}{2}(p + q + 1)\right) \ln \prod_{i=k+1}^{p} \left(1 - \rho_i^2\right) \geq \chi^2_{(p-k)(q-k)}(\alpha)\)

With k = 1, so that the degrees of freedom are (p-1)(q-1) = 2, the result is:

## 0.666863241993919 >= 5.99146454710798

The null hypothesis cannot be rejected, so only the first pair of canonical variates is significant.
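The two tests in parts a and b follow the same pattern with k = 0 and k = 1. A minimal R sketch of the sequential procedure is given below, assuming the canonical correlations reported above; the names `res` and `n_sig` are illustrative and do not appear in the appendix.

```r
n <- 46; p <- 3; q <- 2
rho   <- c(0.5173449, 0.1255082)   # nonzero canonical correlations reported above
alpha <- 0.05
mult  <- n - 1 - 0.5 * (p + q + 1)

# For k = 0, 1: test H0(k) that all canonical correlations after the k-th are zero
res <- do.call(rbind, lapply(0:(length(rho) - 1), function(k) {
  stat <- -mult * sum(log(1 - rho[(k + 1):length(rho)]^2))
  df   <- (p - k) * (q - k)
  crit <- qchisq(1 - alpha, df)
  data.frame(k = k, stat = stat, df = df, crit = crit, reject = stat > crit)
}))
print(res)

# Number of significant pairs: rejections before the first non-rejection
n_sig <- if (all(res$reject)) nrow(res) else which(!res$reject)[1] - 1
```

Under these assumptions the k = 0 test rejects and the k = 1 test does not, so `n_sig` is 1.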

c) Interpret the “significant” squared canonical correlations.

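The canonical correlations here (0.5173449, 0.1255082, 0.0000000) are also the multiple correlation coefficients of U_k with X(2) and of V_k with X(1). The kth squared canonical correlation ρ_k^2 is the proportion of the variance of the canonical variate U_k explained by the set X(2), or of V_k by the set X(1). Here only ρ_k^2 with k = 1 is significant; its value is 0.5173449^2 ≈ 0.2676, so about 27% of the variance of U_1 is explained by X(2), and likewise for V_1 and X(1). The second squared canonical correlation, 0.1255082^2 ≈ 0.0158, is not significant.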

It is also sometimes regarded as a measure of set overlap.

d) Interpret the canonical variates by using the coefficients and suitable correlations.

The raw canonical coefficients are obtained from the following equations (p. 545, Chapter 10, Int. Ed.):

\(a_k = \Sigma_{11}^{-1/2} e_k\)

\(b_k = \Sigma_{22}^{-1/2} f_k\)

## raw coefficients for the canonical variate of S11 
##  0.01310065 -0.01443825 0.02339972

## raw coefficients for the canonical variate of S22 
##  -8.065575 0.01915905

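The coefficient vectors above are computed from the eigenvectors e_k and f_k and the given sample covariance matrix. Both are raw (unstandardized) coefficients, so their magnitudes depend on the units of the variables and cannot be compared directly. Multiplying each coefficient by the standard deviation of its variable gives standardized coefficients of about (0.44, -0.70, 1.08) for U_1 and (-1.02, 0.16) for V_1. U_1 is therefore dominated by insulin resistance, contrasted with the insulin response to oral glucose, whereas V_1 is determined almost entirely by relative weight. The sign of each eigenvector is arbitrary: with the signs printed above, a_1' S_12 b_1 ≈ -0.517, so one of the two vectors should be multiplied by -1 for U_1 and V_1 to be positively correlated.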
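Part d also asks for suitable correlations. The sketch below, in base R, standardizes the raw coefficients and computes the correlations between each canonical variate and the variables of its own set. It assumes the covariance entries and raw coefficients exactly as printed above, and it flips the sign of b_1 when needed so that U_1 and V_1 are positively correlated; the variable names are illustrative.

```r
S11 <- matrix(c(1106.0,  396.7,  108.4,
                 396.7, 2382.0, 1143.0,
                 108.4, 1143.0, 2136.0), nrow = 3, byrow = TRUE)
S12 <- matrix(c( 0.787,  26.23,
                -0.214, -23.96,
                 2.189, -20.84), nrow = 3, byrow = TRUE)
S22 <- matrix(c(0.016,  0.216,
                0.216, 70.560), nrow = 2, byrow = TRUE)

a1 <- c(0.01310065, -0.01443825, 0.02339972)  # raw coefficients reported above
b1 <- c(-8.065575, 0.01915905)

# Eigenvector signs are arbitrary: flip b1 if Cov(U1, V1) = a1' S12 b1 is negative
if (drop(t(a1) %*% S12 %*% b1) < 0) b1 <- -b1

# Standardized coefficients: raw coefficient times the variable's standard deviation
a1_std <- a1 * sqrt(diag(S11))
b1_std <- b1 * sqrt(diag(S22))

# Correlations between each variate and its own variables; Var(U1) = Var(V1) = 1,
# so only the variables' standard deviations appear in the denominator
r_U1_x1 <- drop(a1 %*% S11) / sqrt(diag(S11))
r_V1_x2 <- drop(b1 %*% S22) / sqrt(diag(S22))

print(round(rbind(a1_std, r_U1_x1), 3))
print(round(rbind(b1_std, r_V1_x2), 3))
```

Comparing each standardized coefficient with the corresponding correlation shows whether a variable contributes to the variate directly or mainly through its correlation with the other variables in the set.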

e) Are the “significant” canonical variates good summary measures of the respective data sets?

From the test in part b, only the first pair of canonical variates is significant. We therefore compute the proportion of the standardized sample variance of each set that is explained by its first canonical variate.

No. U_1 accounts for only 22.94% of the variance of the first set of variables, and V_1 accounts for only 48.87% of the variance of the second set (the sum of squared correlations, 0.9773, divided by q = 2).
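The proportions can be checked with a short base R sketch. It assumes the covariance blocks and raw coefficients printed earlier, and averages the squared variate-variable correlations over the number of variables in each set (p = 3 for the first set, q = 2 for the second); `prop_var` is an illustrative helper, not a function from the appendix.

```r
S11 <- matrix(c(1106.0,  396.7,  108.4,
                 396.7, 2382.0, 1143.0,
                 108.4, 1143.0, 2136.0), nrow = 3, byrow = TRUE)
S22 <- matrix(c(0.016,  0.216,
                0.216, 70.560), nrow = 2, byrow = TRUE)

a1 <- c(0.01310065, -0.01443825, 0.02339972)  # raw coefficients reported in part d
b1 <- c(-8.065575, 0.01915905)

# Proportion of standardized variance of a set explained by one canonical variate:
# the mean of the squared correlations between the variate and each variable
prop_var <- function(coef, Sxx) {
  r <- drop(coef %*% Sxx) / sqrt(diag(Sxx))  # variate has unit variance
  mean(r^2)                                  # divides by the set's own size
}

prop_U1 <- prop_var(a1, S11)   # about 0.229
prop_V1 <- prop_var(b1, S22)   # about 0.489
print(c(prop_U1 = prop_U1, prop_V1 = prop_V1))
```

Using `mean()` keeps the divisor tied to the size of each set, and the sign of the coefficient vector does not matter because the correlations are squared.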

f) Give your opinion on the success of this canonical correlation analysis.

As seen in part e, the first canonical pair is not a good summary of the two data sets: U_1 captures only 22.94% of the variance of the first variable group and V_1 only 48.87% of the second. The test in part a shows that the first canonical correlation (0.517) is significantly different from zero, but the association it describes is moderate and involves only a limited part of the variability in the data. Therefore, we do not consider the analysis a success.

Appendix

knitr::opts_chunk$set(echo = TRUE)
library(expm)
library(kableExtra)
# Reading data
S <- as.matrix(read.table("P10-16.DAT")) #sample covariance matrix
colnames(S) <- c("X1(1)", "X2(1)","X3(1)", "X1(2)", "X2(2)")

n <- 46 #non diabetic patients yield the covariance matrix
p <- 3
q <- 2

# Submatrices
S11 <- as.matrix(S[1:3, 1:3])
S22 <- as.matrix(S[4:5,4:5])
S12 <- as.matrix(S[1:3,4:5])
S21 <- as.matrix(S[4:5,1:3])
 
P11 <- cov2cor(S11)
P22 <- cov2cor(S22)

# Eigenvalues and eigenvectors of S11^(-1/2) S12 S22^(-1) S21 S11^(-1/2);
# the eigenvalues are the squared canonical correlations
S11_inv_sqrt <- solve(sqrtm(S11))
S22_inv_sqrt <- solve(sqrtm(S22))
eig1 <- eigen(S11_inv_sqrt %*% S12 %*% solve(S22) %*% S21 %*% S11_inv_sqrt)
# Same for the second set: S22^(-1/2) S21 S11^(-1) S12 S22^(-1/2)
eig2 <- eigen(S22_inv_sqrt %*% S21 %*% solve(S11) %*% S12 %*% S22_inv_sqrt)
kable(S, format = "markdown", align = "c")
# Canonical correlations (the third is 0 because min(p, q) = 2)
c(sqrt(eig1$values[1]), sqrt(eig1$values[2]), 0)

# (a) Test of H0: Sigma12 = 0, df = p*q = 6
a <- paste0(-(n-1-0.5*(p+q+1))*log((1-eig1$values[1])*(1-eig1$values[2]))) # 13.74948
b <- paste0(qchisq(0.95, p*q))
cat(a,">=",b)

# (b) Test of H0: rho1 != 0, rho2 = 0, df = (p-1)*(q-1) = 2
a <- paste0(-(n-1-0.5*(p+q+1))*log((1-eig1$values[2]))) # 0.6668632
b <- paste0(qchisq(0.95, (p-1)*(q-1)))
cat(a,">=",b)
# (c) Squared first canonical correlation (the only significant one)
eig1$values[1]

# (d) Raw coefficient vectors a1 and b1 (a1 = con_coef1, b1 = con_coef2), stored as
# row vectors; can also be written as: con_coef1 <- solve(sqrtm(S11)) %*% eig1$vectors[,1]
con_coef1 <- t(as.matrix(eig1$vectors[,1])) %*% S11_inv_sqrt
con_coef2 <- t(as.matrix(eig2$vectors[,1])) %*% S22_inv_sqrt
cat("raw coefficients for the canonical variate of S11","\n",con_coef1)

cat("raw coefficients for the canonical variate of S22","\n",con_coef2)

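# Correlations between the first canonical variates and their component variables:
# (standardized coefficients) %*% (correlation matrix)
cor_can_compU <- (con_coef1 * sqrt(diag(S11))) %*% P11
cor_can_compV <- (con_coef2 * sqrt(diag(S22))) %*% P22

# The same sample correlations computed from the covariance matrices:
# a' S11 D11^(-1/2), valid because Var(U1) = Var(V1) = 1
r_ux1 <- con_coef1 %*% S11 %*% diag(1/sqrt(diag(S11)))
r_vx2 <- con_coef2 %*% S22 %*% diag(1/sqrt(diag(S22)))

# Proportion of standardized sample variance explained by the first canonical
# variate of each set: divide by the number of variables in that set (p or q)
prop_sample_var_explained_by_con_var_1 <- 1/p * sum(r_ux1^2)
prop_sample_var_explained_by_con_var_2 <- 1/q * sum(r_vx2^2)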

Assignment 4

Group 10: Oriol Garrobé, Björn Kurt Hansen, Erik Anders, Kai Takac

12/12/2019
