The following R commands perform a simple CCA on the data set described below (download it from the “Data Sets” folder). The data come from the 2000 US News rankings of 42 US colleges on several evaluation items. The six variables are Academic (academic reputation), Grad_rate (graduation rate), Accept_rate (acceptance rate for applications), SAT_P25 (SAT 25th-percentile score), SAT_P75 (SAT 75th-percentile score), and HS_P10 (proportion of students from the top 10% of their high school class). The data are in the “new_col2000.txt” file (read below from a CSV copy):
col1 <- read.csv('~/Desktop/newdata.csv')   # read the raw data
col <- col1[,-1]                            # drop the first column (college names)
rownames(col) <- col1[,1]                   # use the college names as row labels
head(col)
## Academic Grad_rate SAT_P25 SAT_P75 HS_P10 Accept_rate
## Caltech 4.7 0.85 1420 1570 1.00 0.18
## Harvard 4.9 0.96 1400 1580 0.90 0.12
## MIT 4.9 0.92 1400 1560 0.95 0.22
## Princeton 4.8 0.95 1360 1540 0.93 0.13
## Yale 4.8 0.94 1360 1540 0.95 0.18
## Stanford 4.9 0.92 1360 1540 0.87 0.13
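As a quick sanity check (a small sketch, not part of the original output), the data should contain 42 colleges and 6 variables:
dim(col)   # expect 42 rows (colleges) and 6 columns (variables)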
We then separate the six variables into two groups. The first group contains Academic, Grad_rate, and Accept_rate; the second group contains SAT_P25, SAT_P75, and HS_P10.
x <- cbind(col[,1], col[,2], col[,6])   # Academic, Grad_rate, Accept_rate
y <- cbind(col[,3], col[,4], col[,5])   # SAT_P25, SAT_P75, HS_P10
Standardize all variables and apply CCA to the two groups using the R function “cancor”.
cxy <- cancor(scale(x, scale=T, center=T), scale(y, scale=T, center=T))
Check out the results:
cxy
## $cor
## [1] 0.9138012 0.4853202 0.1060088
##
## $xcoef
## [,1] [,2] [,3]
## [1,] 0.08974032 -0.19011912 -0.06046737
## [2,] 0.03330972 0.16560726 -0.22826530
## [3,] -0.04986535 -0.04022887 -0.28403564
##
## $ycoef
## [,1] [,2] [,3]
## [1,] -0.17690301 0.85320991 -0.14228895
## [2,] 0.30493077 -0.84344730 0.01596496
## [3,] 0.03635201 0.05630199 0.16833153
##
## $xcenter
## [1] -5.815454e-16 4.150119e-16 1.902000e-17
##
## $ycenter
## [1] 1.060412e-15 -4.753142e-16 -1.752897e-16
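As a quick check (a sketch using only the objects defined above), the first canonical correlation should equal the ordinary correlation between the first pair of canonical scores:
u1 <- scale(x) %*% cxy$xcoef[,1]   # scores on the first canonical variate of x
v1 <- scale(y) %*% cxy$ycoef[,1]   # scores on the first canonical variate of y
cor(u1, v1)                        # should reproduce cxy$cor[1] = 0.9138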
To determine the number of canonical variates needed for a meaningful interpretation of the CCA, one can perform Wilks’ test for the significance of the canonical correlations. However, we then need to check the assumption of multivariate normality for all variables. This can be done with the package “MVN”:
library(lme4)
## Loading required package: Matrix
library(MVN)
## sROC 0.1-2 loaded
We run (i) Mardia’s, (ii) Henze-Zirkler’s, and (iii) Royston’s multivariate normality tests. (i) Mardia’s test:
Mardia <- mvn(data = col, mvnTest = "mardia", univariateTest = "SW")
Mardia$multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 90.4959189840481 0.00239757245324619 NO
## 2 Mardia Kurtosis 0.231402031484855 0.817002487092545 YES
## 3 MVN <NA> <NA> NO
(ii) Henze-Zirkler’s test:
Hz <- mvn(data = col, mvnTest = "hz", univariateTest = "SW")
Hz$multivariateNormality
## Test HZ p value MVN
## 1 Henze-Zirkler 1.277839 1.017881e-09 NO
(iii) Royston’s test:
Roy <- mvn(data = col, mvnTest = "royston", univariateTest = "SW")
Roy$multivariateNormality
## Test H p value MVN
## 1 Royston 10.06986 0.01368027 NO
All three tests reject the assumption of multivariate normality! What can we do?
Next, we can (i) check for outliers and/or (ii) check the normality of each individual variable, and then decide how to proceed. Let us check for outliers first, using the “adjusted robust Mahalanobis distance”:
result <- mvn(col, multivariateOutlierMethod = 'adj', showOutliers = TRUE, showNewData = TRUE)
result
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 90.4959189840481 0.00239757245324619 NO
## 2 Mardia Kurtosis 0.231402031484855 0.817002487092545 YES
## 3 MVN <NA> <NA> NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk Academic 0.9459 0.0462 NO
## 2 Shapiro-Wilk Grad_rate 0.9236 0.0079 NO
## 3 Shapiro-Wilk SAT_P25 0.9733 0.4238 YES
## 4 Shapiro-Wilk SAT_P75 0.9775 0.5659 YES
## 5 Shapiro-Wilk HS_P10 0.9147 0.0041 NO
## 6 Shapiro-Wilk Accept_rate 0.9598 0.1453 YES
##
## $Descriptives
## n Mean Std.Dev Median Min Max
## Academic 42 4.1690476 0.46721181 4.150 3.30 4.90
## Grad_rate 42 0.8361905 0.08883947 0.845 0.66 0.96
## SAT_P25 42 1230.9761905 103.29934906 1245.000 1000.00 1420.00
## SAT_P75 42 1423.8095238 89.38686489 1420.000 1230.00 1580.00
## HS_P10 42 0.7811905 0.16300834 0.830 0.37 1.00
## Accept_rate 42 0.4009524 0.18537779 0.385 0.12 0.79
## 25th 75th Skew Kurtosis
## Academic 3.8000 4.6000 0.01057444 -1.3068589
## Grad_rate 0.7725 0.9175 -0.44068064 -1.1407772
## SAT_P25 1170.0000 1290.0000 -0.32704835 -0.5237529
## SAT_P75 1355.0000 1487.5000 -0.18436202 -0.8640461
## HS_P10 0.6900 0.9000 -0.75691767 -0.4339846
## Accept_rate 0.2500 0.5475 0.28337625 -0.9868371
##
## $multivariateOutliers
## Observation Mahalanobis Distance Outlier
## UC_Santa_Barbara UC_Santa_Barbara 162.515 TRUE
## UC_Davis UC_Davis 151.050 TRUE
## UC_Irvine UC_Irvine 120.248 TRUE
## UC_San_Diego UC_San_Diego 75.173 TRUE
## UCLA UCLA 71.033 TRUE
## Georgia_Tech Georgia_Tech 45.572 TRUE
## U_C_Berkeley U_C_Berkeley 35.621 TRUE
## Caltech Caltech 24.566 TRUE
## Rice Rice 22.439 TRUE
## U_Washington U_Washington 20.483 TRUE
## Pennsylvania_State Pennsylvania_State 19.724 TRUE
## Emory Emory 19.661 TRUE
## U_Florida U_Florida 19.038 TRUE
## USC USC 18.420 TRUE
##
## $newData
## Academic Grad_rate SAT_P25 SAT_P75 HS_P10
## Boston_College 3.5 0.77 1200 1370 0.60
## Brandeis 3.7 0.79 1230 1400 0.65
## Brown 4.5 0.93 1290 1500 0.88
## C_William_and_Mary 3.8 0.89 1210 1400 0.74
## Carnegie_Mellon 4.2 0.75 1270 1470 0.69
## Columbia 4.6 0.90 1290 1490 0.87
## Cornell 4.7 0.91 1260 1450 0.82
## Dartmouth 4.4 0.94 1350 1520 0.88
## Duke 4.6 0.92 1300 1490 0.89
## Georgetown 3.9 0.91 1260 1450 0.80
## Harvard 4.9 0.96 1400 1580 0.90
## Johns_Hopkins 4.7 0.89 1290 1480 0.81
## MIT 4.9 0.92 1400 1560 0.95
## Northwestern 4.5 0.90 1280 1460 0.86
## Princeton 4.8 0.95 1360 1540 0.93
## Stanford 4.9 0.92 1360 1540 0.87
## Tufts 3.6 0.87 1250 1420 0.69
## Tulane 3.5 0.72 1170 1350 0.47
## U_Chicago 4.7 0.84 1250 1460 0.69
## U_NC_Chapel_Hill 4.2 0.82 1130 1340 0.69
## U_Notre_Dame 3.9 0.94 1240 1400 0.84
## U_Pennsylvania 4.4 0.90 1300 1480 0.90
## U_Virginia 4.3 0.92 1210 1410 0.79
## UT_Austin 4.0 0.66 1090 1310 0.44
## Vanderbilt 4.1 0.82 1230 1420 0.63
## Wake_Forest 3.3 0.82 1220 1390 0.64
## Washington_U_St_Louis 4.1 0.86 1250 1420 0.77
## Yale 4.8 0.94 1360 1540 0.95
## Accept_rate
## Boston_College 0.40
## Brandeis 0.57
## Brown 0.17
## C_William_and_Mary 0.45
## Carnegie_Mellon 0.42
## Columbia 0.14
## Cornell 0.34
## Dartmouth 0.21
## Duke 0.28
## Georgetown 0.24
## Harvard 0.12
## Johns_Hopkins 0.41
## MIT 0.22
## Northwestern 0.33
## Princeton 0.13
## Stanford 0.13
## Tufts 0.33
## Tulane 0.79
## U_Chicago 0.61
## U_NC_Chapel_Hill 0.35
## U_Notre_Dame 0.42
## U_Pennsylvania 0.29
## U_Virginia 0.34
## UT_Austin 0.71
## Vanderbilt 0.59
## Wake_Forest 0.48
## Washington_U_St_Louis 0.37
## Yale 0.18
With the significance level “alpha = 0.05”, the procedure flags 14 potential multivariate outliers.
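The names of the flagged colleges can also be pulled out directly from the result object (a small sketch; the column names match the output above):
as.character(result$multivariateOutliers$Observation)   # the 14 flagged colleges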
The mvn call above also produces the adjusted chi-square Q-Q plot (figure not shown here). We now drop the 14 flagged observations and re-test multivariate normality on the reduced data:
newcol <- result$newData   # the data with the 14 outliers removed
Roy <- mvn(data = newcol, mvnTest = "royston", univariateTest = "SW")
Roy$multivariateNormality
## Test H p value MVN
## 1 Royston 6.127891 0.05244425 YES
The reduced data are now consistent with the assumption of multivariate normality (at Type I error rate 0.05). If you are not comfortable excluding 14 outliers (about 33.3% of the original observations), you may instead check the normality of each individual variable by looking at its Q-Q plot (or histogram) and decide what to do next.
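A minimal base-graphics sketch for inspecting each variable’s Q-Q plot:
par(mfrow = c(2, 3))               # one panel per variable
for (v in colnames(col)) {
  qqnorm(col[[v]], main = v)       # normal Q-Q plot for variable v
  qqline(col[[v]])                 # reference line through the quartiles
}
par(mfrow = c(1, 1))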
Note that the MVN package provides several choices of normality test for each variable (e.g., Shapiro-Wilk, Cramér-von Mises, Lilliefors (Kolmogorov-Smirnov), Shapiro-Francia, and Anderson-Darling).
Here we choose the popular KS test:
Lil <- mvn(col, univariateTest = 'Lillie')
Lil$Descriptives   # the Lilliefors test results themselves are in Lil$univariateNormality
## n Mean Std.Dev Median Min Max
## Academic 42 4.1690476 0.46721181 4.150 3.30 4.90
## Grad_rate 42 0.8361905 0.08883947 0.845 0.66 0.96
## SAT_P25 42 1230.9761905 103.29934906 1245.000 1000.00 1420.00
## SAT_P75 42 1423.8095238 89.38686489 1420.000 1230.00 1580.00
## HS_P10 42 0.7811905 0.16300834 0.830 0.37 1.00
## Accept_rate 42 0.4009524 0.18537779 0.385 0.12 0.79
## 25th 75th Skew Kurtosis
## Academic 3.8000 4.6000 0.01057444 -1.3068589
## Grad_rate 0.7725 0.9175 -0.44068064 -1.1407772
## SAT_P25 1170.0000 1290.0000 -0.32704835 -0.5237529
## SAT_P75 1355.0000 1487.5000 -0.18436202 -0.8640461
## HS_P10 0.6900 0.9000 -0.75691767 -0.4339846
## Accept_rate 0.2500 0.5475 0.28337625 -0.9868371
The KS tests show strong evidence of non-normality for the two variables “Grad_rate” and “HS_P10”. One way to deal with this is to transform these two variables so that they become approximately normal (e.g., a Box-Cox transformation may succeed; a sketch follows). If that does not work, you may consider removing some non-normal variables and performing the multivariate normality test again. Keep in mind, however, that removing a variable discards a whole column of data information (dropping both non-normal variables here would discard two of the six variables, i.e., 33.3% of them). The tradeoff between the model assumption and the completeness of the data information therefore needs to be taken seriously into account. In this case, we proceed with the analysis on the data with the 14 outliers removed, since that reduced data set has already passed the multivariate normality test.
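Here is a minimal sketch of a Box-Cox transformation for Grad_rate using MASS::boxcox (which requires a positive response; the choice of variable and the follow-up Shapiro-Wilk check are illustrative, not part of the original analysis):
library(MASS)
bc <- boxcox(lm(Grad_rate ~ 1, data = col), plotit = FALSE)  # profile likelihood over lambda
lambda <- bc$x[which.max(bc$y)]                # lambda maximizing the profile likelihood
grad_bc <- (col$Grad_rate^lambda - 1) / lambda # Box-Cox transform (assumes lambda != 0)
shapiro.test(grad_bc)                          # re-check univariate normality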
Note that the default package “stats” does not provide the likelihood ratio test for the significance of the eigenvalues (canonical correlations), so we need to install the package “CCP”:
library(CCP)
The following command performs “Rao’s approximate F-test” for the significance of the canonical correlations. Note that SAS reports the same output, which is nearly identical to the chi-squared test.
newx <- cbind(newcol[,1], newcol[,2], newcol[,6])   # Academic, Grad_rate, Accept_rate
newy <- cbind(newcol[,3], newcol[,4], newcol[,5])   # SAT_P25, SAT_P75, HS_P10
newcxy <- cancor(scale(newx, scale=T, center=T), scale(newy, scale=T, center=T))
rho <- newcxy$cor
p.asym(rho, 28, 3, 3, tstat = "Wilks")   # N = 28 observations, p = q = 3 variables
## Wilks' Lambda, using F-approximation (Rao's F):
## stat approx df1 df2 p.value
## 1 to 3: 0.02265883 22.314909 9 53.69282 3.774758e-15
## 2 to 3: 0.48470754 5.018019 4 46.00000 1.929350e-03
## 3 to 3: 0.99555919 0.107055 1 24.00000 7.463590e-01
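The same canonical correlations can also be tested with the other statistics implemented in CCP, for example (a sketch; output not shown):
p.asym(rho, 28, 3, 3, tstat = "Hotelling")   # Hotelling-Lawley trace
p.asym(rho, 28, 3, 3, tstat = "Pillai")      # Pillai-Bartlett trace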
If the significance level is chosen to be 5%, then the first two canonical correlations are significant. Check out the result again:
newcxy
## $cor
## [1] 0.97634654 0.71633118 0.06663944
##
## $xcoef
## [,1] [,2] [,3]
## [1,] -0.07542668 0.21667817 0.09637717
## [2,] -0.08301614 -0.22577702 0.24275370
## [3,] 0.05916659 -0.02854145 0.33801594
##
## $ycoef
## [,1] [,2] [,3]
## [1,] 0.1404316 -0.6790589 -0.6334100
## [2,] -0.1706895 0.8781599 0.3215675
## [3,] -0.1628499 -0.2073210 0.2585776
##
## $xcenter
## [1] -7.657565e-16 4.658972e-17 -1.028443e-16
##
## $ycenter
## [1] -1.309221e-15 1.420491e-15 4.308310e-16
Let us visualize the result in the first canonical variate space:
xx <- scale(newx, scale=T, center=T)
yy <- scale(newy, scale=T, center=T)
scorex <- xx %*% newcxy$xcoef[,1]   # scores on the first canonical variate of x
scorey <- yy %*% newcxy$ycoef[,1]   # scores on the first canonical variate of y
plot(scorex, scorey, type = "n")
text(scorex, scorey, row.names(newcol), cex = .6)
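The correlation between the two plotted score vectors reproduces the first canonical correlation (a one-line check):
cor(scorex, scorey)   # approximately 0.976, matching newcxy$cor[1]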
The above plot shows a strong linear relationship between the first pair of canonical variates; their correlation is the first canonical correlation, 0.976. Question: how do you interpret the result?