The following R commands perform a simple CCA on the data set described below (download it from the “Data Sets” folder). The data come from the 2000 US News rankings of 42 US colleges on several evaluation items. The six variables are Academic (academic reputation), Grad_rate (graduation rate), Accept_rate (acceptance rate for applications), SAT_P25 (SAT 25th-percentile score), SAT_P75 (SAT 75th-percentile score), and HS_P10 (proportion of students from the top 10% of their high school class). The data are in the “new_col2000.txt” file (read below from a CSV copy):
col1 <- read.csv('~/Desktop/newdata.csv')   # read the raw data
col <- col1[,-1]                            # drop the first column (college names)
rownames(col) <- col1[,1]                   # use the college names as row labels
head(col)
## Academic Grad_rate SAT_P25 SAT_P75 HS_P10 Accept_rate
## Caltech 4.7 0.85 1420 1570 1.00 0.18
## Harvard 4.9 0.96 1400 1580 0.90 0.12
## MIT 4.9 0.92 1400 1560 0.95 0.22
## Princeton 4.8 0.95 1360 1540 0.93 0.13
## Yale 4.8 0.94 1360 1540 0.95 0.18
## Stanford 4.9 0.92 1360 1540 0.87 0.13
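As a quick sanity check (a small sketch, not part of the original output), the data should contain 42 colleges and 6 variables:
dim(col)   # expect 42 rows (colleges) and 6 columns (variables)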
We then separate the six variables into two groups. The first group contains Academic, Grad_rate, and Accept_rate; the second group contains SAT_P25, SAT_P75, and HS_P10.
x <- cbind(col[,1], col[,2], col[,6])   # Academic, Grad_rate, Accept_rate
y <- cbind(col[,3], col[,4], col[,5])   # SAT_P25, SAT_P75, HS_P10
Standardize all variables and apply CCA to the two groups using the R function “cancor”.
cxy <- cancor(scale(x, scale=T, center=T), scale(y, scale=T, center=T))
Check out the results:
cxy
## $cor
## [1] 0.9138012 0.4853202 0.1060088
##
## $xcoef
## [,1] [,2] [,3]
## [1,] 0.08974032 -0.19011912 -0.06046737
## [2,] 0.03330972 0.16560726 -0.22826530
## [3,] -0.04986535 -0.04022887 -0.28403564
##
## $ycoef
## [,1] [,2] [,3]
## [1,] -0.17690301 0.85320991 -0.14228895
## [2,] 0.30493077 -0.84344730 0.01596496
## [3,] 0.03635201 0.05630199 0.16833153
##
## $xcenter
## [1] -5.815454e-16 4.150119e-16 1.902000e-17
##
## $ycenter
## [1] 1.060412e-15 -4.753142e-16 -1.752897e-16
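As a quick check (a sketch using only the objects defined above), the first canonical correlation should equal the ordinary correlation between the first pair of canonical scores:
u1 <- scale(x) %*% cxy$xcoef[,1]   # scores on the first canonical variate of x
v1 <- scale(y) %*% cxy$ycoef[,1]   # scores on the first canonical variate of y
cor(u1, v1)                        # should reproduce cxy$cor[1] = 0.9138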
To determine the number of canonical variates needed for a meaningful interpretation of the CCA, one can perform Wilks’ test for the significance of the canonical correlations. However, we then need to check the assumption of multivariate normality for all variables. This can be done with the package “MVN”:
library(lme4)
## Loading required package: Matrix
library(MVN)
## sROC 0.1-2 loaded
We run (i) Mardia’s, (ii) Henze-Zirkler’s, and (iii) Royston’s multivariate normality tests. (i) Mardia’s test:
Mardia <- mvn(data = col, mvnTest = "mardia", univariateTest = "SW")
Mardia$multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 90.4959189840481 0.00239757245324619 NO
## 2 Mardia Kurtosis 0.231402031484855 0.817002487092545 YES
## 3 MVN <NA> <NA> NO
(ii) Henze-Zirkler’s test:
Hz <- mvn(data = col, mvnTest = "hz", univariateTest = "SW")
Hz$multivariateNormality
## Test HZ p value MVN
## 1 Henze-Zirkler 1.277839 1.017881e-09 NO
(iii) Royston’s test:
Roy <- mvn(data = col, mvnTest = "royston", univariateTest = "SW")
Roy$multivariateNormality
## Test H p value MVN
## 1 Royston 10.06986 0.01368027 NO
All three tests reject the assumption of multivariate normality! What can we do?
Next, we can (i) check for outliers and/or (ii) check the normality of each individual variable, and then decide how to proceed. Let us check for outliers first, using the “adjusted robust Mahalanobis distance”:
result <- mvn(col, multivariateOutlierMethod = 'adj', showOutliers = TRUE, showNewData = TRUE)
result
## $multivariateNormality
## Test Statistic p value Result
## 1 Mardia Skewness 90.4959189840481 0.00239757245324619 NO
## 2 Mardia Kurtosis 0.231402031484855 0.817002487092545 YES
## 3 MVN <NA> <NA> NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk Academic 0.9459 0.0462 NO
## 2 Shapiro-Wilk Grad_rate 0.9236 0.0079 NO
## 3 Shapiro-Wilk SAT_P25 0.9733 0.4238 YES
## 4 Shapiro-Wilk SAT_P75 0.9775 0.5659 YES
## 5 Shapiro-Wilk HS_P10 0.9147 0.0041 NO
## 6 Shapiro-Wilk Accept_rate 0.9598 0.1453 YES
##
## $Descriptives
## n Mean Std.Dev Median Min Max
## Academic 42 4.1690476 0.46721181 4.150 3.30 4.90
## Grad_rate 42 0.8361905 0.08883947 0.845 0.66 0.96
## SAT_P25 42 1230.9761905 103.29934906 1245.000 1000.00 1420.00
## SAT_P75 42 1423.8095238 89.38686489 1420.000 1230.00 1580.00
## HS_P10 42 0.7811905 0.16300834 0.830 0.37 1.00
## Accept_rate 42 0.4009524 0.18537779 0.385 0.12 0.79
## 25th 75th Skew Kurtosis
## Academic 3.8000 4.6000 0.01057444 -1.3068589
## Grad_rate 0.7725 0.9175 -0.44068064 -1.1407772
## SAT_P25 1170.0000 1290.0000 -0.32704835 -0.5237529
## SAT_P75 1355.0000 1487.5000 -0.18436202 -0.8640461
## HS_P10 0.6900 0.9000 -0.75691767 -0.4339846
## Accept_rate 0.2500 0.5475 0.28337625 -0.9868371
##
## $multivariateOutliers
## Observation Mahalanobis Distance Outlier
## UC_Santa_Barbara UC_Santa_Barbara 162.515 TRUE
## UC_Davis UC_Davis 151.050 TRUE
## UC_Irvine UC_Irvine 120.248 TRUE
## UC_San_Diego UC_San_Diego 75.173 TRUE
## UCLA UCLA 71.033 TRUE
## Georgia_Tech Georgia_Tech 45.572 TRUE
## U_C_Berkeley U_C_Berkeley 35.621 TRUE
## Caltech Caltech 24.566 TRUE
## Rice Rice 22.439 TRUE
## U_Washington U_Washington 20.483 TRUE
## Pennsylvania_State Pennsylvania_State 19.724 TRUE
## Emory Emory 19.661 TRUE
## U_Florida U_Florida 19.038 TRUE
## USC USC 18.420 TRUE
##
## $newData
## Academic Grad_rate SAT_P25 SAT_P75 HS_P10
## Boston_College 3.5 0.77 1200 1370 0.60
## Brandeis 3.7 0.79 1230 1400 0.65
## Brown 4.5 0.93 1290 1500 0.88
## C_William_and_Mary 3.8 0.89 1210 1400 0.74
## Carnegie_Mellon 4.2 0.75 1270 1470 0.69
## Columbia 4.6 0.90 1290 1490 0.87
## Cornell 4.7 0.91 1260 1450 0.82
## Dartmouth 4.4 0.94 1350 1520 0.88
## Duke 4.6 0.92 1300 1490 0.89
## Georgetown 3.9 0.91 1260 1450 0.80
## Harvard 4.9 0.96 1400 1580 0.90
## Johns_Hopkins 4.7 0.89 1290 1480 0.81
## MIT 4.9 0.92 1400 1560 0.95
## Northwestern 4.5 0.90 1280 1460 0.86
## Princeton 4.8 0.95 1360 1540 0.93
## Stanford 4.9 0.92 1360 1540 0.87
## Tufts 3.6 0.87 1250 1420 0.69
## Tulane 3.5 0.72 1170 1350 0.47
## U_Chicago 4.7 0.84 1250 1460 0.69
## U_NC_Chapel_Hill 4.2 0.82 1130 1340 0.69
## U_Notre_Dame 3.9 0.94 1240 1400 0.84
## U_Pennsylvania 4.4 0.90 1300 1480 0.90
## U_Virginia 4.3 0.92 1210 1410 0.79
## UT_Austin 4.0 0.66 1090 1310 0.44
## Vanderbilt 4.1 0.82 1230 1420 0.63
## Wake_Forest 3.3 0.82 1220 1390 0.64
## Washington_U_St_Louis 4.1 0.86 1250 1420 0.77
## Yale 4.8 0.94 1360 1540 0.95
## Accept_rate
## Boston_College 0.40
## Brandeis 0.57
## Brown 0.17
## C_William_and_Mary 0.45
## Carnegie_Mellon 0.42
## Columbia 0.14
## Cornell 0.34
## Dartmouth 0.21
## Duke 0.28
## Georgetown 0.24
## Harvard 0.12
## Johns_Hopkins 0.41
## MIT 0.22
## Northwestern 0.33
## Princeton 0.13
## Stanford 0.13
## Tufts 0.33
## Tulane 0.79
## U_Chicago 0.61
## U_NC_Chapel_Hill 0.35
## U_Notre_Dame 0.42
## U_Pennsylvania 0.29
## U_Virginia 0.34
## UT_Austin 0.71
## Vanderbilt 0.59
## Wake_Forest 0.48
## Washington_U_St_Louis 0.37
## Yale 0.18
With the significance level “alpha = 0.05”, the procedure flags 14 potential multivariate outliers.
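The names of the flagged colleges can also be pulled out directly from the result object (a small sketch; the column names match the output above):
as.character(result$multivariateOutliers$Observation)   # the 14 flagged colleges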
The mvn call above also produces the adjusted chi-square Q-Q plot (figure not shown here). We now drop the 14 flagged observations and re-test multivariate normality on the reduced data:
newcol <- result$newData   # the data with the 14 outliers removed
Roy <- mvn(data = newcol, mvnTest = "royston", univariateTest = "SW")
Roy$multivariateNormality
## Test H p value MVN
## 1 Royston 6.127891 0.05244425 YES
The reduced data are now consistent with the assumption of multivariate normality (at Type I error rate 0.05). If you are not comfortable excluding 14 outliers (about 33.3% of the original observations), you may instead check the normality of each individual variable by looking at its Q-Q plot (or histogram) and decide what to do next.
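A minimal base-graphics sketch for inspecting each variable’s Q-Q plot:
par(mfrow = c(2, 3))               # one panel per variable
for (v in colnames(col)) {
  qqnorm(col[[v]], main = v)       # normal Q-Q plot for variable v
  qqline(col[[v]])                 # reference line through the quartiles
}
par(mfrow = c(1, 1))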
Note that the MVN package provides several choices of normality test for each variable (e.g., Shapiro-Wilk, Cramér-von Mises, Lilliefors (Kolmogorov-Smirnov), Shapiro-Francia, and Anderson-Darling).
Here we choose the popular KS test:
Lil <- mvn(col, univariateTest = 'Lillie')
Lil$Descriptives   # the Lilliefors test results themselves are in Lil$univariateNormality
## n Mean Std.Dev Median Min Max
## Academic 42 4.1690476 0.46721181 4.150 3.30 4.90
## Grad_rate 42 0.8361905 0.08883947 0.845 0.66 0.96
## SAT_P25 42 1230.9761905 103.29934906 1245.000 1000.00 1420.00
## SAT_P75 42 1423.8095238 89.38686489 1420.000 1230.00 1580.00
## HS_P10 42 0.7811905 0.16300834 0.830 0.37 1.00
## Accept_rate 42 0.4009524 0.18537779 0.385 0.12 0.79
## 25th 75th Skew Kurtosis
## Academic 3.8000 4.6000 0.01057444 -1.3068589
## Grad_rate 0.7725 0.9175 -0.44068064 -1.1407772
## SAT_P25 1170.0000 1290.0000 -0.32704835 -0.5237529
## SAT_P75 1355.0000 1487.5000 -0.18436202 -0.8640461
## HS_P10 0.6900 0.9000 -0.75691767 -0.4339846
## Accept_rate 0.2500 0.5475 0.28337625 -0.9868371
The KS tests show strong evidence of non-normality for the two variables “Grad_rate” and “HS_P10”. One way to deal with this is to transform these two variables so that they become approximately normal (e.g., a Box-Cox transformation may succeed; a sketch follows). If that does not work, you may consider removing some non-normal variables and performing the multivariate normality test again. Keep in mind, however, that removing a variable discards a whole column of data information (dropping both non-normal variables here would discard two of the six variables, i.e., 33.3% of them). The tradeoff between the model assumption and the completeness of the data information therefore needs to be taken seriously into account. In this case, we proceed with the analysis on the data with the 14 outliers removed, since that reduced data set has already passed the multivariate normality test.
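Here is a minimal sketch of a Box-Cox transformation for Grad_rate using MASS::boxcox (which requires a positive response; the choice of variable and the follow-up Shapiro-Wilk check are illustrative, not part of the original analysis):
library(MASS)
bc <- boxcox(lm(Grad_rate ~ 1, data = col), plotit = FALSE)  # profile likelihood over lambda
lambda <- bc$x[which.max(bc$y)]                # lambda maximizing the profile likelihood
grad_bc <- (col$Grad_rate^lambda - 1) / lambda # Box-Cox transform (assumes lambda != 0)
shapiro.test(grad_bc)                          # re-check univariate normality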
Note that the default package “stats” does not provide the likelihood ratio test for the significance of the eigenvalues (canonical correlations), so we need to install the package “CCP”:
library(CCP)
The following command performs “Rao’s approximate F-test” for the significance of the canonical correlations. Note that SAS reports the same output, which is nearly identical to the chi-squared test.
newx <- cbind(newcol[,1], newcol[,2], newcol[,6])   # Academic, Grad_rate, Accept_rate
newy <- cbind(newcol[,3], newcol[,4], newcol[,5])   # SAT_P25, SAT_P75, HS_P10
newcxy <- cancor(scale(newx, scale=T, center=T), scale(newy, scale=T, center=T))
rho <- newcxy$cor
p.asym(rho, 28, 3, 3, tstat = "Wilks")   # N = 28 observations, p = q = 3 variables
## Wilks' Lambda, using F-approximation (Rao's F):
## stat approx df1 df2 p.value
## 1 to 3: 0.02265883 22.314909 9 53.69282 3.774758e-15
## 2 to 3: 0.48470754 5.018019 4 46.00000 1.929350e-03
## 3 to 3: 0.99555919 0.107055 1 24.00000 7.463590e-01
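The same canonical correlations can also be tested with the other statistics implemented in CCP, for example (a sketch; output not shown):
p.asym(rho, 28, 3, 3, tstat = "Hotelling")   # Hotelling-Lawley trace
p.asym(rho, 28, 3, 3, tstat = "Pillai")      # Pillai-Bartlett trace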
If the significance level is chosen to be 5%, then the first two canonical correlations are significant. Check out the result again:
newcxy
## $cor
## [1] 0.97634654 0.71633118 0.06663944
##
## $xcoef
## [,1] [,2] [,3]
## [1,] -0.07542668 0.21667817 0.09637717
## [2,] -0.08301614 -0.22577702 0.24275370
## [3,] 0.05916659 -0.02854145 0.33801594
##
## $ycoef
## [,1] [,2] [,3]
## [1,] 0.1404316 -0.6790589 -0.6334100
## [2,] -0.1706895 0.8781599 0.3215675
## [3,] -0.1628499 -0.2073210 0.2585776
##
## $xcenter
## [1] -7.657565e-16 4.658972e-17 -1.028443e-16
##
## $ycenter
## [1] -1.309221e-15 1.420491e-15 4.308310e-16
Let us visualize the result in the first canonical variate space:
xx <- scale(newx, scale=T, center=T)
yy <- scale(newy, scale=T, center=T)
scorex <- xx %*% newcxy$xcoef[,1]   # scores on the first canonical variate of x
scorey <- yy %*% newcxy$ycoef[,1]   # scores on the first canonical variate of y
plot(scorex, scorey, type = "n")
text(scorex, scorey, row.names(newcol), cex = .6)
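The correlation between the two plotted score vectors reproduces the first canonical correlation (a one-line check):
cor(scorex, scorey)   # approximately 0.976, matching newcxy$cor[1]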
The above plot shows a strong linear relationship between the first pair of canonical variates; their correlation is the first canonical correlation, 0.976. Question: how do you interpret the result?