Ever been in a scenario where you needed to come up with pairwise correlation, covariance, or cosine matrices for data on the fly without the help of a function? Probably not. Even so, it’s worth taking a look at how these matrices could be calculated as there are some neat commonalities in their respective calculations. Playing a role in it all is the \(X^tX\) matrix, an old friend we met a few posts back when illustrating multiple regression from an algebraic viewpoint. Here, you can think of \(X\) as some data you’ve collected where every column is a different measure and every row a different subject.
Taking the data \(X\) as is and multiplying its transpose by itself \(X^tX\) results in the sum of squares cross products matrix (SSCP) where SS fall on the diagonal and cross proudcts on the off diagonal.
Centering \(X\), multiplying its transpose by itself \(X^tX\) and dividing by n - 1 (where n = # of rows in X) results in the variance covariance matrix with variances on the diagonal and covariances on the off diagonal.
Standardizing \(X\), multiplying its transpose by itself \(X^tX\) and dividing by n - 1 (where n = # of rows in X) results in the pearson correlation between variable pairs.
Unit-scaling \(X\) and multiplying its transpose by itself \(X^tX\) results in the cosine similarity between variable pairs.
Below, we will illustrate what was just discussed by creating a function that outputs one of the aforementiond matrices based on user input.
my_fun<-function(x,type =c("ss","cov","cor","cosine")){
type<-match.arg(type)
out<-matrix(rep(0,dim(x)[2]*dim(x)[2]),nrow=dim(x)[2])
if (type == "cov"){
x<-scale(x,center=TRUE,scale=FALSE)
out<-t(x)%*%(x)/(nrow(x)-1)
} else if (type =="cor"){
x<-scale(x,center=TRUE,scale=TRUE)
out<-t(x)%*%(x)/(nrow(x)-1)
} else if (type == "cosine"){
x<-apply(x, 2,function(x) x / sqrt(sum(x^2)) )
out<-t(x)%*%(x)
} else {
out<-t(x)%*%(x)
}
return(out)
}
Now, that we have our function let’s make some data to apply it to.
dat<-cbind(replicate(4,sample(1:20,100,rep=TRUE)))
We can now compare the output of our function to the complement functions in base R. Starting with the \(SSCP\) matrix
crossprod(dat)
[,1] [,2] [,3] [,4]
[1,] 16498 11696 12111 12395
[2,] 11696 12972 9451 10356
[3,] 12111 9451 13220 9592
[4,] 12395 10356 9592 14279
my_fun(dat, type = "ss")
[,1] [,2] [,3] [,4]
[1,] 16498 11696 12111 12395
[2,] 11696 12972 9451 10356
[3,] 12111 9451 13220 9592
[4,] 12395 10356 9592 14279
cov(dat)
[,1] [,2] [,3] [,4]
[1,] 33.988283 3.772929 8.890909 1.688687
[2,] 3.772929 32.429899 -2.337374 -1.878384
[3,] 8.890909 -2.337374 36.525253 -8.733333
[4,] 1.688687 -1.878384 -8.733333 29.233434
my_fun(dat, type = "cov")
[,1] [,2] [,3] [,4]
[1,] 33.988283 3.772929 8.890909 1.688687
[2,] 3.772929 32.429899 -2.337374 -1.878384
[3,] 8.890909 -2.337374 36.525253 -8.733333
[4,] 1.688687 -1.878384 -8.733333 29.233434
cor(dat)
[,1] [,2] [,3] [,4]
[1,] 1.00000000 0.11364262 0.25233933 0.05357283
[2,] 0.11364262 1.00000000 -0.06791391 -0.06100587
[3,] 0.25233933 -0.06791391 1.00000000 -0.26726587
[4,] 0.05357283 -0.06100587 -0.26726587 1.00000000
my_fun(dat, type = "cor")
[,1] [,2] [,3] [,4]
[1,] 1.00000000 0.11364262 0.25233933 0.05357283
[2,] 0.11364262 1.00000000 -0.06791391 -0.06100587
[3,] 0.25233933 -0.06791391 1.00000000 -0.26726587
[4,] 0.05357283 -0.06100587 -0.26726587 1.00000000
We’ll load the library “philentropy” to check our work here as it contains many useful distance functions. Note that we are transposing our data as the default behavior of this function is to make pairwise comparisons of all rows.
library(philentropy)
distance(t(dat), method = "cosine")
v1 v2 v3 v4
v1 1.0000000 0.7994996 0.8200657 0.8075734
v2 0.7994996 1.0000000 0.7217031 0.7609212
v3 0.8200657 0.7217031 1.0000000 0.6981433
v4 0.8075734 0.7609212 0.6981433 1.0000000
my_fun(dat, type = "cosine")
[,1] [,2] [,3] [,4]
[1,] 1.0000000 0.7994996 0.8200657 0.8075734
[2,] 0.7994996 1.0000000 0.7217031 0.7609212
[3,] 0.8200657 0.7217031 1.0000000 0.6981433
[4,] 0.8075734 0.7609212 0.6981433 1.0000000