Interpreting Correlation: A Correlation of 0.8 Is High, but What Is It Like?

Neeraj Bhatnagar
June 7, 2019

Although it is usually possible to compute the correlation coefficient between two vectors of equal length, it is not easy to interpret a given value of this coefficient. This article empirically demonstrates that under certain circumstances, an inverse linear relationship exists between the correlation coefficient and classification error. This relationship makes it possible to interpret every possible correlation value in terms of a corresponding rate of classification error.

Introduction

The correlation coefficient between two vectors of equal length with corresponding elements is a widely used statistical measure. Though a correlation value between two such vectors with non-zero variances can always be computed and lies in the range [-1, 1], it is difficult to assign a meaning to a specific value of correlation, say, 0.8 or -0.3. Correlation values above 0.8 are deemed to indicate a strong positive linear relationship between the variables, while values between 0 and 0.3 indicate a weak relationship or none [1]. Interpretations such as these come in handy, but they are subjective and tell very little about specific correlation values.

This article presents a reasonable and objective interpretation of every admissible value of correlation. By the end of this article, the reader might agree that, under some reasonable circumstances, a correlation of 0.8 is related to a 10% classification error, a correlation of -0.6 means 80% disagreement between the classifiers, and there is an inverse linear relationship between correlation and classification error.

We treat real-valued vectors as classification vectors where positive values indicate membership in a class and negative values indicate otherwise. The magnitudes of the components do not mean much, except that their probability distribution is of interest to us. Given a probability p, we define a transformation ce(v, p) of the vector v and show that, when we choose v appropriately and increase p from 0 to 1 in steps, the expected value of the correlation between v and ce(v, p) declines linearly and predictably from 1 to -1. This systematic pattern of change allows us to use p as an interpreter of the correlation coefficient. We subsequently link p to the well-understood concept of classification error.

Building an interpretation of the correlation

Let v be a vector of finite length and finite, non-zero variance. The inverse vector -v of v inverts the sign of every element of v. The correlation of v with itself, cor(v, v), is 1.0 due to the perfect linear relationship of v with itself, and the correlation of v with its inverse, cor(v, -v), is -1.0. We define ce(v, p) as a transformation of v that independently inverts the sign of each component of v with probability p but leaves its magnitude unchanged. It follows that v = ce(v, 0.0), -v = ce(v, 1.0), and ce(v, 0.75) inverts the signs of about 75% [2] of randomly chosen components of v.

We can interpret v as the classification vector of a two-class classification problem where positive and negative signs indicate whether an element belongs to one class or the other. Under this interpretation, ce(v, p) represents a special kind of classification error in which randomly chosen p% of the elements are misclassified into the incorrect class (hence the name ce of the function). This error is of a special kind because it mismarks the class membership but leaves the magnitude of the evaluation unchanged. The following table and code provide a few illustrative examples of these concepts using (1,2,3,4,5,6,7) as the vector. (The R code used in the production of this article is available at the bottom of the article as a single program; we show the important code snippets inline.)

ce <- function(v, p) { #probabilistically introduce p% classification error in a perfect vector
  #flip sign probabilistically
  sapply(v, FUN = function(x) {if (runif(1) < p) -x else x})
}

Item                 | Value                    | Comment
v                    | (1,2,3,4,5,6,7)          | A small example vector.
ce(v, 1.0)           | (-1,-2,-3,-4,-5,-6,-7)   | Every component is inverted with p = 1.0.
ce(v, 0.0)           | (1,2,3,4,5,6,7)          | No component is inverted with p = 0.0.
ce(v, 0.5)           | (1,-2,-3,-4,-5,-6,7)     | About 50% of the values inverted.
ce(v, 0.5)           | (1,2,3,-4,-5,-6,7)       | A potentially different 50% of the values inverted.
cor(v, ce(v, 0.0))   | 1                        | Perfect correlation due to no inversion.
cor(v, ce(v, 1.0))   | -1                       | Perfect inverse correlation due to full inversion.
cor(v, ce(v, 0.5))   | 0.256                    | Correlation declines from 1.0 due to sign inversion.
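
The rows of this table are easy to reproduce with the ce function defined above. The two ce(v, 0.5) calls are random, so the exact components that get inverted (and the resulting correlation) will differ from run to run:

v <- 1:7            #the small example vector (1,2,3,4,5,6,7)
ce(v, 1.0)          #every sign inverted
ce(v, 0.0)          #no sign inverted
ce(v, 0.5)          #roughly half the signs inverted, chosen at random
cor(v, ce(v, 0.0))  #exactly 1
cor(v, ce(v, 1.0))  #exactly -1
cor(v, ce(v, 0.5))  #typically somewhere between -1 and 1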

The above table shows that when we invert some components of the vector but not all, the correlation lies in the range (-1.0, 1.0). Below we examine how the correlation changes when we vary p from 0 to 1 in small steps. We do so by making v a sufficiently long, uniformly distributed (U(0,1)) vector. We choose v to have many components (length = 100) so that even small changes in p can make a perceptible difference in the computed values. In the rest of this article, we find the correlation between v and ce(v, p) [3] for different values of p and differently constructed v's, and we identify situations where the correlation varies linearly and predictably as the value of p is changed. This systematic change in correlation values with p will allow us to use p as an interpretation of correlation.

Correlation decreases as classification error increases

The following R code initializes v as a uniformly distributed vector and plots cor(v, ce(v, p)) for different values of p. 

v = runif(VECTOR_LENGTH) #v belongs to U(0,1)
correlations = sapply(p, FUN=corrAtP) 
plotCorrelationVsP(p, correlations, "Classification error as a fraction", "Corr (v distributed U(0,1))")

The graph shows that at p = 0, when the two vectors are identical, the correlation is 1.0, and at p = 1, when the vectors are mutually fully inverted, the correlation is -1. For the intermediate values of p, the correlation declines from 1 toward -1, but in a jagged manner. The jaggedness may be due to the probabilistic nature of the experiment. We now attempt to reduce this jaggedness by repeating the experiment a large number of times (200) for each value of p and looking at the average value of the correlation. The following graph shows the average correlation between v and ce(v, p) for each value of p.

Average correlation decreases smoothly as classification error increases

The graph below shows the average value of correlation at each value of p for uniformly distributed vectors.

v = runif(VECTOR_LENGTH) #v belongs to U(0,1)
correlations=sapply(p, FUN=avCorrAtP)
plotCorrelationVsP(p, correlations, "Classification error as a fraction", "Ave Corr (v distributed U(0,1))")

Now there is an apparent monotonic decline in the average correlation with increasing values of p, but the decline is not linear, and we do not know whether the graph follows any understandable function. For this graph to be useful for interpretation, we should either be able to quantify the curvature in the graph or remove it entirely. In our formulation so far, v is drawn from U(0, 1), which is not symmetric about 0. What if we try the normal distribution (N(0,1)) instead [4]? Luckily, the normal distribution provides the easy escape we are looking for.

Average correlation decreases linearly when v is from N(0,1)

The next plot shows the average correlation for each value of p for normally distributed v.

v <- rnorm(VECTOR_LENGTH, 0, 1) #same as the above but v distributed N(0,1)
correlations= sapply(p, FUN=avCorrAtP)
plotCorrelationVsP(p, correlations, "Classification error as a fraction", "Ave Corr (v distributed N(0,1))")

With symmetrically and normally distributed v, the expected correlation between vector v and ce(v, p) is 1-2p.
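
A heuristic sketch of why the line is 1 - 2p, under the assumption that v is (approximately) centered at 0, which is what symmetry about 0 buys us for long vectors: write the i-th component of ce(v, p) as s_i * v_i, where s_i is +1 with probability 1-p and -1 with probability p, independently of everything else. Then, approximating the sample covariance by the uncentered average of products,

\[
\mathbb{E}\big[\operatorname{cov}(v,\; ce(v,p))\big]
  \;\approx\; \frac{1}{n}\sum_{i=1}^{n} \mathbb{E}[s_i]\, v_i^2
  \;=\; (1-2p)\,\frac{1}{n}\sum_{i=1}^{n} v_i^2
  \;\approx\; (1-2p)\,\operatorname{var}(v).
\]

Because sign flips leave magnitudes unchanged, the standard deviation of ce(v, p) is approximately that of v, so the expected correlation is approximately 1 - 2p. When v is not centered at 0, as with U(0,1), the means no longer cancel, and the relationship bends away from a straight line, which is the curvature we saw in the previous graph.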

A thumb-rule to interpret correlation

Recall that we interpret sign inversion as a classification error. We can now formulate the following thumb-rule.

As long as the classification vector is normally distributed, every one-point (0.01) increase in classification error decreases the correlation between the correct and incorrect predictions by two points (0.02). Equivalently, a correlation value c corresponds to a classification error of (1-c)/2 (expressed as a fraction).
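
This thumb-rule is easy to capture in code. The helper below, errorFromCor, is a hypothetical convenience function and is not part of the program in the appendix:

#convert a correlation value c into the corresponding classification-error fraction (1 - c)/2
errorFromCor <- function(c) {
  (1 - c) / 2
}

errorFromCor(0.8)   #0.10, i.e., about 10% classification error
errorFromCor(-0.6)  #0.80, i.e., about 80% classification error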

Linear relationship holds for many symmetric distributions

We will now show that the above linear pattern also holds for many other symmetric distributions, not just the normal distribution. We conjecture that it might hold for any symmetric distribution. Below we show the linear decline for the t-distribution, U(-0.5, 0.5), and an unknown but symmetric distribution.

Linearity holds for t-distribution

The code and graph below show that correlation declines linearly with classification error when the original vector follows the t-distribution.

v <- rt(VECTOR_LENGTH, df = 5) #same experiment as above but with v following t-distribution
correlations = sapply(p, FUN=avCorrAtP)
plotCorrelationVsP(p, correlations, "Classification error as a fraction", "Ave Corr (v distributed as t)")

Linearity holds for symmetric uniform distribution

We now show that correlation declines linearly with classification error when the original vector follows a uniform but symmetric distribution.

v <- runif(VECTOR_LENGTH) - .5  #same as above but v distributed U(-0.5, 0.5)
correlations = sapply(p, FUN=avCorrAtP)
plotCorrelationVsP(p, correlations, "Classification error as a fraction", "Ave Corr (v distributed U(-0.5, 0.5))")

Linearity holds for an unknown but symmetric distribution

The following code and graph show that correlation declines linearly with classification error when the original vector follows an unknown symmetric distribution.

v <- unknownSymmetricDist(20, VECTOR_LENGTH) #same as above but v follows an unknown but symmetric distribution
correlations = sapply(p, FUN=avCorrAtP)
plotCorrelationVsP(p, correlations, "Classification error as a fraction", "Ave Corr (v dist. unknown symmetric)")

Discussion

The above experiments seem to indicate that there is an inverse linear relationship between correlation and classification error when the original vector follows a symmetric distribution. Given this relationship, a correlation value c can be roughly linked to a classification error of (1-c)/2 (expressed as a fraction). With this interpretation, the conventionally accepted high correlation value (0.8) links to about 10% classification error, and the conventionally accepted low correlation value (0.3) links to a classification error as high as 35%.

We found that this interpretation held for many different symmetric distributions including one that was symmetric by design but unknown otherwise. We believe that the above interpretation may hold for many symmetrically distributed vectors, but this belief needs to be proved (or disproved) analytically. At this moment we think that there is enough empirical evidence to use the interpretive rule that correlation c, roughly, relates to (1-c)/2 classification error.

Since the correlation between vectors is not sensitive to the scaling and translation of the vectors, and some asymmetric distributions can be made symmetric (or less asymmetric) by translation, it would be instructive to investigate the extent to which this interpretation of correlation holds even when the involved distributions are not symmetric.
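
That insensitivity is easy to check directly. The vectors below are arbitrary and only illustrate that correlation is unchanged by scaling and translation:

a <- rnorm(100)
b <- rnorm(100)
cor(a, b)           #some baseline correlation
cor(3 * a + 10, b)  #the same value, up to floating-point error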

It is also entirely possible to create other, similar interpretations. For example, one might compute correlations between a vector and its rotations about its endpoints or its center. Some of these alternative formulations may yield other useful patterns.

Appendix

The entire R code used in this document is reproduced below:

knitr::opts_chunk$set(echo = FALSE)
library(ggplot2)
library(knitr)
library("kableExtra")
library("lattice")
set.seed(1089)
options(digits=3)
VECTOR_LENGTH = 100
EXPERIMENT_LENGTH = 200

p <- 0:100 * .01 #using probabilities in increments of .01

#introduce p% classification error in a perfect vector
ce <- function(v, p) {
  #flip sign probabilistically
  sapply(v, FUN = function(x) {if (runif(1) < p) -x else x})
}

#Flip signs of components of v with probability p and return correlation 
corrAtP <- function(p) {
  return(cor(v, ce(v, p)))
}

#flip signs as the above and return average correlation after many trials
avCorrAtP <- function(p) {
  return(sum(sapply(1:EXPERIMENT_LENGTH, FUN=function(x){return(corrAtP(p))}))/EXPERIMENT_LENGTH)
}

#function plots a simple graph
plotCorrelationVsP <- function(x, y, xlab, ylab) {
  ggplot(data=data.frame("ClassificationError" = x, "AverageCorrelation"=y), 
         aes(x=ClassificationError, y=AverageCorrelation, group=1)) +
         geom_line() +
         geom_point() +
         labs(x = xlab, y = ylab)
}

#print a vector a bit more nicely.
prettyPrint <- function(v) {paste("(", paste(v, sep=",", collapse = ","), ")", sep="")}

#generate a random but symmetric distribution. Can be made more efficient.
unknownSymmetricDist <- function(b, vectorLength) {
    v <- runif(b) #random vector
    ret <- c() #returned vector
    target <- 1
    while (length(ret) < vectorLength) {
        #histogram count proportional to random frequency
        if (runif(1) < v[target]) {
            ret <- c(ret, target, -target) #for symmetry add a point and its reflection
        }
        target <- target + 1
        if (target > b) {
            target <- 1
        }
    }
    ret / b #scale to [-1, 1]
}


v = runif(VECTOR_LENGTH) #v belongs to U(0,1)
correlations = sapply(p, FUN=corrAtP) 
plotCorrelationVsP(p, correlations, "Classification Error as Fraction", "Corr (v distributed U(0,1))")

v = runif(VECTOR_LENGTH) #v belongs to U(0,1)
correlations=sapply(p, FUN=avCorrAtP)
plotCorrelationVsP(p, correlations, "Classification Error as Fraction", "Ave Corr (v distributed U(0,1))")

v <- rnorm(VECTOR_LENGTH, 0, 1) #same as the above but v distributed N(0,1)
correlations= sapply(p, FUN=avCorrAtP)
plotCorrelationVsP(p, correlations, "Classification Error as Fraction", "Ave Corr (v distributed N(0,1))")

v <- rt(VECTOR_LENGTH, df = 5) #same experiment as above but with v following t-distribution
correlations = sapply(p, FUN=avCorrAtP)
plotCorrelationVsP(p, correlations, "Classification Error as Fraction", "Ave Corr (v distributed as t)")

v <- runif(VECTOR_LENGTH) - .5  #same as above but v distributed U(-0.5, 0.5)
correlations = sapply(p, FUN=avCorrAtP)
plotCorrelationVsP(p, correlations, "Classification Error as Fraction", "Ave Corr (v distributed U(-0.5, 0.5))")

v <- unknownSymmetricDist(20, VECTOR_LENGTH) #same as above but v follows an unknown but symmetric distribution
correlations = sapply(p, FUN=avCorrAtP)
plotCorrelationVsP(p, correlations, "Classification Error as Fraction", "Ave Corr (v dist. unknown symmetric)")

Notes


  1. Negative correlation values have a similar but inverse interpretation.

  2. This article uses the fraction p and the percentage 100*p interchangeably and lets the reader interpret the meaning correctly depending on the context.

  3. Even if v is a vector with positive variance, there is no guarantee that ce(v, p) also has positive variance. However, for randomly distributed and sufficiently long vectors, we can safely assume that both v and ce(v, p) have finite, non-zero variances.

  4. The normal distribution N(0,1) is symmetric about 0, while the uniform distribution U(0,1) is not.