Measures of Association

Author

Minerva Mukhopadhyay

Lecture 1

The methods discussed in the previous module deal with a single variable. When more than one variable is available in a data set, measures are needed to quantify the associations among the variables. For example, consider Karl Pearson’s experiment to find the relation between father’s height and son’s height.

Pearson measured the heights of 1078 fathers and of their sons at maturity. The scatter plot of father’s height (\(x\)-axis) against son’s height (\(y\)-axis) is given below:

This figure is taken from the book ‘Statistics’ by Freedman, Pisani and Purves.

Although the scatter plot shows an association between father’s height and son’s height, it is important to know the strength of the association. If the association is strong enough, then knowing one of these variables helps considerably in predicting the other.

Pearson’s correlation coefficient:

  • When both the variables under consideration, say \(x\) and \(y\), are of continuous type, and the association between \(x\) and \(y\) appears to be linear (from the scatter diagram), then the correlation coefficient is an appropriate measure of association between the two variables.

  • Let \(\{(x_{i},y_{i}); i=1,\ldots,n\}\) be a bivariate sample of size \(n\) on a pair of variables \(x\) and \(y\). The correlation coefficient between \(x\) and \(y\) is denoted by \(r_{x,y}\) or \(r_{y,x}\), and is given by \[ r_{x,y} = \frac{\text{Cov}(x,y)}{\text{SD}(x)\times \text{SD}(y)}, \quad \text{Cov}(x,y) = \frac{1}{n} \sum_{i=1}^{n} (x_{i}-\bar{x})(y_{i}-\bar{y}). \]

  • How can \(r_{x,y}\) capture the linear relation between \(x\) and \(y\)?

    This figure is taken from the book ‘Fundamentals of Statistics’, Volume 1.

    • Observe that if the association is linear and positive (i.e., \(x\) increases when \(y\) increases, and vice versa), then most of the products \((x_{i}-\bar{x})(y_{i}-\bar{y})\) are positive, and \(\text{Cov}(x,y)\) is large and positive. If the association is linear and negative (i.e., \(x\) increases when \(y\) decreases, and vice versa), then most of these products are negative, and \(\text{Cov}(x,y)\) is large in magnitude and negative.
  • How large or how small?

    • Referring to the Cauchy-Schwarz inequality, one can show that \(|\text{Cov}(x,y)| \leq \text{SD}(x)\times \text{SD}(y)\). Therefore \[ -1 \leq r_{x,y} \leq 1. \]

  • Pearson’s correlation coefficient is also called the product-moment correlation coefficient. A small computational sketch is given below.
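
A minimal computational sketch of the definition above (assuming Python with NumPy; the paired values below are made up for illustration):

```python
import numpy as np

# Hypothetical paired measurements (e.g., father's and son's heights in inches)
x = np.array([65.0, 67.0, 68.5, 70.0, 72.0, 74.0])
y = np.array([66.5, 67.0, 69.0, 69.5, 71.0, 73.5])

# Covariance and standard deviations with the 1/n convention used in the notes
n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / n
sd_x = np.sqrt(np.sum((x - x.mean()) ** 2) / n)
sd_y = np.sqrt(np.sum((y - y.mean()) ** 2) / n)

r_xy = cov_xy / (sd_x * sd_y)
print(r_xy)                      # lies in [-1, 1] by the Cauchy-Schwarz inequality
print(np.corrcoef(x, y)[0, 1])   # NumPy's built-in value agrees (the 1/n factors cancel)
```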

Properties of correlation coefficient:

  1. As evident from the formula, \(r_{x,y}\) is a pure number: it is free of the units in which \(x\) and \(y\) are measured.

  2. Let \(x^{\star}=a+bx\) and \(y^{\star}=c+dy\) with \(b, d \neq 0\). Then \(r_{x^{\star},y^{\star}} = r_{x,y}\) if \(b\) and \(d\) are of the same sign, and \(r_{x^{\star},y^{\star}} = - r_{x,y}\) if \(b\) and \(d\) are of opposite signs.

  3. [Interpretation] One way to look at the correlation coefficient is through the standardized data vectors \({\bf x}^{\star}=\left[(x_{1}-\bar{x})/SD(x), \ldots, (x_{n}-\bar{x})/SD(x) \right]^{\top}\) and \({\bf y}^{\star}=\left[(y_{1}-\bar{y})/SD(y), \ldots, (y_{n}-\bar{y})/SD(y) \right]^{\top}\): \(r_{x,y}\) equals \(1/n\) times their dot product. Therefore, \(r_{x,y}\) can be interpreted as \(\text{cosine}(\theta)\), where \(\theta\) is the angle between \({\bf x}^{\star}\) and \({\bf y}^{\star}\). (Why? A short sketch follows this list.)

  4. \(|r_{x,y}|=1\) implies there exists a perfect linear agreement between \(x\) and \(y\), i.e., there exists some \(a\) and \(b\neq 0\) such that \(y_{i}=a+bx_{i}\) (or, \(x_{i}=a+by_{i}\)) for each \(i=1,\ldots,n\). If the association is positive, i.e., \(b>0\), then \(r_{x,y}=1\), and if \(b<0\) then \(r_{x,y}=-1\). (Why?)

    \(r_{x,y}=0\) indicates the absence of a linear relation between \(x\) and \(y\); it does not rule out a nonlinear relation between them.
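
A short sketch of the cosine interpretation in property 3: each standardized vector has squared length \[ \|{\bf x}^{\star}\|^{2} = \sum_{i=1}^{n}\frac{(x_{i}-\bar{x})^{2}}{SD(x)^{2}} = n = \|{\bf y}^{\star}\|^{2}, \] and hence \[ \text{cosine}(\theta) = \frac{\langle {\bf x}^{\star}, {\bf y}^{\star}\rangle}{\|{\bf x}^{\star}\|\, \|{\bf y}^{\star}\|} = \frac{1}{n}\sum_{i=1}^{n}\frac{(x_{i}-\bar{x})(y_{i}-\bar{y})}{SD(x)\, SD(y)} = \frac{\text{Cov}(x,y)}{SD(x)\times SD(y)} = r_{x,y}. \]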

Linearity:

If the relation between \(x\) and \(y\) is not linear, or if there are outlying influential observations in the data, then \(r_{x,y}\) may fail to reflect the association between \(x\) and \(y\).
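
The following sketch (assuming Python with NumPy; the numbers are simulated) illustrates how a single outlying, influential observation can distort \(r_{x,y}\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two variables generated independently, so there is no systematic association
x = rng.normal(size=50)
y = rng.normal(size=50)
print(np.corrcoef(x, y)[0, 1])            # typically small in magnitude

# One extreme, influential point added to both variables
x_out = np.append(x, 20.0)
y_out = np.append(y, 20.0)
print(np.corrcoef(x_out, y_out)[0, 1])    # now large and positive, driven by a single point
```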

Note:

Example: For school children, shoe size is strongly associated with reading skills. However, learning new words does not make the feet bigger. Instead, there is a third factor involved: age. As the children get older, they learn to read better and they outgrow their shoes.

In this example, the confounder is easy to spot. Often, this is not so easy. And the arithmetic of the correlation coefficient does not protect you against third factors. (Freedman, Pisani and Purves, 2007)

Lecture 2: Other Measures of Association

There are situations when the measurements of the samples on \(x\) (and/or \(y\)) are not available (or not meaningful), and only the relative orderings are available (or meaningful).

  • Example: Suppose you are interested in finding the association between the performance of a class of students in two different courses, and you have data on the marks obtained by the students in the end-semester exams. The grading patterns of the instructors can be quite different, and the exact marks are only a relative measure of the students’ performance. Even if there is a nearly perfect association in the performance of the students, Pearson’s correlation coefficient may not be able to reveal it because of the linearity constraint.

Spearman’s rank correlation coefficient:

Suppose \(n\) subjects/individuals are ordered with respect to two attributes \(A\) and \(B\), in the orders \({\bf u}=(u_{1},\ldots, u_{n})^{\top}\) and \({\bf v}=(v_{1},\ldots, v_{n})^{\top}\), respectively. Clearly \({\bf u}\) and \({\bf v}\) are permutations of the integers \(\{1,\ldots,n\}\). Spearman’s rank correlation coefficient, denoted by \(r_{R}\), is Pearson’s correlation coefficient between \({\bf u}\) and \({\bf v}\): \[ r_{R} = \frac{\text{cov}(u,v)}{SD(u)\times SD(v)}.\]
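
A minimal sketch of this definition (assuming Python with NumPy; the marks below are made up): the two mark lists are converted to ranks, and Pearson’s formula is applied to the rank vectors.

```python
import numpy as np

# Hypothetical end-semester marks of 6 students in two courses (no ties)
marks_A = np.array([78, 64, 91, 55, 83, 70])
marks_B = np.array([72, 60, 88, 58, 90, 65])

def ranks(a):
    """Rank 1 for the smallest value, ..., n for the largest (no ties assumed)."""
    order = np.argsort(a)
    r = np.empty(len(a), dtype=float)
    r[order] = np.arange(1, len(a) + 1)
    return r

u = ranks(marks_A)
v = ranks(marks_B)

# Spearman's rank correlation = Pearson's correlation between the rank vectors
r_R = np.corrcoef(u, v)[0, 1]
print(r_R)
```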

Properties of Spearman’s rank correlation:

  1. Let \(d_{i}\) be the difference between the rank of individual \(i\) w.r.t. \(A\) and that w.r.t. \(B\), i.e., \(d_{i}=(u_{i}-v_{i})\). Then \[ r_{R} = 1 - \frac{6\sum_{i=1}^{n} d_{i}^{2}}{n(n^{2}-1)}. \] (Why? A derivation sketch is given after this list.)

  2. The interpretation of Spearman’s rank correlation is clear from the above formulation. If there is a perfect agreement between the \(u\)-series and the \(v\)-series, then \(d_{i}=0\) for all \(i=1,\ldots,n\). Consequently, \(r_{R}=1\). If there is a perfect disagreement between the \(u\)-series and the \(v\)-series, i.e., the ranking of the \(u\)-series is the complete reverse of that of the \(v\)-series, then \(u_{i}=n+1-v_{i}\) and \(d_{i}=(n+1-2v_{i})\). In that case, \(r_{R}=-1\). (Why?)
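
A sketch of why the formula in property 1 holds (no ties): since \({\bf u}\) and \({\bf v}\) are permutations of \(\{1,\ldots,n\}\), \[ \bar{u}=\bar{v}=\frac{n+1}{2}, \qquad \text{var}(u)=\text{var}(v)=\frac{1}{n}\sum_{i=1}^{n} i^{2} - \left(\frac{n+1}{2}\right)^{2} = \frac{n^{2}-1}{12}. \] Moreover, as \(\bar{u}=\bar{v}\), we have \(d_{i}=(u_{i}-\bar{u})-(v_{i}-\bar{v})\), so that \[ \frac{1}{n}\sum_{i=1}^{n} d_{i}^{2} = \text{var}(u)+\text{var}(v)-2\,\text{cov}(u,v), \qquad \text{i.e.,} \qquad \text{cov}(u,v) = \frac{n^{2}-1}{12} - \frac{1}{2n}\sum_{i=1}^{n} d_{i}^{2}. \] Dividing by \(\sqrt{\text{var}(u)\,\text{var}(v)} = (n^{2}-1)/12\) gives \(r_{R} = 1 - 6\sum_{i} d_{i}^{2}/\{n(n^{2}-1)\}\).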

When the ranks are tied:

  • Suppose there is a tie of length \(k\) in the \(u\)-series and there are no ties in the \(v\)-series. The \(k\) individuals receiving the same rank follow \(r\) other individuals. If each of these \(k\) individuals receives a separate rank, then the mean of the ranks would be \[ \frac{1}{k}\{ (r+1)+ \ldots + (r+k)\} = r + \frac{(k+1)}{2}. \] In order to ensure that the tie does not affect the mean of \(u\), one may assign the average rank \(r+(k+1)/2\) to each of these \(k\) individuals. Then \(\bar{u}=(n+1)/2\) remains unaffected.

  • This assignment will, however, change (reduce) the standard deviation, as expected; the difference between the untied and the tied variances comes from the difference in \(n^{-1}\sum_{i=1}^{n} u_{i}^{2}\). The variance in the tied case becomes \[ \text{var} (u) = \frac{n^{2}-1}{12} - \frac{k(k^{2}-1)}{12n}. \]

  • Consequently, the covariance between \(u\)-series and \(v\)-series will be changed as follows: \[ \text{cov}(u,v) = \frac{1}{2}\text{var}(u)+ \frac{1}{2}\text{var}(v) - \frac{1}{2n}\sum d_{i}^{2} = \frac{n^{2}-1}{12} - \frac{k(k^{2}-1)}{24n} - \frac{1}{2n}\sum_{i=1}^{n} d_{i}^{2}.\]

  • In general if there are \(s\) ties in the \(u\)-series of lengths \(k_{1}, \ldots, k_{s}\), and \(t\) ties in the \(v\)-series of lengths \(l_{1}, \ldots, l_{t}\), and an average rank is provided in each case, then \[ \text{var} (u) = \frac{n^{2}-1}{12} - \frac{1}{12n}\sum_{j=1}^{s}k_{j}(k_{j}^{2}-1), ~ \text{var} (v) = \frac{n^{2}-1}{12} - \frac{1}{12n}\sum_{j=1}^{t}l_{j}(l_{j}^{2}-1), \] and \[ \text{cov}(u,v) = \frac{n^{2}-1}{12} - \frac{1}{24n}\sum_{j=1}^{s}k_{j}(k_{j}^{2}-1) - \frac{1}{24n}\sum_{j=1}^{t}l_{j}(l_{j}^{2}-1) - \frac{1}{2n}\sum_{i=1}^{n} d_{i}^{2}. \]
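
The following sketch (assuming Python with NumPy; the two score lists are made up and contain ties) assigns average ranks and checks the direct computations of \(\text{var}(u)\), \(\text{var}(v)\) and \(\text{cov}(u,v)\) against the closed-form expressions above:

```python
import numpy as np

def average_ranks(a):
    """Ranks 1..n, with tied values receiving the average of the ranks they cover."""
    a = np.asarray(a, dtype=float)
    order = np.argsort(a, kind="stable")
    r = np.empty(len(a))
    i = 0
    while i < len(a):
        j = i
        while j + 1 < len(a) and a[order[j + 1]] == a[order[i]]:
            j += 1
        r[order[i:j + 1]] = (i + j) / 2 + 1   # average of ranks i+1, ..., j+1
        i = j + 1
    return r

# Made-up scores: one tie of length 2 in the first list, one tie of length 3 in the second
u = average_ranks([10, 12, 12, 15, 18, 20])   # ranks: 1, 2.5, 2.5, 4, 5, 6
v = average_ranks([ 7,  9,  9,  9, 14, 11])   # ranks: 1, 3, 3, 3, 6, 5
n, d = len(u), u - v

# Direct computation (1/n convention, as in the notes)
var_u, var_v = u.var(), v.var()
cov_uv = np.mean((u - u.mean()) * (v - v.mean()))

# Closed-form expressions with tie corrections k(k^2 - 1)/(12n) and l(l^2 - 1)/(12n)
tie_u = 2 * (2**2 - 1) / (12 * n)
tie_v = 3 * (3**2 - 1) / (12 * n)
print(np.isclose(var_u, (n**2 - 1) / 12 - tie_u))
print(np.isclose(var_v, (n**2 - 1) / 12 - tie_v))
print(np.isclose(cov_uv, (n**2 - 1) / 12 - tie_u / 2 - tie_v / 2 - np.sum(d**2) / (2 * n)))
print(cov_uv / np.sqrt(var_u * var_v))        # Spearman's r_R in the presence of ties
```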

Kendall’s Tau:

  • Another tool for measuring association between two ordinal variables, or two pairs of ranks is Kendall’s \(\tau\).

  • As before, let \({\bf u}=(u_{1},\ldots, u_{n})^{\top}\) and \({\bf v}=(v_{1},\ldots, v_{n})^{\top}\) be a pair of rankings, or \(n\) paired observations on two ordinal variables.

  • For each possible pair of individuals, consider the order of this pair in the two rankings. If the pair appears in the same order, allot it a score \(+1\); if it appears in the reverse order, allot a score \(-1\). For example, if we consider the pair of individuals \((1,2)\), then we look at the order of \((u_1,u_2)\) and \((v_{1},v_{2})\). If \(u_{1}>u_{2}\) and \(v_{1}>v_{2}\), or \(u_{1}<u_{2}\) and \(v_{1}<v_{2}\), then assign the score \(+1\). However, if \(u_{1}>u_{2}\) and \(v_{1}<v_{2}\), or \(u_{1}<u_{2}\) and \(v_{1}>v_{2}\), then assign the score \(-1\).

  • As there are \(\binom{n}{2}\) possible pairs, the maximum possible total score is \(\binom{n}{2}\). Kendall’s \(\tau\) is defined as \[ \tau = \frac{\text{total score}}{\binom{n}{2}}.\] (A computational sketch, covering tied ranks as well, is given after the discussion of ties below.)

  • Obviously, \(\tau=1\) if there is a perfect agreement, and \(\tau=-1\) if there is a perfect disagreement between \(u\) and \(v\).

  • Alternative method: The calculation of \(\tau\) can be made easy in the following way:

    Suppose the order of the individuals is rearranged in such a way that the ranking of the \(u\)-series is in the natural order \(\{1,\ldots,n\}\). Now consider the corresponding ranks in the \(v\)-series. Among the \(\binom{n}{2}\) pairs \(\{(v_{i},v_{j}), ~i>j\}\) in the \(v\)-series, if \(v_{i}>v_{j}\) then assign the score \(+1\); otherwise assign the score \(-1\). Let \(P\) be the total number of positive pairs (pairs with score \(+1\)) and \(Q\) the number of negative pairs; then \[ \tau = \frac{\text{total score}}{\binom{n}{2}} = \frac{P-Q}{\binom{n}{2}} = 1 - \frac{2Q}{\binom{n}{2}}= 1 - \frac{2Q}{P+Q}.\]

  • Kendall’s \(\tau\) as Pearson’s correlation coefficient: Let \(a_{i,j}\) and \(b_{i,j}\), \(i,j=1,\ldots,n\), \(i<j\), be defined as follows: \[ a_{i,j} = \left\{ \begin{array}{ll} +1 \quad \text{if }~ u_{i} > u_{j} \\ -1 \quad \text{if }~ u_{i} < u_{j} \end{array}\right. , \qquad \text{and} \qquad b_{i,j} = \left\{ \begin{array}{ll} +1 \quad \text{if }~ v_{i} > v_{j} \\ -1 \quad \text{if }~ v_{i} < v_{j} \end{array}\right. . \] Then it can be verified that \[ \tau = \frac{\sum_{i<j} a_{i,j} b_{i,j}}{\sqrt{\sum_{i<j} a_{i,j}^{2} ~\times ~\sum_{i<j} b_{i,j}^{2}}}.\]

When the ranks are tied:

  • Suppose there is a tie of length \(k\) in the \(u\)-series. Then we consider \(a_{i,j}=0\) if \(u_{i}=u_{j}\) in the above formulation.

  • In general if there are \(s\) ties in the \(u\)-series of lengths \(k_{1}, \ldots, k_{s}\), and \(t\) ties in the \(v\)-series of lengths \(l_{1}, \ldots, l_{t}\), then \[ \tau = \frac{\text{total score}}{\sqrt{\left(\binom{n}{2} - T_{u} \right)\times \left(\binom{n}{2} - T_{v} \right)}}, \qquad \text{where}\quad T_{u} =\frac{1}{2}\sum_{j=1}^{s}k_{j}(k_{j}-1), \quad \text{and}\quad T_{v}= \frac{1}{2}\sum_{j=1}^{t}l_{j}(l_{j}-1). \]
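
A small computational sketch of Kendall’s \(\tau\) (assuming Python with NumPy; the rankings are made up), using the pair scores defined above with \(a_{i,j}=0\) for tied pairs and the tie-corrected denominator:

```python
import numpy as np

def kendall_tau(u, v):
    """Kendall's tau with the tie correction described above."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    n = len(u)
    n_pairs = n * (n - 1) // 2
    score = 0      # sum of a_{ij} * b_{ij} over pairs i < j
    tied_u = 0     # number of pairs tied in u (equals T_u above)
    tied_v = 0     # number of pairs tied in v (equals T_v above)
    for i in range(n):
        for j in range(i + 1, n):
            a = np.sign(u[i] - u[j])   # +1, -1, or 0 if tied
            b = np.sign(v[i] - v[j])
            score += a * b
            tied_u += (a == 0)
            tied_v += (b == 0)
    return score / np.sqrt((n_pairs - tied_u) * (n_pairs - tied_v))

# Without ties this reduces to (P - Q) / C(n, 2)
print(kendall_tau([1, 2, 3, 4, 5], [2, 1, 3, 5, 4]))   # some disagreement, so tau < 1
print(kendall_tau([1, 2, 2, 4], [1, 3, 2, 4]))         # a tie in the first ranking
```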

Comparison of Pearson’s correlation coefficient and rank correlation coefficient:

  1. Suppose the relation between \(x\) and \(y\) is not linear, but monotone, i.e., \(x_{i}>x_{j} \Leftrightarrow y_{i}>y_{j}\); for example, \(y=x^{2}\) with \(x>0\). In such cases rank correlations are more effective in identifying the association between \(x\) and \(y\) (see the numerical sketch after this list).

  2. In a linear case, rank correlations are less powerful than \(r_{x,y}\) as they use less information in their calculations.

  3. The sampling distributions of the rank correlations are known and do not require any assumption on the underlying distribution of \(x\) and \(y\).
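
A numerical sketch of point 1 (assuming Python with NumPy), using the example \(y=x^{2}\) with \(x>0\):

```python
import numpy as np

x = np.linspace(1.0, 10.0, 30)   # x > 0
y = x ** 2                       # perfectly monotone, but not linear, in x

# Pearson's r is high but strictly less than 1: the relation is not linear
print(np.corrcoef(x, y)[0, 1])

# The orderings of x and y coincide, so d_i = 0 for all i and r_R = 1;
# every pair is concordant, so Kendall's tau = 1 as well
u = np.argsort(np.argsort(x)) + 1    # ranks of x
v = np.argsort(np.argsort(y)) + 1    # ranks of y
print(np.all(u == v), 1 - 6 * np.sum((u - v) ** 2) / (len(x) * (len(x) ** 2 - 1)))
```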

Comparison of Spearman’s rank correlation and Kendall’s rank correlation:

  1. In general, Kendall’s correlation is more robust (in the presence of outliers) and more efficient than Spearman’s rank correlation.

  2. Spearman’s rho is usually larger in absolute value than Kendall’s tau.

  3. Kendall’s \(\tau\) enjoys better small-sample properties than Spearman’s rank correlation. However, in the presence of ties the performance of Spearman’s rank correlation is better. (Reference)

Association between binary variables:

Suppose we have a \(2\times 2\) contingency table containing the summarized data on two binary variables, as follows: \[ \begin{array}{l|cc|c} & \text{Cured} & \text{Not cured} & \text{Total} \\ \hline \text{New drug} & a & b & a+b \\ \text{Standard drug} & c & d & c+d \\ \hline \text{Total} & a+c & b+d & n \end{array} \]

To find the association between the drug and the recovery of the patients, one needs to devise a measure which takes a higher (or lower) value if \(a\) and \(d\) are large compared to \(b\) and \(c\), and a lower (or higher) value if \(a\) and \(d\) are small compared to \(b\) and \(c\). Further, the measure should take its highest possible value if at least one of \(b\) and \(c\) is zero, and its lowest (or highest) possible value if at least one of \(a\) and \(d\) is zero.

Many measures can be formed which satisfy these criteria. One of the most popular measures is \[ Q = \frac{ad -bc }{ad+bc}.\] Note that \(-1\leq Q \leq 1\).
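
A minimal sketch computing \(Q\) for a hypothetical \(2\times 2\) table (assuming Python; the counts are made up):

```python
# Hypothetical 2x2 table:        Cured   Not cured
#   New drug                      a=40        b=10
#   Standard drug                 c=25        d=25
a, b, c, d = 40, 10, 25, 25

Q = (a * d - b * c) / (a * d + b * c)
print(Q)   # positive: the new drug is associated with a higher cure rate

# Boundary behaviour: Q = 1 if b or c is 0, and Q = -1 if a or d is 0
print((30 * 20 - 0 * 15) / (30 * 20 + 0 * 15))   # b = 0  ->  Q = 1
```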

Empirical justification:

  • Suppose ‘State of the patient’ is a binary variable taking two possible values, \(1\) if cured and \(-1\) if not cured. Let the true proportion of the cured patients be \(\alpha\in(0,1)\).

  • Similarly, ‘Drug’ is a binary variable taking the value \(+1\) if it is ‘New’, and \(-1\) if it is ‘Standard’. Let the true proportion of the patients receiving the ‘New’ drug be \(\beta\in(0,1)\).

  • Writing \(n= a+b+c+d\), we expect the quantities \(a+c\), \(b+d\), \(a+b\) and \(c+d\) to be close to \(n\alpha\), \(n(1-\alpha)\), \(n\beta\) and \(n(1-\beta)\), respectively.

  • If the variables ‘State of the patient’ and ‘Drug’ are independent, then \(a\), \(b\), \(c\) and \(d\) must be close to \(n\alpha\beta\), \(n(1-\alpha)\beta\), \(n\alpha(1-\beta)\) and \(n(1-\alpha)(1-\beta)\), respectively.

  • Thus, if ‘State of the patient’ and ‘Drug’ are independent, then \(ad \approx n^{2}\alpha(1-\alpha)\beta(1-\beta) \approx bc\), and hence \(Q\approx 0\).

  • Further, in the case of complete association or disassociation, \(Q\) takes an extreme value (\(+1\) or \(-1\)).

Reference: Yule and Kendall (1911).

Lecture 3: Regression Analysis

Summary Statistics in Higher Dimension:

  • The data sets we see in practice are usually multivariate; for example, recall the Cars93 data set.

  • Let \({\bf x}=(x_{(1)},\ldots,x_{(p)})^{\top}\) be a vector of \(p\) features, where \(x_{(j)}\) is the \(j\)-th feature. Suppose we have a sample of size \(n\), \({\bf x}_{1}, \ldots, {\bf x}_{n}\). We can express the whole data set in matrix form as follows:

    \[ {\bf X} = \begin{bmatrix}x_{1,1}& \ldots & x_{p,1} \\ \vdots & \vdots & \vdots \\ x_{1,n}& \ldots & x_{p,n} \end{bmatrix}. \] The \(i\)-th row of \({\bf X}\), \({\bf x}_{i}\), contains the \(i\)-th sample, and the \(j\)-th column of \({\bf X}\), \({\bf x}_{(j)}\), contains the list of \(n\) samples on the \(j\)-th variable.

  • The measures used to summarize univariate data cannot always be directly extended to the multivariate setup. Some special cases will be considered here.

  • Mean: The arithmetic mean can directly be extended to higher dimensions in a natural way, as the mean of the \(p\) variables expressed in a vector form: \[ \bar{\bf x} ~~= ~~\begin{bmatrix} \bar{x}_{(1)} \\ \vdots \\ \bar{x}_{(p)} \end{bmatrix} ~~=~~ \frac{1}{n} \sum_{i=1}^{n} {\bf x}_{i}~~ =~~ \frac{1}{n} {\bf X}^{\top}{\bf 1}. \]

  • Variance-Covariance Matrix: In a similar manner, if we want to extend the idea of variance to higher dimensions, we obtain the following matrix: \[ \mathrm{var}({\bf X})~~ = ~~ \frac{1}{n} \sum_{i=1}^{n} ({\bf x}_{i} -\bar{\bf x})({\bf x}_{i} -\bar{\bf x})^{\top} ~~= ~~ \begin{bmatrix} \mathrm{var}(x_{(1)}) & \mathrm{cov}(x_{(1)},x_{(2)}) & \ldots & \mathrm{cov}(x_{(1)},x_{(p)}) \\ \vdots & \vdots & \vdots & \vdots \\ \mathrm{cov}(x_{(p)},x_{(1)}) & \mathrm{cov}(x_{(p)},x_{(2)}) & \ldots & \mathrm{var}(x_{(p)}) \end{bmatrix}. \]

    • The matrix \(\mathrm{var}({\bf X})\), termed the variance-covariance matrix, is one possible generalization of the univariate measure of dispersion (a small computational sketch is given at the end of this section). However, it is often convenient to have a single number to measure the multivariate scatter. Toward that end, two common measures are

      (i) generalized variance \(\det(\mathrm{var}({\bf X}))\),

      (ii) total variance \(\mathrm{trace}(\mathrm{var}({\bf X}))\).
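
A small computational sketch of these summaries (assuming Python with NumPy; the data matrix is made up):

```python
import numpy as np

# Made-up data matrix X: n = 5 samples (rows) on p = 3 variables (columns)
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.5, 1.0],
              [3.0, 3.5, 1.5],
              [4.0, 3.0, 2.5],
              [5.0, 5.0, 2.0]])
n, p = X.shape

# Mean vector: (1/n) X^T 1
x_bar = X.mean(axis=0)

# Variance-covariance matrix with the 1/n convention used in the notes
Xc = X - x_bar                      # centred data
S = (Xc.T @ Xc) / n                 # p x p matrix of variances and covariances

gen_var = np.linalg.det(S)          # generalized variance
tot_var = np.trace(S)               # total variance
print(x_bar, gen_var, tot_var, sep="\n")
```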