2022-02-17

Motivation

Let’s study our body measurements

Body measurements with R

(Heinz et al. 2003)

library(knitr)
library(tidyverse)
library(magrittr)
library(GGally)
library(gridExtra)
library(openintro)

# Loading the data
data(bdims)
# help(bdims)

Codebook

Expectation

We expect:

  • certain body measures to be “associated” or “correlated” with each other

    • bic_gi (bicep girth) vs for_gi (forearm girth)

We expect:

  • certain body measures to be independent

    • bia_di (biacromial diameter) vs thi_gi (thigh girth)

Sometimes there is doubt

For others, the only certain thing is: who knows -_-!

  • che_di (chest diameter) vs nav_gi (navel (abdominal) girth)

Let’s look at the data

Some plots

bdims %>% 
  ggplot(aes(bic_gi,for_gi)) +
  geom_point()+
  xlab("bic_gi: bicep girth")+
  ylab("for_gi: forearm girth") -> p1

bdims %>% 
  ggplot(aes(bia_di,thi_gi)) +
  geom_point()+
  xlab("bia_di: biacromial diameter")+
  ylab("thi_gi: thigh girth") -> p2

bdims %>% 
  ggplot(aes(che_di,nav_gi)) +
  geom_point()+
  xlab("che_di: chest diameter")+
  ylab("nav_gi: navel (abdominal) girth") -> p3

Let’s look at the data

grid.arrange(p1,p2,p3,nrow=1)

How do we measure these associations?

Recall

The distribution of a RV

  • cdf: \(F_X(x):=P(X \le x)\)
  • pdf: \(f_X\) is such that \(F_X(x)=\int_{-\infty}^x f_X(t)dt\)
  • Normal case \(X \sim N(\mu,\sigma^2)\) if \(f_X(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(x-\mu)^2}{2\sigma^2}}\)
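
As a quick sanity check (a sketch with arbitrary \(\mu\) and \(\sigma\), not taken from the data), the closed-form pdf above should match R’s built-in dnorm():

# Compare the hand-written normal pdf with dnorm() on a grid of x values
mu <- 1; sigma <- 2
x <- seq(-5, 7, by = 0.5)
manual <- 1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
all.equal(manual, dnorm(x, mean = mu, sd = sigma))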

Expectation of functions of RVs

  • \(E_{f_X}(g(X)):=\int_{-\infty}^{\infty} g(x)f_X(x)dx\)

Centrality and dispersion

  • \(E_{f_X}(X):=\int_{-\infty}^{\infty} xf_X(x)dx=\mu_X\)
  • \(Var(X)=E(X-\mu_X)^2=E(X^2)-\mu_X^2=\sigma_X^2\)
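
A small Monte Carlo sketch of these identities on simulated data (the parameters and sample size are arbitrary, not from bdims):

# Simulate X ~ N(3, 2^2) and check mu_X and Var(X) = E(X^2) - mu_X^2 numerically
set.seed(1)
x <- rnorm(1e5, mean = 3, sd = 2)
mean(x)                 # approximates mu_X = 3
mean(x^2) - mean(x)^2   # approximates Var(X) = 4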

Two RVs

Joint distribution

  • cdf:\(F_{X,Y}(x,y):=P(X\le x, Y\le y)\)
  • pdf: \(f_{X,Y}\) is such that \(F_{X,Y}(x,y)=\int_{-\infty}^y \int_{-\infty}^x f_{X,Y}(s,t)dsdt\)
  • Expectation: \(E(g(X,Y))=\int_{-\infty}^\infty \int_{-\infty}^\infty g(x,y) f_{X,Y}(x,y)dxdy\)
  • Conditional distribution: \(f_{Y|X}(y|x):=\frac{f_{X,Y}(x,y)}{f_{X}(x)}\)
  • Independence: \(f_{X,Y}(x,y)=f_{X}(x)f_{Y}(y)\) \(\Rightarrow\) \(f_{Y|X}(y|x)=f_{Y}(y)\)

Covariance (today’s business)

  • \(X \sim f_{X}(x)\), \(Y \sim f_{Y}(y)\), and \((X,Y) \sim f_{X,Y}(x,y)\)
  • \(E_{f_X}(X)=\mu_X\), \(E_{f_Y}(Y)=\mu_Y\), and \(E_{f_{XY}}(XY)=E(XY)\) (marginal vs joint)
  • Define
  • \[Cov(X,Y):=E_{f_{XY}}\{(X-\mu_X)(Y-\mu_Y)\}=E_{f_{XY}}(XY)-\mu_X\mu_Y\]
  • Exercise 1: Prove the last equality without getting rid of the subscript in \(E_{f_{XY}}(\cdot)\), i.e., along the way, show that \(E_{f_{XY}}(Y)=\mu_Y\) and then \(E_{f_{XY}}(X)=\mu_X\).
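
A quick numerical sketch of \(Cov(X,Y)=E_{f_{XY}}(XY)-\mu_X\mu_Y\) on simulated data (the model for \(Y\) is arbitrary); note that R’s cov() uses the \(n-1\) denominator, hence the rescaling:

# E(XY) - mu_X * mu_Y versus the (rescaled) sample covariance
set.seed(2)
n <- 1e5
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
mean(x * y) - mean(x) * mean(y)   # plug-in estimate of E(XY) - mu_X * mu_Y
cov(x, y) * (n - 1) / n           # cov() divides by n - 1; rescale to compare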

Correlation

\[\rho_{XY}:=\frac{Cov(X,Y)}{\sigma_X\sigma_Y}=\frac{Cov(X,Y)}{\sqrt{\sigma_X^2\sigma_Y^2}}\]

That is the covariance divided by the product of the standard deviations.
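
As a minimal check of this definition on our data (using the bic_gi / for_gi pair as an example), the sample correlation can be computed by hand and compared with cor():

bdims %>% 
  summarise(
    by_definition = cov(bic_gi, for_gi) / (sd(bic_gi) * sd(for_gi)),
    built_in      = cor(bic_gi, for_gi)
  )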

Properties

  • Independence \(\implies\) \(Cov(X,Y)=0\) \(\implies\) \(\rho_{XY}=0\) (Exercise 2)
  • \(Var(aX+bY)=a^2Var(X) + b^2Var(Y) + 2abCov(X,Y)\)
  • Independence \(\implies Var(aX+bY)=a^2Var(X) + b^2Var(Y)\)
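
A quick simulated check of the variance formula (the constants a and b are arbitrary); the sample version holds exactly because every term uses the same \(n-1\) denominator:

# Var(aX + bY) versus a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X,Y)
set.seed(3)
x <- rnorm(1e5)
y <- rnorm(1e5, sd = 2)
a <- 2; b <- -1
var(a * x + b * y)
a^2 * var(x) + b^2 * var(y) + 2 * a * b * cov(x, y)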

Properties

But the converse is not true.

  • \(\rho_{XY}=0\) \(\nRightarrow\) Independence
  • Why?
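
A classic counterexample, sketched numerically: take \(Y=X^2\) with \(X\) symmetric around zero; \(Y\) is completely determined by \(X\), yet the linear correlation is essentially zero:

# Strong (nonlinear) dependence with near-zero correlation
set.seed(4)
x <- rnorm(1e5)
y <- x^2
cor(x, y)   # approximately 0, even though Y is a function of X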

Theorem 4.5.7

[Casella-Berger]

For any X,Y RVs

  • \(-1 \le \rho_{X,Y} \le 1\)
  • \(|\rho_{X,Y}|=1 \iff\) there is a line such that \(P(Y=aX+b)=1\) (\(a \ne 0\) and \(b\))
  • \(\rho_{X,Y}=1 \implies a \gt 0\) (X increases \(\implies\) Y also increases)
  • \(\rho_{X,Y}=-1 \implies a \lt 0\) (X increases \(\implies\) Y decreases)

So,

  • We are measuring a linear relationship
  • Not any other type of relationship
  • Dimensionless
  • Causation is another thing

Graphical example 1

Graphical example 2

Graphical example 3

Graphical example 4

Cutoffs

Broadly speaking (it depends on the particular problem):

Absolute value of \(\rho\)      Strength of relationship
\(|\rho| \lt 0.25\)             No linear relationship
\(0.25 \le |\rho| \lt 0.5\)     Weak
\(0.5 \le |\rho| \lt 0.75\)     Moderate
\(|\rho| \ge 0.75\)             Strong

Inference

Point estimation for the correlation

  • Maximum Likelihood Estimator: Point estimate that maximizes the likelihood given the sample that you have

\[L(\rho;\text{sample}):=\prod_{i=1}^nf(x_i,y_i;\rho)\]

  • Typically found using calculus (differentiate and set to zero)

  • If \((X,Y) \sim f(x,y;\rho)\) is bivariate normal, then the Maximum Likelihood Estimator for \(\rho\) is:

\[r=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2}}=\frac{SS_{XY}}{\sqrt{SS_X SS_Y}}\]

This is known as the Pearson correlation coefficient
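
As a sketch, the formula above can be reproduced by hand on the bic_gi / for_gi pair and compared with cor():

bdims %>% 
  summarise(
    ss_xy = sum((bic_gi - mean(bic_gi)) * (for_gi - mean(for_gi))),
    ss_x  = sum((bic_gi - mean(bic_gi))^2),
    ss_y  = sum((for_gi - mean(for_gi))^2),
    r_by_hand = ss_xy / sqrt(ss_x * ss_y),
    r_cor     = cor(bic_gi, for_gi)
  )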

Hypothesis test and Confidence Intervals

  • Permutation test (without replacement)
  • Bootstrap (with replacement)
  • Exact distribution (\(r \sim Normal_2\))
  • Student’s t-distribution (typically used to test against zero)
  • Fisher transformation (Mostly used, but…)

Confidence intervals are preferred. If testing, consider whether testing against 0 is appropriate in the clinical context, keeping the sample size in mind.

t-distribution

\(H_0\): \(\rho=\rho_0=0\)

\(H_1\): \(\rho \ne 0\) (two sided)

\(H_1\): \(\rho > 0\) or \(\rho < 0\) (one sided)

\(t:=r\sqrt{\frac{n-2}{1-r^2}} \sim t_{n-2}\)

\(Reject ~ H_0 ~ if ~ |t| > t_{n-2,1-\alpha/2}\)

\(Fail ~ to ~ reject ~ H_0 ~ if ~ |t| \le t_{n-2,1-\alpha/2}\)
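
A quick check that this statistic matches what cor.test() reports (using bic_gi and for_gi; this assumes complete cases, so n is simply the number of rows of bdims):

# Hand-computed t statistic and two-sided p-value versus cor.test()
r <- cor(bdims$bic_gi, bdims$for_gi)
n <- nrow(bdims)
t_stat <- r * sqrt((n - 2) / (1 - r^2))
t_stat
unname(cor.test(bdims$bic_gi, bdims$for_gi)$statistic)
2 * pt(-abs(t_stat), df = n - 2)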

The founder: - Excuse me!

Fisher transformation

Don’t worry about the complicated formulas, Sir Ronald already did it for us -_-!

  • \(F(r):=\frac{1}{2}ln(\frac{1+r}{1-r})=arctanh(r)\) (Inverse Hyperbolic Tangent)

  • \(F(r)\) approximately (very quickly) follows a normal distribution \(N(arctanh(\rho),\frac{1}{n-3})\)

  • First compute a confidence interval for \(F(r)\)

  • then apply the inverse Fisher transformation to that interval to obtain:

  • \(100(1-\alpha)\%CI:\) \[[tanh(arctanh(r)-z_{\alpha/2}\frac{1}{\sqrt{n-3}}), tanh(arctanh(r)+z_{\alpha/2}\frac{1}{\sqrt{n-3}})]\]
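
A sketch of this interval computed by hand (bic_gi / for_gi as the example pair, 95% level), compared with the interval from cor.test(), which uses the same transformation for the Pearson CI:

# Fisher-transform confidence interval by hand
r <- cor(bdims$bic_gi, bdims$for_gi)
n <- nrow(bdims)
z <- qnorm(1 - 0.05 / 2)
tanh(atanh(r) + c(-1, 1) * z / sqrt(n - 3))
cor.test(bdims$bic_gi, bdims$for_gi)$conf.int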

Other types of correlations

Watch out!

  • When you have data that deviate too much from your model assumptions (multivariate normality, conditional normality)
  • Ordinal variables
  • Outliers

  • Influential observations (people with influence)

Alternatives

  • Confirm and clean data

  • Categorization (maybe too extreme)

  • Monotonic relationship (not necessarily in a line): Spearman

  • Nerdier correlation coefficients, mentioned in a moment

Spearman Correlation

  • At the population level, it is defined for the ranked variables as:

\[r_s:=\rho_{R(X),R(Y)}=\frac{Cov(R(X),R(Y))}{\sigma_{R(X)}\sigma_{R(Y)}}\]

  • From the sample:

  • First rank the data: Sort the data \(\rightarrow\) which position they have

  • Handle ties: Say two people tie for positions 2 and 3; assign both the average of \(\{2,3\}\), so the first four ranks become \((1,2.5,2.5,4)\).

  • Use Pearson on the ranked data (see the quick check after this list): Let \((R(X_i),R(Y_i))=(R_i,S_i)\), then

\[r_s=\frac{\sum_{i=1}^n(R_i-\bar{R})(S_i-\bar{S})}{\sqrt{\sum_{i=1}^n(R_i-\bar{R})^2 \sum_{i=1}^n(S_i-\bar{S})^2}}\]

  • When there are no ties, the formula can be simplified a little more. However, your Core i5 9th-generation (or better) processor and your 16 GB (or more) of RAM won’t even notice it. So, maybe let’s leave it for spring break -_-!
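
A minimal check that Spearman is just Pearson on the (tie-averaged) ranks, using the che_di / nav_gi pair as an example:

bdims %>% 
  summarise(
    pearson_on_ranks = cor(rank(che_di), rank(nav_gi)),   # rank() averages ties by default
    spearman         = cor(che_di, nav_gi, method = "spearman")
  )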

Back to our example

How to calculate in R

bdims %$% 
  cor.test(bic_gi,for_gi, method="pearson") -> tp1

bdims %$% 
  cor.test(bic_gi,for_gi, method="spearman") -> ts1

bdims %$% 
  cor.test(bic_gi,for_gi, method="kendall") -> tk1


bdims %$% 
  cor.test(bia_di,thi_gi, method="pearson") -> tp2

bdims %$% 
  cor.test(bia_di,thi_gi, method="spearman") -> ts2

bdims %$% 
  cor.test(bia_di,thi_gi, method="kendall") -> tk2


bdims %$% 
  cor.test(che_di,nav_gi, method="pearson") -> tp3

bdims %$% 
  cor.test(che_di,nav_gi, method="spearman") -> ts3

bdims %$% 
  cor.test(che_di,nav_gi, method="kendall") -> tk3

Previous examples

Correlation matrix

# ++++++++++++++++++++++++++++
# flattenCorrMatrix
# ++++++++++++++++++++++++++++
# cormat : matrix of the correlation coefficients
# pmat : matrix of the correlation p-values
flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor  =(cormat)[ut],
    p = pmat[ut]
    )
}

Flatten correlation matrix

library(Hmisc)
res2 <- rcorr(as.matrix(bdims[-25]))   # drop sex (column 25)
              
flattenCorrMatrix(res2$r, res2$P) %>% 
  arrange(desc(abs(cor))) %>% 
  head()
##      row column       cor p
## 1 bic_gi for_gi 0.9423755 0
## 2 sho_gi che_gi 0.9271923 0
## 3 che_gi bic_gi 0.9081845 0
## 4 for_gi wri_gi 0.9047086 0
## 5 wai_gi    wgt 0.9039908 0
## 6 che_gi    wgt 0.8989595 0

Cor plot

res <- cor(bdims[-25]) # Sex
library(corrplot)
corrplot(
  res, 
  type = "upper", 
  order = "hclust", 
  tl.col = "black", 
  tl.srt = 45,
  sig.level = 0.0001,   # sig.level and insig only take effect when a p-value matrix (p.mat) is supplied
  insig = "blank"
  )

Cor plot

Heatmap for clustering variables

col<- colorRampPalette(c("blue", "white", "red"))(20)
heatmap(x = res, col = col, symm = TRUE)

Connection with regression

In SLR

\[ SSR := \sum_{i=1}^n(Y_i-\widehat{Y_i})^2= (1-r^2)\underset{SST}{\underbrace{\sum_{i=1}^n(Y_i-\bar{Y})^2}}=(1-r^2)SST\]
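
A numerical sketch of this identity on one of our pairs (regressing for_gi on bic_gi; the choice of pair is arbitrary):

# SSR (residual sum of squares) versus (1 - r^2) * SST in simple linear regression
fit <- lm(for_gi ~ bic_gi, data = bdims)
ssr <- sum(resid(fit)^2)
sst <- sum((bdims$for_gi - mean(bdims$for_gi))^2)
r   <- cor(bdims$bic_gi, bdims$for_gi)
c(ssr, (1 - r^2) * sst)   # the two values agree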

In MLR

we define the sample multiple correlation coefficient, \(R\), as the correlation between the observed and fitted values:

\[R:=\frac{\sum_{i=1}^n(Y_i-\bar{Y})(\widehat Y_i-\overline{\widehat Y})}{\sqrt{\sum_{i=1}^n(Y_i-\bar{Y})^2 \sum_{i=1}^n(\widehat Y_i-\overline{\widehat Y})^2}}\]

Its square, \(R^2\), is known as the coefficient of determination or Multiple R-squared, and we can show that

\[R^2=\frac{SST-SSR}{SST}=1-\frac{SSR}{SST}\]

  • Interpreted as: the proportion of the variance of \(Y\) that is explained by the explanatory variables. (Go back to the graphical examples, though.)
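
A quick sketch with an arbitrary illustrative model (wgt on hgt and wai_gi, chosen only for the example): \(R^2\) equals the squared correlation between observed and fitted values and matches the Multiple R-squared reported by summary():

fit <- lm(wgt ~ hgt + wai_gi, data = bdims)
c(cor(bdims$wgt, fitted(fit))^2, summary(fit)$r.squared)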

Further into regression

  • Not a measure of goodness-of-fit

  • Different adjustments

    • Adjusted R-squared (number of predictors)
    • Predicted R-Squared

Nerdier correlation coefficients!

More coefficients

  • Also for ranked data: Kendall’s rank correlation coefficient \(\tau\), which is based on concordant and discordant pairs (pairs of points that increase or decrease together). There are three versions of it.

  • Two binary variables \(\implies\) Tetrachoric correlation

  • Two ordinal variables with latent normal variables \(\implies\) Polychoric correlation

  • People keep working on better coefficients

Keypoints

Take home

  • Calculate Pearson’s and Spearman’s coefficients for the three examples in SAS
  • Know what kind of relationships they measure
  • Relationship with regression

Thanks!

References

Çetinkaya-Rundel, Mine, and Johanna Hardin. n.d. “Introduction to Modern Statistics,” 549.

Colin Cameron, A., and Frank A. G. Windmeijer. 1997. “An R-Squared Measure of Goodness of Fit for Some Common Nonlinear Regression Models.” Journal of Econometrics 77 (2): 329–42. https://doi.org/10.1016/S0304-4076(96)01818-0.

Heinz, Grete, Louis J. Peterson, Roger W. Johnson, and Carter J. Kerk. 2003. “Exploring Relationships in Body Dimensions.” Journal of Statistics Education 11 (2). https://doi.org/10.1080/10691898.2003.11910711.

“Pearson Correlation Coefficient.” 2022. https://en.wikipedia.org/w/index.php?title=Pearson_correlation_coefficient&oldid=1067643377.

Vu, Julie, and David Harrington. n.d. “Introductory Statistics for the Life and Biomedical Sciences,” 472.