2022-02-17

Motivation

Let’s study our body measurements

Body measurements with R

(Heinz et al. 2003)

library(knitr)
library(tidyverse)
library(magrittr)
library(GGally)
library(gridExtra)
library(openintro)

# Loading the data
data(bdims)
# help(bdims)

Codebook

Expectation

We expect:

  • certain body measures to be “associated” or “correlated” with each other

    • bic_gi (bicep girth) vs for_gi (forearm girth)

We expect:

  • certain body measures to be independent

    • bia_di (biacromial diameter) vs thi_gi (thigh girth)

Sometimes there is doubt

For others, the only certain thing is: who knows -_-!

  • che_di (chest diameter) vs nav_gi (navel (abdominal) girth)

Let’s look at the data

Some plots

bdims %>% 
  ggplot(aes(bic_gi,for_gi)) +
  geom_point()+
  xlab("bic_gi: bicep girth")+
  ylab("for_gi: forearm girth") -> p1

bdims %>% 
  ggplot(aes(bia_di,thi_gi)) +
  geom_point()+
  xlab("bia_di: biacromial diameter")+
  ylab("thi_gi: thigh girth") -> p2

bdims %>% 
  ggplot(aes(che_di,nav_gi)) +
  geom_point()+
  xlab("che_di: chest diameter")+
  ylab("nav_gi: navel (abdominal) girth") -> p3

Let’s look at the data

grid.arrange(p1,p2,p3,nrow=1)

How do we measure these associations?

Recall

The distribution of a RV

  • cdf: \(F_X(x):=P(X \le x)\)
  • pdf: \(f_X\) is such that \(F_X(x)=\int_{-\infty}^x f_X(t)dt\)
  • Normal case \(X \sim N(\mu,\sigma^2)\) if \(f_X(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{\frac{-(x-\mu)^2}{2\sigma^2}}\)
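
As a quick sanity check (a sketch with arbitrary \(\mu\) and \(\sigma\), not taken from the data), the closed-form pdf above should match R’s built-in dnorm():

# Compare the hand-written normal pdf with dnorm() on a grid of x values
mu <- 1; sigma <- 2
x <- seq(-5, 7, by = 0.5)
manual <- 1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
all.equal(manual, dnorm(x, mean = mu, sd = sigma))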

Expectation of functions of RVs

  • \(E_{f_X}(g(X)):=\int_{-\infty}^{\infty} g(x)f_X(x)dx\)

Centrality and dispersion

  • \(E_{f_X}(X):=\int_{-\infty}^{\infty} xf_X(x)dx=\mu_X\)
  • \(Var(X)=E(X-\mu_X)^2=E(X^2)-\mu_X^2=\sigma_X^2\)
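
A small Monte Carlo sketch of these identities on simulated data (the parameters and sample size are arbitrary, not from bdims):

# Simulate X ~ N(3, 2^2) and check mu_X and Var(X) = E(X^2) - mu_X^2 numerically
set.seed(1)
x <- rnorm(1e5, mean = 3, sd = 2)
mean(x)                 # approximates mu_X = 3
mean(x^2) - mean(x)^2   # approximates Var(X) = 4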

Two RVs

Joint distribution

  • cdf:\(F_{X,Y}(x,y):=P(X\le x, Y\le y)\)
  • pdf: \(f_{X,Y}\) is such that \(F_{X,Y}(x,y)=\int_{-\infty}^y \int_{-\infty}^x f_{X,Y}(s,t)dsdt\)
  • Expectation: \(E(g(X,Y))=\int_{-\infty}^\infty \int_{-\infty}^\infty g(x,y) f_{X,Y}(x,y)dxdy\)
  • Conditional distribution: \(f_{Y|X}(y|x):=\frac{f_{X,Y}(x,y)}{f_{X}(x)}\)
  • Independence: \(f_{X,Y}(x,y)=f_{X}(x)f_{Y}(y)\) \(\Rightarrow\) \(f_{Y|X}(y|x)=f_{Y}(y)\)

Covariance (today’s business)

  • \(X \sim f_{X}(x)\), \(Y \sim f_{Y}(y)\), and \((X,Y) \sim f_{X,Y}(x,y)\)
  • \(E_{f_X}(X)=\mu_X\), \(E_{f_Y}(Y)=\mu_Y\), and \(E_{f_{XY}}(XY)=E(XY)\) (marginal vs joint)
  • Define
  • \[Cov(X,Y):=E_{f_{XY}}\{(X-\mu_X)(Y-\mu_Y)\}=E_{f_{XY}}(XY)-\mu_X\mu_Y\]
  • Exercise 1: Prove the last equality without getting rid of the subscript in \(E_{f_{XY}}(\cdot)\), i.e., along the way, show that \(E_{f_{XY}}(Y)=\mu_Y\) and then \(E_{f_{XY}}(X)=\mu_X\).
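
A quick numerical sketch of \(Cov(X,Y)=E_{f_{XY}}(XY)-\mu_X\mu_Y\) on simulated data (the model for \(Y\) is arbitrary); note that R’s cov() uses the \(n-1\) denominator, hence the rescaling:

# E(XY) - mu_X * mu_Y versus the (rescaled) sample covariance
set.seed(2)
n <- 1e5
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
mean(x * y) - mean(x) * mean(y)   # plug-in estimate of E(XY) - mu_X * mu_Y
cov(x, y) * (n - 1) / n           # cov() divides by n - 1; rescale to compare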

Correlation

\[\rho_{XY}:=\frac{Cov(X,Y)}{\sigma_X\sigma_Y}=\frac{Cov(X,Y)}{\sqrt{\sigma_X^2\sigma_Y^2}}\]

That is the covariance divided by the product of the standard deviations.
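
As a minimal check of this definition on our data (using the bic_gi / for_gi pair as an example), the sample correlation can be computed by hand and compared with cor():

bdims %>% 
  summarise(
    by_definition = cov(bic_gi, for_gi) / (sd(bic_gi) * sd(for_gi)),
    built_in      = cor(bic_gi, for_gi)
  )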

Properties

  • Independence \(\implies\) \(Cov(X,Y)=0\) \(\implies\) \(\rho_{XY}=0\) (Exercise 2)
  • \(Var(aX+bY)=a^2Var(X) + b^2Var(Y) + 2abCov(X,Y)\)
  • Independence \(\implies Var(aX+bY)=a^2Var(X) + b^2Var(Y)\)
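
A quick simulated check of the variance formula (the constants a and b are arbitrary); the sample version holds exactly because every term uses the same \(n-1\) denominator:

# Var(aX + bY) versus a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X,Y)
set.seed(3)
x <- rnorm(1e5)
y <- rnorm(1e5, sd = 2)
a <- 2; b <- -1
var(a * x + b * y)
a^2 * var(x) + b^2 * var(y) + 2 * a * b * cov(x, y)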

Properties

But the converse is not true.

  • \(\rho_{XY}=0\) \(\nRightarrow\) Independence
  • Why?
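
A classic counterexample, sketched numerically: take \(Y=X^2\) with \(X\) symmetric around zero; \(Y\) is completely determined by \(X\), yet the linear correlation is essentially zero:

# Strong (nonlinear) dependence with near-zero correlation
set.seed(4)
x <- rnorm(1e5)
y <- x^2
cor(x, y)   # approximately 0, even though Y is a function of X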

Theorem 4.5.7

[Casella-Berger]

For any X,Y RVs

  • \(-1 \le \rho_{X,Y} \le 1\)
  • \(|\rho_{X,Y}|=1 \iff\) there is a line such that \(P(Y=aX+b)=1\) (\(a \ne 0\) and \(b\))
  • \(\rho_{X,Y}=1 \implies a \gt 0\) (X increases \(\implies\) Y also increases)
  • \(\rho_{X,Y}=-1 \implies a \lt 0\) (X increases \(\implies\) Y decreases)

So,

  • We are measuring a linear relationship
  • Not any other type of relationship
  • Dimensionless
  • Causation is another thing

Graphical example 1

Graphical example 2

Graphical example 3

Graphical example 4

Cutoffs

Broadly speaking (it depends on the particular problem):

Absolute value of \(\rho\)      Strength of relationship
\(|\rho| \lt 0.25\)             No linear relationship
\(0.25 \le |\rho| \lt 0.5\)     Weak
\(0.5 \le |\rho| \lt 0.75\)     Moderate
\(|\rho| \ge 0.75\)             Strong

Inference

Point estimation for the correlation

  • Maximum Likelihood Estimator: Point estimate that maximizes the likelihood given the sample that you have

\[L(\rho;\text{sample}):=\prod_{i=1}^nf(x_i,y_i;\rho)\]

  • Typically found using calculus (differentiate and set to zero)

  • If \((X,Y) \sim f(x,y;\rho)\) is bivariate normal, then the Maximum Likelihood Estimator for \(\rho\) is:

\[r=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2}}=\frac{SS_{XY}}{\sqrt{SS_X SS_Y}}\]

This is known as the Pearson correlation coefficient
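
As a sketch, the formula above can be reproduced by hand on the bic_gi / for_gi pair and compared with cor():

bdims %>% 
  summarise(
    ss_xy = sum((bic_gi - mean(bic_gi)) * (for_gi - mean(for_gi))),
    ss_x  = sum((bic_gi - mean(bic_gi))^2),
    ss_y  = sum((for_gi - mean(for_gi))^2),
    r_by_hand = ss_xy / sqrt(ss_x * ss_y),
    r_cor     = cor(bic_gi, for_gi)
  )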

Hypothesis test and Confidence Intervals

  • Permutation test (without replacement)
  • Bootstrap (with replacement)
  • Exact distribution (\(r \sim Normal_2\))
  • Student’s t-distribution (typically used to test against zero)
  • Fisher transformation (Mostly used, but…)

Confidence intervals are preferred. If testing, consider whether testing against 0 is appropriate in the clinical context, keeping the sample size in mind.

t-distribution

\(H_0\): \(\rho=\rho_0=0\)

\(H_1\): \(\rho \ne 0\) (two sided)

\(H_1\): \(\rho > 0\) or \(\rho < 0\) (one sided)

\(t:=r\sqrt{\frac{n-2}{1-r^2}} \sim t_{n-2}\)

\(Reject ~ H_0 ~ if ~ |t| > t_{n-2,1-\alpha/2}\)

\(Fail ~ to ~ reject ~ H_0 ~ if ~ |t| \le t_{n-2,1-\alpha/2}\)
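
A quick check that this statistic matches what cor.test() reports (using bic_gi and for_gi; this assumes complete cases, so n is simply the number of rows of bdims):

# Hand-computed t statistic and two-sided p-value versus cor.test()
r <- cor(bdims$bic_gi, bdims$for_gi)
n <- nrow(bdims)
t_stat <- r * sqrt((n - 2) / (1 - r^2))
t_stat
unname(cor.test(bdims$bic_gi, bdims$for_gi)$statistic)
2 * pt(-abs(t_stat), df = n - 2)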

The founder: - Excuse me!

Fisher transformation

Don’t worry about the complicated formulas, Sir Ronald already did it for us -_-!

  • \(F(r):=\frac{1}{2}ln(\frac{1+r}{1-r})=arctanh(r)\) (Inverse Hyperbolic Tangent)

  • \(F(r)\) approximately (very quickly) follows a normal distribution \(N(arctanh(\rho),\frac{1}{n-3})\)

  • First compute a confidence interval for \(F(r)\)

  • then apply the inverse Fisher transformation to that interval to obtain:

  • \(100(1-\alpha)\%CI:\) \[[tanh(arctanh(r)-z_{\alpha/2}\frac{1}{\sqrt{n-3}}), tanh(arctanh(r)+z_{\alpha/2}\frac{1}{\sqrt{n-3}})]\]
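
A sketch of this interval computed by hand (bic_gi / for_gi as the example pair, 95% level), compared with the interval from cor.test(), which uses the same transformation for the Pearson CI:

# Fisher-transform confidence interval by hand
r <- cor(bdims$bic_gi, bdims$for_gi)
n <- nrow(bdims)
z <- qnorm(1 - 0.05 / 2)
tanh(atanh(r) + c(-1, 1) * z / sqrt(n - 3))
cor.test(bdims$bic_gi, bdims$for_gi)$conf.int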

Other types of correlations

Watch out!

  • When you have data that deviate too much from your model assumptions (multivariate normality, conditional normality)
  • Ordinal variables
  • Outliers

  • Influential observations (people with influence)

Alternatives

  • Confirm and clean data

  • Categorization (maybe too extreme)

  • Monotonic relationship (not necessarily in a line): Spearman

  • Nerdier correlation coefficients, mentioned in a moment

Spearman Correlation

  • At the population level, it is defined for the ranked variables as:

\[r_s:=\rho_{R(X),R(Y)}=\frac{Cov(R(X),R(Y))}{\sigma_{R(X)}\sigma_{R(Y)}}\]

  • From the sample:

  • First rank the data: Sort the data \(\rightarrow\) which position they have

  • Handle ties: Say two people tie for positions 2 and 3; assign both the average of \(\{2,3\}\), so the first four ranks become \((1,2.5,2.5,4)\).

  • Use Pearson on the ranked data (see the quick check after this list): Let \((R(X_i),R(Y_i))=(R_i,S_i)\), then

\[r_s=\frac{\sum_{i=1}^n(R_i-\bar{R})(S_i-\bar{S})}{\sqrt{\sum_{i=1}^n(R_i-\bar{R})^2 \sum_{i=1}^n(S_i-\bar{S})^2}}\]

  • When there are no ties, the formula can be simplified a little more. However, your Core i5 9th-generation (or better) processor and your 16 GB (or more) of RAM won’t even notice it. So, maybe let’s leave it for spring break -_-!
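
A minimal check that Spearman is just Pearson on the (tie-averaged) ranks, using the che_di / nav_gi pair as an example:

bdims %>% 
  summarise(
    pearson_on_ranks = cor(rank(che_di), rank(nav_gi)),   # rank() averages ties by default
    spearman         = cor(che_di, nav_gi, method = "spearman")
  )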

Back to our example

How to calculate in R

bdims %$% 
  cor.test(bic_gi,for_gi, method="pearson") -> tp1

bdims %$% 
  cor.test(bic_gi,for_gi, method="spearman") -> ts1

bdims %$% 
  cor.test(bic_gi,for_gi, method="kendall") -> tk1


bdims %$% 
  cor.test(bia_di,thi_gi, method="pearson") -> tp2

bdims %$% 
  cor.test(bia_di,thi_gi, method="spearman") -> ts2

bdims %$% 
  cor.test(bia_di,thi_gi, method="kendall") -> tk2


bdims %$% 
  cor.test(che_di,nav_gi, method="pearson") -> tp3

bdims %$% 
  cor.test(che_di,nav_gi, method="spearman") -> ts3

bdims %$% 
  cor.test(che_di,nav_gi, method="kendall") -> tk3

Previous examples

Correlation matrix

# ++++++++++++++++++++++++++++
# flattenCorrMatrix
# ++++++++++++++++++++++++++++
# cormat : matrix of the correlation coefficients
# pmat : matrix of the correlation p-values
flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor  =(cormat)[ut],
    p = pmat[ut]
    )
}

Flatten correlation matrix

library(Hmisc)
res2 <- rcorr(as.matrix(bdims[-25]))   # drop sex (column 25)
              
flattenCorrMatrix(res2$r, res2$P) %>% 
  arrange(desc(abs(cor))) %>% 
  head()
##      row column       cor p
## 1 bic_gi for_gi 0.9423755 0
## 2 sho_gi che_gi 0.9271923 0
## 3 che_gi bic_gi 0.9081845 0
## 4 for_gi wri_gi 0.9047086 0
## 5 wai_gi    wgt 0.9039908 0
## 6 che_gi    wgt 0.8989595 0

Cor plot

res <- cor(bdims[-25]) # Sex
library(corrplot)
corrplot(
  res, 
  type = "upper", 
  order = "hclust", 
  tl.col = "black", 
  tl.srt = 45,
  sig.level = 0.0001,   # sig.level and insig only take effect when a p-value matrix (p.mat) is supplied
  insig = "blank"
  )

Cor plot

Heatmap for clustering variables

col<- colorRampPalette(c("blue", "white", "red"))(20)
heatmap(x = res, col = col, symm = TRUE)

Connection with regression

In SLR

\[ SSR := \sum_{i=1}^n(Y_i-\widehat{Y_i})^2= (1-r^2)\underset{SST}{\underbrace{\sum_{i=1}^n(Y_i-\bar{Y})^2}}=(1-r^2)SST\]
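
A numerical sketch of this identity on one of our pairs (regressing for_gi on bic_gi; the choice of pair is arbitrary):

# SSR (residual sum of squares) versus (1 - r^2) * SST in simple linear regression
fit <- lm(for_gi ~ bic_gi, data = bdims)
ssr <- sum(resid(fit)^2)
sst <- sum((bdims$for_gi - mean(bdims$for_gi))^2)
r   <- cor(bdims$bic_gi, bdims$for_gi)
c(ssr, (1 - r^2) * sst)   # the two values agree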

In MLR

we define the sample multiple correlation coefficient, \(R\), as the correlation between the observed and fitted values:

\[R:=\frac{\sum_{i=1}^n(Y_i-\bar{Y})(\widehat Y_i-\overline{\widehat Y})}{\sqrt{\sum_{i=1}^n(Y_i-\bar{Y})^2 \sum_{i=1}^n(\widehat Y_i-\overline{\widehat Y})^2}}\]

Its square, \(R^2\), is known as the coefficient of determination or Multiple R-squared, and we can show that

\[R^2=\frac{SST-SSR}{SST}=1-\frac{SSR}{SST}\]

  • Interpreted as: the proportion of the variance of \(Y\) that is explained by the explanatory variables. (Go back to the graphical examples, though.)
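
A quick sketch with an arbitrary illustrative model (wgt on hgt and wai_gi, chosen only for the example): \(R^2\) equals the squared correlation between observed and fitted values and matches the Multiple R-squared reported by summary():

fit <- lm(wgt ~ hgt + wai_gi, data = bdims)
c(cor(bdims$wgt, fitted(fit))^2, summary(fit)$r.squared)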

Further into regression

  • Not a measure of goodness-of-fit

  • Different adjustments

    • Adjusted R-squared (number of predictors)
    • Predicted R-Squared

Nerdier correlation coefficients!

More coefficients

  • Also for ranked data: Kendall’s rank correlation coefficient \(\tau\), which is based on concordant and discordant pairs (pairs of points that increase or decrease together). There are three versions of it.

  • Two binary variables \(\implies\) Tetrachoric correlation

  • Two ordinal variables with latent normal variables \(\implies\) Polychoric correlation

  • People keep working on better coefficients

Keypoints

Take home

  • Calculate Pearson’s and Spearman’s coefficients for the three examples in SAS
  • Know what kind of relationships they measure
  • Relationship with regression

Thanks!

References

Çetinkaya-Rundel, Mine, and Johanna Hardin. n.d. “Introduction to Modern Statistics,” 549.

Colin Cameron, A., and Frank A. G. Windmeijer. 1997. “An R-Squared Measure of Goodness of Fit for Some Common Nonlinear Regression Models.” Journal of Econometrics 77 (2): 329–42. https://doi.org/10.1016/S0304-4076(96)01818-0.

Heinz, Grete, Louis J. Peterson, Roger W. Johnson, and Carter J. Kerk. 2003. “Exploring Relationships in Body Dimensions.” Journal of Statistics Education 11 (2). https://doi.org/10.1080/10691898.2003.11910711.

“Pearson Correlation Coefficient.” 2022. https://en.wikipedia.org/w/index.php?title=Pearson_correlation_coefficient&oldid=1067643377.

Vu, Julie, and David Harrington. n.d. “Introductory Statistics for the Life and Biomedical Sciences,” 472.