2022-02-17
library(knitr)
library(tidyverse)
library(magrittr)
library(GGally)
library(gridExtra)
library(openintro)

# Loading the data
data(bdims)
# help(bdims)
Intuitively, we expect:

- Certain body measures to be “associated” or “correlated” with each other
- Certain body measures to be independent
- For others, the most certain thing is: who knows -_-!
bdims %>%
  ggplot(aes(bic_gi, for_gi)) +
  geom_point() +
  xlab("bic_gi: bicep girth") +
  ylab("for_gi: forearm girth") -> p1

bdims %>%
  ggplot(aes(bia_di, thi_gi)) +
  geom_point() +
  xlab("bia_di: biacromial diameter") +
  ylab("thi_gi: thigh girth") -> p2

bdims %>%
  ggplot(aes(che_di, nav_gi)) +
  geom_point() +
  xlab("che_di: chest diameter") +
  ylab("nav_gi: navel (abdominal) girth") -> p3

grid.arrange(p1, p2, p3, nrow = 1)
- The distribution of a random variable (RV)
- Expectation of functions of RVs
- Centrality and dispersion
- Joint distributions
\[\rho_{XY}:=\frac{Cov(X,Y)}{\sigma_X\sigma_Y}=\frac{Cov(X,Y)}{\sqrt{\sigma_X^2\sigma_Y^2}}\]
That is, the covariance divided by the product of the standard deviations.
If \(X\) and \(Y\) are independent, then \(\rho_{XY}=0\). But the converse is not true: for example, if \(X \sim N(0,1)\) and \(Y=X^2\), then \(Cov(X,Y)=E[X^3]=0\) even though \(Y\) is completely determined by \(X\).
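A quick numerical check of this counterexample (a minimal sketch; the seed and simulation size are arbitrary choices):

# Zero correlation does not imply independence:
# X ~ N(0,1) and Y = X^2 are clearly dependent, yet uncorrelated.
set.seed(42)
x <- rnorm(1e5)
y <- x^2

cor(x, y)                     # close to 0
cov(x, y) / (sd(x) * sd(y))   # identical: the definition of rho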
[Casella-Berger]
For any RVs \(X\) and \(Y\): \(-1 \le \rho_{XY} \le 1\), with \(|\rho_{XY}|=1\) if and only if \(Y\) is (almost surely) a linear function of \(X\).
Broadly speaking (the exact interpretation depends on the particular problem):

| Absolute value of \(\rho\) | Strength of relationship |
|---|---|
| \(|\rho| \lt 0.25\) | No linear relationship |
| \(0.25 \le |\rho| \lt 0.5\) | Weak |
| \(0.5 \le |\rho| \lt 0.75\) | Moderate |
| \(|\rho| \ge 0.75\) | Strong |
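For convenience, the table can be encoded as a small helper; `correlation_strength` is a hypothetical name for illustration, not a function from any package:

# Map an estimated correlation to the strength labels in the table above.
correlation_strength <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.25, 0.5, 0.75, 1),
      labels = c("No linear relationship", "Weak", "Moderate", "Strong"),
      right = FALSE, include.lowest = TRUE)
}

correlation_strength(c(-0.9, 0.3, 0.6, 0.1))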
\[L(\rho;\text{sample}):=\prod_{i=1}^n f(x_i,y_i;\rho)\]
Typically found using calculus (differentiate the log-likelihood and set it to zero).
If \((X,Y) \sim f(x,y;\rho)\) is bivariate normal, then the Maximum Likelihood Estimator for \(\rho\) is:
\[r=\frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2}}=\frac{SS_{XY}}{\sqrt{SS_X SS_Y}}\]
This is known as the Pearson (sample) correlation coefficient.
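To make the formula concrete, here is a minimal sketch that computes \(r\) from the sums of squares by hand and checks it against R's built-in cor(); the pair of bdims columns is an arbitrary choice:

# Pearson's r from the SS formula, checked against cor()
x <- bdims$bic_gi
y <- bdims$for_gi

ss_xy <- sum((x - mean(x)) * (y - mean(y)))
ss_x  <- sum((x - mean(x))^2)
ss_y  <- sum((y - mean(y))^2)

r_manual <- ss_xy / sqrt(ss_x * ss_y)
all.equal(r_manual, cor(x, y))  # TRUE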
Confidence intervals are preferred. If you do test, consider whether testing against 0 is appropriate in the clinical context, keeping the sample size in mind.
\(H_0\): \(\rho=\rho_0=0\)
\(H_1\): \(\rho \ne 0\) (two sided)
\(H_1\): \(\rho > 0\) or \(\rho < 0\) (one sided)
\(t:=r\sqrt{\frac{n-2}{1-r^2}} \sim t_{n-2}\)
\(Reject ~ H_0 ~ if ~ |t|>t_{n-2,1-\alpha/2}\)
\(Fail ~ to ~ reject ~ H_0 ~ if ~ |t| \le t_{n-2,1-\alpha/2}\)
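A minimal sketch of this test done by hand, checked against cor.test(), which implements the same t statistic for the Pearson method; the choice of bdims columns is again arbitrary:

# t test for H0: rho = 0, by hand vs. cor.test()
x <- bdims$bia_di
y <- bdims$thi_gi
n <- length(x)
r <- cor(x, y)

t_stat <- r * sqrt((n - 2) / (1 - r^2))
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

c(t = t_stat, p = p_val)
cor.test(x, y)  # reports the same t and p-value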
Don’t worry about the complicated formulas; Sir Ronald (Fisher) already did them for us -_-!
\(F(r):=\frac{1}{2}\ln\left(\frac{1+r}{1-r}\right)=\operatorname{arctanh}(r)\) (the inverse hyperbolic tangent)
\(F(r)\) approximately (very quickly) follows a normal distribution \(N(\operatorname{arctanh}(\rho),\frac{1}{n-3})\)
First compute a confidence interval for \(F(r)\)
then apply the inverse Fisher transformation to that interval to obtain:
\(100(1-\alpha)\%CI:\) \[\left[\tanh\left(\operatorname{arctanh}(r)-\frac{z_{\alpha/2}}{\sqrt{n-3}}\right),\ \tanh\left(\operatorname{arctanh}(r)+\frac{z_{\alpha/2}}{\sqrt{n-3}}\right)\right]\]
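A minimal sketch computing this interval directly and comparing it with the confidence interval reported by cor.test(), which uses the same Fisher transformation:

# 95% CI for rho via the Fisher transformation, by hand vs. cor.test()
x <- bdims$bic_gi
y <- bdims$for_gi
n <- length(x)
r <- cor(x, y)

z  <- qnorm(0.975)                      # z_{alpha/2} for alpha = 0.05
ci <- tanh(atanh(r) + c(-1, 1) * z / sqrt(n - 3))

ci
cor.test(x, y)$conf.int  # matches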
- Outliers
- Confirm and clean the data
- Categorization (maybe too extreme)
- Monotonic relationship (not necessarily in a line): Spearman
- Nerdier correlation coefficients, mentioned in a moment
\[r_s:=\rho_{R(X),R(Y)}=\frac{Cov(R(X),R(Y))}{\sigma_{R(X)}\sigma_{R(Y)}}\]

From the sample:
First rank the data: sort the values and record the position each one occupies.
Handle ties: say two observations tie for positions 2 and 3; assign each the average of \(\{2,3\}\), so the first four ranks become \((1,2.5,2.5,4)\).
Use Pearson on the ranked data: let \((R(X_i),R(Y_i))=(R_i,S_i)\), then
\[r_s=\frac{\sum_{i=1}^n(R_i-\bar{R})(S_i-\bar{S})}{\sqrt{\sum_{i=1}^n(R_i-\bar{R})^2 \sum_{i=1}^n(S_i-\bar{S})^2}}\]
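The rank-then-Pearson recipe can be verified against R's built-in Spearman option (a minimal sketch on one bdims pair):

# Spearman = Pearson on the ranks (rank() averages ties by default)
x <- bdims$bic_gi
y <- bdims$for_gi

r_s <- cor(rank(x), rank(y))
all.equal(r_s, cor(x, y, method = "spearman"))  # TRUE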
bdims %$% cor.test(bic_gi, for_gi, method = "pearson")  -> tp1
bdims %$% cor.test(bic_gi, for_gi, method = "spearman") -> ts1
bdims %$% cor.test(bic_gi, for_gi, method = "kendall")  -> tk1
bdims %$% cor.test(bia_di, thi_gi, method = "pearson")  -> tp2
bdims %$% cor.test(bia_di, thi_gi, method = "spearman") -> ts2
bdims %$% cor.test(bia_di, thi_gi, method = "kendall")  -> tk2
bdims %$% cor.test(che_di, nav_gi, method = "pearson")  -> tp3
bdims %$% cor.test(che_di, nav_gi, method = "spearman") -> ts3
bdims %$% cor.test(che_di, nav_gi, method = "kendall")  -> tk3
# ++++++++++++++++++++++++++++
# flattenCorrMatrix
# ++++++++++++++++++++++++++++
# cormat : matrix of the correlation coefficients
# pmat : matrix of the correlation p-values
flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row    = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor    = cormat[ut],
    p      = pmat[ut]
  )
}
library(Hmisc)
res2 <- rcorr(as.matrix(bdims[-25]))  # drop sex (column 25)
flattenCorrMatrix(res2$r, res2$P) %>%
arrange(desc(abs(cor))) %>%
head()
##      row column       cor p
## 1 bic_gi for_gi 0.9423755 0
## 2 sho_gi che_gi 0.9271923 0
## 3 che_gi bic_gi 0.9081845 0
## 4 for_gi wri_gi 0.9047086 0
## 5 wai_gi    wgt 0.9039908 0
## 6 che_gi    wgt 0.8989595 0
res <- cor(bdims[-25])  # drop sex (column 25)
library(corrplot)
corrplot(
  res,
  type = "upper",
  order = "hclust",
  tl.col = "black",
  tl.srt = 45,
  sig.level = 0.0001,
  insig = "blank"
)
col <- colorRampPalette(c("blue", "white", "red"))(20)
heatmap(x = res, col = col, symm = TRUE)
\[SSR = \sum_{i=1}^n(Y_i-\widehat{Y}_i)^2 = (1-r^2)\underset{SST}{\underbrace{\sum_{i=1}^n(Y_i-\bar{Y})^2}} = (1-r^2)\,SST\]
We define the sample multiple correlation coefficient, \(R\):
\[R:=\frac{\sum_{i=1}^n(Y_i-\bar{Y})(\widehat{Y}_i-\overline{\widehat{Y}})}{\sqrt{\sum_{i=1}^n(Y_i-\bar{Y})^2 \sum_{i=1}^n(\widehat{Y}_i-\overline{\widehat{Y}})^2}}\] Its square, \(R^2\), is known as the coefficient of determination or Multiple R-squared, and one can show that
\[R^2=\frac{SST-SSR}{SST}=1-\frac{SSR}{SST}\]
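A minimal sketch checking both identities on a simple regression with the bdims data (the choice of response and predictor is arbitrary):

# R^2 = 1 - SSR/SST, and for simple regression R^2 = r^2
fit <- lm(for_gi ~ bic_gi, data = bdims)

ssr <- sum(residuals(fit)^2)
sst <- sum((bdims$for_gi - mean(bdims$for_gi))^2)

c(one_minus_ssr_sst = 1 - ssr / sst,
  r_squared_of_cor  = cor(bdims$bic_gi, bdims$for_gi)^2,
  lm_r_squared      = summary(fit)$r.squared)  # all three agree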
Not a measure of goodness-of-fit
Different adjustments exist (e.g., the adjusted R-squared, which penalizes extra predictors)
Also for ranked data: Kendall’s rank correlation coefficient \(\tau\), which is based on concordant and discordant pairs (pairs of points that increase or decrease together). There are three versions of it (\(\tau_a\), \(\tau_b\), and \(\tau_c\)).
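A minimal sketch of the concordant/discordant idea: count the pairs by brute force on simulated tie-free data (where the three versions of \(\tau\) coincide) and compare with cor():

# Kendall's tau from concordant/discordant pairs (O(n^2), small n only)
set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50)

d <- outer(x, x, "-") * outer(y, y, "-")  # positive entry: concordant pair
concordant <- sum(d[upper.tri(d)] > 0)
discordant <- sum(d[upper.tri(d)] < 0)

tau_manual <- (concordant - discordant) / choose(50, 2)
all.equal(tau_manual, cor(x, y, method = "kendall"))  # TRUE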
Two binary variables \(\implies\) Tetrachoric correlation
Two ordinal variables with latent normal variables \(\implies\) Polychoric correlation
People keep working on better coefficients
Çetinkaya-Rundel, Mine, and Johanna Hardin. n.d. “Introduction to Modern Statistics,” 549.
Colin Cameron, A., and Frank A. G. Windmeijer. 1997. “An R-Squared Measure of Goodness of Fit for Some Common Nonlinear Regression Models.” Journal of Econometrics 77 (2): 329–42. https://doi.org/10.1016/S0304-4076(96)01818-0.
Heinz, Grete, Louis J. Peterson, Roger W. Johnson, and Carter J. Kerk. 2003. “Exploring Relationships in Body Dimensions.” Journal of Statistics Education 11 (2). https://doi.org/10.1080/10691898.2003.11910711.
“Pearson Correlation Coefficient.” 2022. https://en.wikipedia.org/w/index.php?title=Pearson_correlation_coefficient&oldid=1067643377.
Vu, Julie, and David Harrington. n.d. “Introductory Statistics for the Life and Biomedical Sciences,” 472.