Paired data are bivariate data in which the two measurements are coupled or matched, so each observation is a pair of values.
The two variables are not necessarily independent.
require(UsingR)
names(fat)
## [1] "case" "body.fat" "body.fat.siri" "density"
## [5] "age" "weight" "height" "BMI"
## [9] "ffweight" "neck" "chest" "abdomen"
## [13] "hip" "thigh" "knee" "ankle"
## [17] "bicep" "forearm" "wrist"
plot(fat$wrist, fat$neck)
par(mfrow=c(1,2))
plot(neck~wrist, data=fat)
plot(neck~wrist, data=fat, subset = 20 <= age & age < 30)
plot(fat$wrist, fat$neck)
abline(v=mean(fat$wrist))
abline(h=mean(fat$neck))
points(mean(fat$wrist), mean(fat$neck), pch=16, col=rgb(.35,0,0))
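The two mean lines split the plot into four boxes (quadrants). If the variables are related, most of the data should fall in the first and third boxes (positive association) or in the second and fourth boxes (negative association).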
cor(fat$wrist, fat$neck)
## [1] 0.7448264
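The box argument can be checked numerically. The sketch below uses simulated data (an assumption, standing in for fat$wrist and fat$neck) and counts how many points lie in boxes 1 and 3 versus boxes 2 and 4. It then writes out the Pearson correlation from the same deviations, which shows why the sign of the correlation follows the boxes that hold most of the data.

```r
# Simulated paired data (an assumption; not the fat dataset)
set.seed(1)
x <- rnorm(200)
y <- 0.7 * x + rnorm(200, sd = 0.5)

# Deviation of each point from the two mean lines
dx <- x - mean(x)
dy <- y - mean(y)

# Boxes 1 and 3: deviations share a sign, so dx * dy > 0
# Boxes 2 and 4: deviations have opposite signs, so dx * dy < 0
same_sign     <- sum(dx * dy > 0)
opposite_sign <- sum(dx * dy < 0)

# Pearson correlation built from the same products of deviations
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

c(same_sign = same_sign, opposite_sign = opposite_sign)
c(manual = r_manual, builtin = cor(x, y))
```

Points in boxes 1 and 3 add positive terms to the numerator, and points in boxes 2 and 4 add negative terms.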
require(MASS)
plot(Animals$body,Animals$brain)
cor(Animals$body,Animals$brain)
## [1] -0.005341163
cor(rank(Animals$body), rank(Animals$brain))
## [1] 0.7162994
https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
Any benefits over Pearson’s correlation?
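One commonly cited benefit is that rank correlation depends only on the ordering of the values, so it is unchanged by monotone transformations (such as taking logs) and is less affected by a few extreme values. The sketch below illustrates this on simulated skewed data (an assumption, not the Animals data).

```r
# Simulated skewed, positive data (an assumption; not the Animals dataset)
set.seed(2)
x <- rlnorm(50)
y <- x^2 * rlnorm(50, sdlog = 0.5)

# Spearman's coefficient is Pearson's coefficient applied to the ranks
spearman_ranks   <- cor(rank(x), rank(y))
spearman_builtin <- cor(x, y, method = "spearman")

# Taking logs changes Pearson's coefficient but leaves the ranks untouched
pearson_raw  <- cor(x, y)
pearson_log  <- cor(log(x), log(y))
spearman_log <- cor(log(x), log(y), method = "spearman")

c(pearson_raw = pearson_raw, pearson_log = pearson_log)
c(spearman_raw = spearman_builtin, spearman_log = spearman_log)
```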
Note: Countries with higher per capita chocolate consumption have more Nobel laureates per capita.
Conclude (wrongly): Chocolate consumption causes better scientific research!
https://en.wikipedia.org/wiki/Causality
https://medium.com/@seema.singh/why-correlation-does-not-imply-causation-5b99790df07e
Spurious: Facebook use and the exam marks of its users.
Causality: smoking and lung cancer; wine and heart-disease risk.
The Pearson correlation coefficient measures the strength of the linear relationship (if any) between two variables.
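A small made-up example shows what "linear" means here: y is completely determined by x, yet the Pearson coefficient is essentially zero because the relationship is symmetric rather than linear.

```r
# y depends perfectly on x, but not linearly
x <- -10:10
y <- x^2

r_full <- cor(x, y)                   # zero up to rounding: no linear trend overall

# On the right half only, the same curve looks close to a straight line
r_half <- cor(x[x >= 0], y[x >= 0])

c(full_range = r_full, right_half = r_half)
```

A correlation near zero therefore rules out a linear trend, not a relationship in general.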
Even if the correlation coefficient is high, this does not mean there is a causal relationship between the two variables. Correlation does not tell you cause and effect!
Take care when using correlation for predictive purposes.
Causality: establishing it requires domain knowledge and a well-designed controlled experiment.
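A simulation can show how a high correlation arises without any causal link. In this sketch (all variables are hypothetical), a hidden variable z drives both x and y, and x never influences y. The raw correlation is high, but it largely disappears once the effect of z is removed from both variables.

```r
# Hypothetical setup: z is a hidden common cause of x and y
set.seed(3)
n <- 500
z <- rnorm(n)
x <- z + rnorm(n, sd = 0.5)   # x depends only on z
y <- z + rnorm(n, sd = 0.5)   # y depends only on z, never on x

r_xy <- cor(x, y)             # high, although x does not cause y

# Correlate what is left of x and y after regressing each on z
r_partial <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

c(raw = r_xy, after_removing_z = r_partial)
```

In real data the common cause is usually unobserved, which is why domain knowledge and controlled experiments are needed.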
Squared loss:
#y = read.csv("annual_temp.csv", header=TRUE)
#head(y)
#plot(Temp ~ CO2, data=y, pch=19,cex=.5, col="#440154")
#abline(lm(Temp ~ CO2, data=y), col="#21918c")
require(MASS)
plot(calls ~ year, data=phones, pch=19,cex=.5, col="#440154")
abline(lm(calls ~ year, data=phones), col="#21918c")
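# Absolute loss: sum of absolute residuals for the line with
# intercept x[1] and slope x[2]
ABSMINLINE = function(x) {
  with(phones, sum(abs(calls - x[1] - x[2]*year)))
}
# Minimise the absolute loss numerically (optim uses Nelder-Mead by default),
# starting from intercept 0 and slope 0
OPTIMAL = optim(c(0,0), fn = ABSMINLINE)
abline(lm(calls ~ year, data=phones), col="#21918c")   # least squares
abline(OPTIMAL$par, col="#3b528b")                     # least absolute deviations
abline(rlm(calls ~ year, data=phones), col="#fc8961")  # robust M-estimation (Huber by default)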
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
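The difference between the two losses is easiest to see on data where the true line is known. The sketch below uses made-up data with true slope 2 and five gross outliers (an assumption, loosely mimicking the unusual years in phones). It fits a least-squares line and then minimises the absolute loss with optim(), starting from the least-squares coefficients. The absolute-loss slope would be expected to stay nearer the true value, because large residuals are not squared.

```r
# Made-up data: true line y = 2 * x, with five gross outliers at the right end
set.seed(5)
x <- 1:30
y <- 2 * x + rnorm(30)
y[26:30] <- y[26:30] + 100

# Absolute loss for the line with intercept b[1] and slope b[2]
abs_loss <- function(b) sum(abs(y - b[1] - b[2] * x))

# Least-squares coefficients (minimise squared loss)
ls_fit <- coef(lm(y ~ x))

# Numerical minimisation of the absolute loss, started at the least-squares fit
lad_fit <- optim(ls_fit, abs_loss)$par

rbind(least_squares = ls_fit, least_absolute = lad_fit)
```

Nelder-Mead is a general-purpose search rather than an exact method for absolute loss, so the result is approximate.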
par(mfrow=c(2,2))
plot(lm(calls ~ year, data=phones), col="#440154")
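The least-squares line fitted by lm() minimises the squared loss, the sum of squared residuals. For a single predictor the minimiser has a closed form: the slope is cov(x, y) / var(x), and the line passes through the point of means. The sketch below checks this on made-up data (the variable names mirror phones, but the values are simulated).

```r
# Made-up data (an assumption; not the phones dataset)
set.seed(4)
year  <- 50:73
calls <- 5 + 2 * (year - 50) + rnorm(length(year), sd = 3)

# Squared loss for the line with intercept b[1] and slope b[2]
squared_loss <- function(b) sum((calls - b[1] - b[2] * year)^2)

# Closed-form least-squares solution
slope     <- cov(year, calls) / var(year)
intercept <- mean(calls) - slope * mean(year)

fit <- lm(calls ~ year)

rbind(closed_form = c(intercept, slope), lm = unname(coef(fit)))

# Moving away from the solution can only increase the squared loss
squared_loss(c(intercept, slope))
squared_loss(c(intercept + 1, slope))
```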