Paired Data

Bivariate data in which the two measurements are coupled or matched, for example two measurements taken on the same subject.

The two variables are not necessarily independent.

require(UsingR)
names(fat)
##  [1] "case"          "body.fat"      "body.fat.siri" "density"      
##  [5] "age"           "weight"        "height"        "BMI"          
##  [9] "ffweight"      "neck"          "chest"         "abdomen"      
## [13] "hip"           "thigh"         "knee"          "ankle"        
## [17] "bicep"         "forearm"       "wrist"
plot(fat$wrist, fat$neck)

par(mfrow=c(1,2))
plot(neck~wrist, data=fat)
plot(neck~wrist, data=fat, subset = 20 <= age & age < 30)

plot(fat$wrist, fat$neck)
abline(v=mean(fat$wrist))
abline(h=mean(fat$neck))
points(mean(fat$wrist), mean(fat$neck), pch=16, col=rgb(.35,0,0))

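The two mean lines split the plot into four quadrants. If the variables are related, most of the data should fall in the first and third quadrants (upper right and lower left), or in the second and fourth (upper left and lower right)!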
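
The quadrant idea can be checked by counting points. The following is a minimal sketch on simulated data; the helper name `quadrant_counts` and the simulated `x`, `y` are assumptions for illustration, not part of the original notes. It could equally be called as `quadrant_counts(fat$wrist, fat$neck)` once UsingR is loaded.

```r
# Hypothetical helper: count the points in each quadrant around (mean(x), mean(y)).
quadrant_counts <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  c(upper_right = sum(dx > 0 & dy > 0),   # first quadrant
    upper_left  = sum(dx < 0 & dy > 0),   # second quadrant
    lower_left  = sum(dx < 0 & dy < 0),   # third quadrant
    lower_right = sum(dx > 0 & dy < 0))   # fourth quadrant
}

# Simulated, positively related data (an assumption, for illustration only)
set.seed(1)
x <- rnorm(200)
y <- 0.8 * x + rnorm(200, sd = 0.5)

counts <- quadrant_counts(x, y)
counts
```

For positively related data the first and third counts dominate; for negatively related data the second and fourth do.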

Pearson’s Correlation Coefficient

cor(fat$wrist, fat$neck)
## [1] 0.7448264
require(MASS)
plot(Animals$body,Animals$brain)

cor(Animals$body,Animals$brain)
## [1] -0.005341163
cor(rank(Animals$body), rank(Animals$brain))
## [1] 0.7162994

https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
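
The correlation of the ranks computed above is Spearman's rank correlation. As a small sketch, base R's `cor()` gives the same value through `method = "spearman"`, and the value does not change under a strictly increasing transformation such as `log`. The variable names below are chosen here for illustration.

```r
library(MASS)  # provides the Animals data

# Pearson correlation of the ranks ...
r_ranks <- cor(rank(Animals$body), rank(Animals$brain))

# ... equals Spearman's correlation of the raw values
r_spearman <- cor(Animals$body, Animals$brain, method = "spearman")
all.equal(r_ranks, r_spearman)

# Ranks are unchanged by a strictly increasing transformation,
# so Spearman's correlation of the logged data is the same
r_log <- cor(log(Animals$body), log(Animals$brain), method = "spearman")
all.equal(r_spearman, r_log)
```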

Any benefits over Pearson’s correlation?

Chocolates and Nobel Prizes

Note: Countries with more per capita chocolate consumption have more per capita Nobel laureates.

Conclude: Chocolate consumption causes better scientific research!

Causation

https://en.wikipedia.org/wiki/Causality

https://medium.com/@seema.singh/why-correlation-does-not-imply-causation-5b99790df07e

Spurious: Facebook use and the marks (grades) of its users.

Causality: smoking and lung cancer, wine and heart risk.

The Pearson correlation coefficient measures the strength of the linear relationship (if any) between two variables.
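
Because the coefficient only picks up linear association, a perfect but nonlinear relationship can give a value near zero. A minimal sketch with made-up data (the vectors below are illustrative assumptions):

```r
# x is symmetric around 0, and y is an exact function of x
x <- seq(-1, 1, length.out = 101)
y <- x^2

r_quadratic <- cor(x, y)         # essentially zero despite exact dependence
r_linear    <- cor(x, 2 * x + 1) # an exact increasing line gives correlation 1

c(quadratic = r_quadratic, linear = r_linear)
```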

Even a high correlation coefficient does not mean there is a causal relationship between the two variables. It does not tell you cause and effect!
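
One way a high correlation can arise without causation is a hidden common cause. The simulation below is a sketch under assumed settings (the variable `z`, the noise levels, and the sample size are all made up for illustration): `x` and `y` never influence each other, yet they are strongly correlated until the linear effect of `z` is removed.

```r
set.seed(42)
n <- 1000
z <- rnorm(n)                 # hidden common cause (assumed)
x <- z + rnorm(n, sd = 0.5)   # x depends on z only
y <- z + rnorm(n, sd = 0.5)   # y depends on z only

r_xy <- cor(x, y)             # high, although x does not cause y

# Correlation of what is left after regressing each variable on z
r_partial <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

c(raw = r_xy, after_removing_z = r_partial)
```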

Take care when using correlation for predictive purposes.

Establishing causality requires domain knowledge and a well-designed controlled experiment.

https://www.tylervigen.com/spurious-correlations

Linear Regression

Squared loss:
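
A standard way to write this loss, assuming the fitted line is $y = a + b\,x$ and the data are the pairs $(x_i, y_i)$, $i = 1, \dots, n$ (the symbols $a$, $b$, $L$ are chosen here and do not come from the notes):

```latex
L(a, b) = \sum_{i=1}^{n} \left( y_i - a - b\,x_i \right)^2
```

Least-squares regression picks the intercept $a$ and slope $b$ that minimize $L(a, b)$.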

#y = read.csv("annual_temp.csv", header=TRUE)
#head(y)

#plot(Temp ~ CO2, data=y, pch=19,cex=.5, col="#440154")
#abline(lm(Temp ~ CO2, data=y), col="#21918c")

require(MASS)
plot(calls ~ year, data=phones, pch=19,cex=.5, col="#440154")
abline(lm(calls ~ year, data=phones), col="#21918c")
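
For a single predictor, the least-squares line has a closed form: the slope is the covariance divided by the variance of the predictor, and the line passes through the point of means. The sketch below checks this against `lm()` on the same `phones` data; the names `a`, `b`, and `fit` are chosen here for illustration.

```r
library(MASS)  # provides the phones data

x <- phones$year
y <- phones$calls

b <- cov(x, y) / var(x)      # least-squares slope
a <- mean(y) - b * mean(x)   # intercept: line passes through (mean(x), mean(y))

fit <- lm(calls ~ year, data = phones)

# The two rows should agree up to floating-point error
rbind(closed_form = c(a, b), lm = unname(coef(fit)))
```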

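# Sum of absolute deviations from the line x[1] + x[2]*year
ABSMINLINE = function(x) {
  with(phones, sum(abs(calls - x[1] - x[2]*year)))
}
# Minimize over (intercept, slope), starting from c(0,0)
OPTIMAL = optim(c(0,0), fn = ABSMINLINE)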

abline(lm(calls ~ year, data=phones), col="#21918c")
abline(OPTIMAL$par, col="#3b528b")
abline(rlm(calls ~ year, data=phones), col="#fc8961")
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps
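
The warning says the iterative fit stopped at the default limit of 20 steps. `rlm()` accepts a `maxit` argument, so one option is to allow more iterations. Whether 50 is enough here is an assumption, so the sketch prints the `converged` flag rather than taking it for granted. The object names are chosen for illustration.

```r
library(MASS)  # provides rlm() and the phones data

fit_ls  <- lm(calls ~ year, data = phones)
fit_rob <- rlm(calls ~ year, data = phones, maxit = 50)  # raise the iteration limit

fit_rob$converged  # check this flag before relying on the robust fit

# Compare the two lines: the robust slope is pulled less by the outlying years
rbind(least_squares = coef(fit_ls), robust = coef(fit_rob))
```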

par(mfrow=c(2,2))
plot(lm(calls ~ year, data=phones), col="#440154")

Some pointers

Gauss