
Today we will go over what correlation measures and some of the examples from Chapter 6.
The plots from Chapter 6
In introductory Statistics courses the difference between correlation and causation is discussed. These two ideas are not the same.
Often it is said, “Correlation is not causation.”
The Correlation Coefficient, r, measures the strength and direction of the linear association between two quantitative variables.
Memorize this!
Good interview question.
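To make the definition concrete, here is a minimal sketch that computes r by hand from deviations about the means and compares it with R's built-in cor(). The values are the hardcover pages and prices from the synthetic data later in these notes; the names pages, price, dx, dy and r_manual are illustrative only.

```r
# Hardcover pages and prices, copied from the synthetic data in these notes
pages <- c(150, 225, 342, 185)
price <- c(27.43, 48.76, 50.25, 32.01)

# Deviations of each variable from its mean
dx <- pages - mean(pages)
dy <- price - mean(price)

# r = sum of cross-products / sqrt(product of the sums of squares)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_manual            # hand-computed coefficient
cor(price, pages)   # built-in Pearson correlation, should agree
```

The sign of r gives the direction of the linear association and its magnitude gives the strength; r is unchanged if the two variables are swapped.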
Causation means that one variable causes an effect, linear or non-linear, on another variable.
A confounding variable is a third variable that influences both of the variables being studied.
### synthetic data
# Consider book price (y) by number of pages (x)
# z = type of book for each observation
z = c("hardcover", "hardcover", "hardcover", "hardcover",
      "paperback", "paperback", "paperback", "paperback")
x1 = c( 150, 225, 342, 185)
y1 = c( 27.43, 48.76, 50.25, 32.01 )
x2 = c( 475, 834, 1020, 790)
y2 = c( 10.00, 15.73, 20.00, 17.89 )
x = c(x1, x2)
y = c(y1, y2)
plot(x,y)
# correlation: all books, then hardcover only, then paperback only
cor(y, x)
[1] -0.5949366
cor(y1, x1)
[1] 0.8481439
cor(y2, x2)
[1] 0.9559518
# linear regression, all books pooled
lm(y ~ x)
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
   41.15238     -0.02665
# linear regression, hardcover books only
lm(y1 ~ x1)
Call:
lm(formula = y1 ~ x1)
Coefficients:
(Intercept)           x1
    13.0613       0.1177
# linear regression, paperback books only
lm(y2 ~ x2)
Call:
lm(formula = y2 ~ x2)
Coefficients:
(Intercept)           x2
    1.72389      0.01819
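The vector z is defined above but never enters any of the fits. As a minimal sketch of what bringing it in could look like, the model below adds type of book as a second predictor. It assumes a common slope for both types, which is an assumption of this sketch and not something these notes state. The names fit_pooled and fit_adjusted are illustrative.

```r
x <- c(150, 225, 342, 185, 475, 834, 1020, 790)                  # number of pages
y <- c(27.43, 48.76, 50.25, 32.01, 10.00, 15.73, 20.00, 17.89)   # price
z <- rep(c("hardcover", "paperback"), each = 4)                  # type of book

fit_pooled   <- lm(y ~ x)       # ignores type of book
fit_adjusted <- lm(y ~ x + z)   # common slope, separate intercept per type

coef(fit_pooled)["x"]     # slope for pages, type ignored
coef(fit_adjusted)["x"]   # slope for pages after accounting for type
```

The slope on x is negative in the pooled fit and positive once z is in the model, which matches the sign change between the pooled and the per-type regressions above.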
Summary: Simpson's Paradox occurs when the direction of a relationship between two variables reverses once another variable is taken into account.
Here, the relationship between Price and Number of pages is negative when all books are pooled (r = -0.59), but it is positive within each Type of Book (Hardcover: r = 0.85, Paperback: r = 0.96).
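One way to see the reversal is to color the scatterplot by type of book and draw the pooled and per-type fitted lines together. The following is a sketch only, using the synthetic data from these notes; the variable names (pages, price, type, cols, bt) and the colors are arbitrary choices made for illustration.

```r
pages <- c(150, 225, 342, 185, 475, 834, 1020, 790)
price <- c(27.43, 48.76, 50.25, 32.01, 10.00, 15.73, 20.00, 17.89)
type  <- factor(rep(c("hardcover", "paperback"), each = 4))

# Pooled correlation versus correlation within each type of book
r_pooled  <- cor(price, pages)
r_by_type <- sapply(split(seq_along(pages), type),
                    function(i) cor(price[i], pages[i]))

# Scatterplot colored by type of book
cols <- c(hardcover = "blue", paperback = "red")
plot(pages, price, col = cols[as.character(type)], pch = 19,
     xlab = "Number of pages", ylab = "Price")

# Pooled fit (dashed) and one fitted line per type (solid, colored)
abline(lm(price ~ pages), lty = 2)
for (bt in levels(type)) {
  abline(lm(price ~ pages, subset = type == bt), col = cols[bt])
}
legend("topright", legend = names(cols), col = cols, pch = 19)
```

The dashed pooled line slopes downward while both colored lines slope upward, so the plot shows the same reversal as the correlations stored in r_pooled and r_by_type.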
See the R Markdown document Simpsons Paradox.