2025-06-09


What is correlation?

Correlation measures the strength and direction of the linear relationship between two variables.

  • The correlation coefficient ranges from -1 to 1
  • A positive correlation means that as one variable increases, the other tends to increase
  • A negative correlation means that as one variable increases, the other tends to decrease
  • A correlation of 0 means there is no linear relationship between the two variables, although a nonlinear relationship may still exist
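The three cases can be checked with base R's `cor()`. This is a minimal sketch; the toy vectors `y_up`, `y_down`, and `y_flat` are invented for illustration and do not come from any real dataset.

```r
# Toy data chosen by hand for illustration (hypothetical values)
x      <- c(1, 2, 3, 4, 5)
y_up   <- c(2, 4, 5, 4, 6)   # tends to rise as x rises
y_down <- c(6, 4, 5, 4, 2)   # tends to fall as x rises
y_flat <- c(3, 1, 4, 1, 3)   # symmetric around the middle, no linear trend

r_up   <- cor(x, y_up)       # positive, about 0.85
r_down <- cor(x, y_down)     # negative, about -0.85
r_flat <- cor(x, y_flat)     # essentially 0

print(c(up = r_up, down = r_down, flat = r_flat))
```

Note that `y_flat` is not constant: it still varies, but its ups and downs cancel out along a straight line, which is all that Pearson's correlation measures.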

What is causation?

Causation means a change in one variable directly causes a change in another variable.

  • Just because two things are correlated does not mean one causes the other
  • Example: ice cream sales and shark attacks both increase in summer because warm weather drives both, but neither one causes the other
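A small simulation can make the ice cream example concrete. The sketch below assumes a hypothetical `temperature` variable that drives both series; the coefficients, noise levels, and seed are invented for illustration and are not real data.

```r
set.seed(123)  # arbitrary seed so the sketch is reproducible
n <- 200

# Hypothetical common cause: daily temperature
temperature <- runif(n, min = 10, max = 35)

# Both outcomes depend on temperature only, never on each other
ice_cream <- 50 + 5 * temperature + rnorm(n, sd = 15)
shark     <- 0.2 * temperature + rnorm(n, sd = 1)

# Raw correlation: clearly positive even though there is no direct link
r_raw <- cor(ice_cream, shark)

# Remove the part of each variable explained by temperature,
# then correlate what is left over
res_ice   <- resid(lm(ice_cream ~ temperature))
res_shark <- resid(lm(shark ~ temperature))
r_adjusted <- cor(res_ice, res_shark)  # expected to be close to 0

print(c(raw = r_raw, adjusted = r_adjusted))
```

Once the shared driver is accounted for, the association between the two series should largely disappear, which is the signature of a common cause rather than a causal link.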

Math: Correlation Coefficient

\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}} \]

This formula calculates Pearson’s correlation coefficient, where \(\bar{x}\) and \(\bar{y}\) are the sample means. It tells us how closely two continuous variables move together along a straight line.
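The formula can be translated into R term by term and compared with the built-in `cor()`. This is a sketch; the helper name `pearson_r` and the sample vectors are made up for illustration.

```r
# Hypothetical helper that follows the formula term by term
pearson_r <- function(x, y) {
  dx <- x - mean(x)                    # x_i - x-bar
  dy <- y - mean(y)                    # y_i - y-bar
  sum(dx * dy) / (sqrt(sum(dx^2)) * sqrt(sum(dy^2)))
}

# Small made-up sample
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 5)

r_manual  <- pearson_r(x, y)
r_builtin <- cor(x, y)                 # Pearson is the default method

print(all.equal(r_manual, r_builtin))  # the two should agree up to rounding
```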

Math: Coefficient of Determination

\[ R^2 = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}} \]

Where

  • \(\text{SS}_{\text{res}}\) is the residual sum of squares.
  • \(\text{SS}_{\text{tot}}\) is the total sum of squares.

This tells us the proportion of variance in the dependent variable that is explained by the model. In simple linear regression with a single independent variable, \(R^2\) equals the square of Pearson’s \(r\).
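As a sketch of how the two sums of squares are obtained in practice, the code below fits a simple linear regression with `lm()` and rebuilds \(R^2\) by hand. The data values are invented for illustration.

```r
# Made-up data with a roughly linear trend
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

fit <- lm(y ~ x)                      # simple linear regression with intercept

ss_res <- sum(resid(fit)^2)           # residual sum of squares
ss_tot <- sum((y - mean(y))^2)        # total sum of squares
r_squared <- 1 - ss_res / ss_tot

# Should match the value reported by lm, and the squared Pearson correlation
print(all.equal(r_squared, summary(fit)$r.squared))
print(all.equal(r_squared, cor(x, y)^2))
```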

ggplot: Strong Correlation Code (Slide 1/2)

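library(ggplot2)
set.seed(1)  # fix the random seed so the plot is reproducible
x = 1:100
y = 2 * x + rnorm(100, 0, 10)  # linear trend plus noise (sd = 10)
df1 = data.frame(x, y)
ggplot(df1, aes(x, y)) +
  geom_point() +
  # least-squares line, drawn without its confidence band
  geom_smooth(method = "lm", se = FALSE, col = "pink") +
  ggtitle("Strong Correlation")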

ggplot: Strong Correlation (Slide 2/2)

ggplot: No Correlation Code (Slide 1/2)

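# Reuses x from the previous slide; y_random is noise generated independently of x
y_random = rnorm(100)
df2 = data.frame(x, y_random)
ggplot(df2, aes(x, y_random)) +
  geom_point() +
  # the fitted line should be nearly flat
  geom_smooth(method = "lm", se = FALSE, col = "blue") +
  ggtitle("No Correlation")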
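To put numbers on the two plots, `cor()` can be applied to the same simulated data. This sketch regenerates the data with the same seed and draw order as the slides, assuming the code runs top to bottom in one R session; the names `r_strong` and `r_none` are introduced here for illustration.

```r
set.seed(1)
x <- 1:100
y <- 2 * x + rnorm(100, 0, 10)   # data behind the "Strong Correlation" plot
y_random <- rnorm(100)           # data behind the "No Correlation" plot

r_strong <- cor(x, y)            # expected to be close to 1
r_none   <- cor(x, y_random)     # expected to be close to 0

print(round(c(strong = r_strong, none = r_none), 3))
```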

ggplot: No Correlation (Slide 2/2)


plotly Plot: Exploring Correlation Code (Slide 1/2)

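library(plotly)

set.seed(42)  # fix the random seed so the plot is reproducible
x = rnorm(100)
y = 2 * x + rnorm(100)  # y depends on x
z = 3 * x + rnorm(100)  # z also depends on x, not on y

df = data.frame(x, y, z)

# Interactive 3D scatter plot; points are coloured by their z value
plot_ly(df, x = ~x, y = ~y, z = ~z,
        type = "scatter3d", mode = "markers",
        marker = list(size = 3, color = ~z, colorscale = "Viridis"))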
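The same three variables can also be summarized numerically. The sketch below rebuilds the data frame with the same seed and prints its correlation matrix; `cor_matrix` is a name introduced here for illustration. In this simulation `y` and `z` are each generated from `x`, so any correlation between `y` and `z` comes from their shared dependence on `x`, not from one causing the other.

```r
set.seed(42)
x <- rnorm(100)
y <- 2 * x + rnorm(100)   # depends on x
z <- 3 * x + rnorm(100)   # depends on x, not on y

df <- data.frame(x, y, z)

# Pairwise Pearson correlations between all three columns
cor_matrix <- cor(df)
print(round(cor_matrix, 2))
```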

plotly Plot: Exploring Correlation (Slide 2/2)

Misleading Correlation Examples

  • Spurious Correlation: When two variables look correlated but have no plausible causal connection.
  • Example: the number of Nicolas Cage movies per year and the number of swimming pool drownings per year (the yearly counts happen to rise and fall together)
  • Use caution when interpreting correlation in observational data.
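One way spurious correlations arise is by searching through many unrelated series until one happens to match. The sketch below illustrates this with pure noise; the series length, the number of candidate series, and the seed are arbitrary assumptions, and nothing here is real data.

```r
set.seed(2025)    # arbitrary seed so the sketch is reproducible
n_years  <- 10    # short series, e.g. ten yearly values
n_series <- 500   # number of unrelated candidate series to search through

target     <- rnorm(n_years)                      # pure noise
candidates <- matrix(rnorm(n_years * n_series),   # 500 more columns of pure noise
                     nrow = n_years)

# Correlation of the target with every candidate column
r_all <- as.vector(cor(target, candidates))

# The best match looks strongly correlated purely by chance
best_r <- max(abs(r_all))
print(best_r)
```

With short series and many candidates, a large `best_r` is expected even though every series was generated independently.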

Conclusion

  • Correlation is useful but not proof of causation.
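  • A causal relationship often shows up as a correlation, but not always: a nonlinear effect such as y = x² over a range symmetric around 0 can give a Pearson correlation near 0. A correlation on its own never establishes causation.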
  • Always ask if another factor could explain the relationship.
  • Statistical models and experimental designs help us infer causation more carefully.
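The value of experimental design can be sketched with a simulation in which a treatment has no true effect. The variable names (`health`, `treated_obs`, `treated_rct`), effect sizes, and seed below are hypothetical and chosen only for illustration.

```r
set.seed(7)   # arbitrary seed so the sketch is reproducible
n <- 2000

health <- rnorm(n)   # hypothetical unobserved confounder

# Observational setting: healthier people are more likely to take the treatment
treated_obs <- rbinom(n, 1, plogis(2 * health))
outcome_obs <- health + rnorm(n)   # the treatment itself has no effect
diff_obs <- mean(outcome_obs[treated_obs == 1]) -
            mean(outcome_obs[treated_obs == 0])

# Randomized setting: treatment is assigned by coin flip, independent of health
treated_rct <- rbinom(n, 1, 0.5)
outcome_rct <- health + rnorm(n)   # still no treatment effect
diff_rct <- mean(outcome_rct[treated_rct == 1]) -
            mean(outcome_rct[treated_rct == 0])

print(c(observational = diff_obs, randomized = diff_rct))
```

In the observational comparison the treated group looks better off only because it was healthier to begin with; randomization breaks that link, so the difference should shrink toward zero.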