6/8/2025

Introduction

  • In data science, correlation and causation are often confused.
  • Just because two variables are related does not mean that one causes the other.
  • Misinterpreting this can lead to poor decisions in business, health, and policy.

What is Correlation?

  • Correlation measures the strength and direction of a linear relationship between two variables.
  • Values range from -1 to 1:
    • +1: Perfect positive linear relationship
    • 0: No linear relationship
    • –1: Perfect negative linear relationship
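The three reference values can be reproduced with a small sketch. The data below are made up for illustration, and NumPy's `np.corrcoef` is assumed to be available; the last case shows that a value of 0 rules out only a linear relationship, not a relationship of any kind.

```python
import numpy as np

# Illustrative data only; not taken from any real dataset.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# y rises exactly in a straight line with x: perfect positive correlation.
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]

# y falls exactly in a straight line with x: perfect negative correlation.
r_neg = np.corrcoef(x, -3 * x + 4)[0, 1]

# y = x^2 depends completely on x, but not linearly, so r is 0.
r_zero = np.corrcoef(x, x ** 2)[0, 1]

print(r_pos, r_neg, r_zero)
```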

What is Causation?

  • Causation means that one variable directly affects another.
  • In statistics, we cannot assume causation from correlation alone.
  • To establish causation, we typically need one or more of:
    • Controlled experiments
    • Randomized trials
    • Statistical models that account for confounding variables

Real-Life Example:

  • Ice cream sales and drowning incidents increase during summer.
  • Are they causally related? ❌ No.
  • Hidden variable: Temperature. ☀️

This is an example of a spurious correlation: the two variables appear related, but neither causes the other. Both rise with temperature.
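The ice cream example can be imitated with simulated data. This is a sketch under invented assumptions: the coefficients, noise levels, and variable names are hypothetical and chosen only so that temperature drives both quantities. Removing the part of each variable explained by temperature (a simple partial correlation) makes the apparent relationship disappear.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process: temperature drives both variables,
# and neither variable influences the other.
temperature = rng.normal(25, 5, n)
ice_cream = 2.0 * temperature + rng.normal(0, 5, n)
drownings = 0.5 * temperature + rng.normal(0, 2, n)

# Raw correlation between the two outcomes looks strong.
raw_r = np.corrcoef(ice_cream, drownings)[0, 1]


def residuals(y, x):
    """Return what is left of y after a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Correlate what remains once temperature has been accounted for.
partial_r = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw r = {raw_r:.2f}, r after adjusting for temperature = {partial_r:.2f}")
```

In this simulation the raw correlation is large while the adjusted one is close to zero, which is the pattern expected when a hidden variable explains the association.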

3D Correlation Example

This 3D plot shows how three numeric variables can be related to one another.

Correlation Coefficient Formula

The Pearson correlation coefficient \(r\) is calculated as:

\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}} \]

  • \(x_i\), \(y_i\): The \(i\)-th paired observations of \(x\) and \(y\)
  • \(\bar{x}\), \(\bar{y}\): Means of \(x\) and \(y\)
  • The result is always between –1 and 1

A strong correlation (close to –1 or 1) suggests a strong linear relationship.
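The Pearson formula can be translated term by term into code. The sketch below uses a hypothetical helper name, `pearson_r`, and made-up study-hours data; it is checked against NumPy's built-in `np.corrcoef` rather than presented as a replacement for it.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient, written to mirror the formula."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # x_i - x_bar
    dy = y - y.mean()  # y_i - y_bar
    numerator = np.sum(dx * dy)
    denominator = np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2))
    return numerator / denominator


# Made-up example data: hours studied and test score.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]

r = pearson_r(hours, score)
r_numpy = np.corrcoef(hours, score)[0, 1]
print(r, r_numpy)
```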

Correlation ≠ Causation

Sometimes variables are correlated purely by coincidence, with no causal link between them.
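Coincidental correlation is easy to produce when many variables are compared on few observations. As a rough illustration, the sketch below generates 50 unrelated random variables with 10 observations each (sizes chosen arbitrarily) and looks for the strongest pairwise correlation.

```python
import numpy as np

rng = np.random.default_rng(42)

# 10 observations (rows) of 50 independent random variables (columns).
data = rng.normal(size=(10, 50))

# 50 x 50 correlation matrix between columns.
corr = np.corrcoef(data, rowvar=False)

# Ignore each variable's correlation with itself.
np.fill_diagonal(corr, 0.0)

max_abs_r = np.abs(corr).max()
print(f"strongest correlation among unrelated variables: {max_abs_r:.2f}")
```

Even though every variable is pure noise, at least one pair ends up looking strongly correlated, simply because more than a thousand pairs were compared.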

Summary: Correlation vs. Causation

  • Correlation tells us two variables move together, but not why.
  • Causation means one variable directly affects the other.
  • To infer causation, we need:
    • Randomized experiments
    • Control for confounders
    • Strong statistical design

\[ \text{Causal Effect: } \Delta Y = Y_1 - Y_0 \]

  • \(Y_1\): Outcome if treatment is given
  • \(Y_0\): Outcome if treatment is not given
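  • The difference \(\Delta Y\) is the amount by which the treatment changes the outcome
  • Only one of \(Y_1\) and \(Y_0\) can be observed for any individual, so the effect is estimated by comparing groups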
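The role of randomization can be sketched with simulated potential outcomes. Everything here is hypothetical: the true effect of 2.0, the `health` confounder, and the assignment rules are invented so that the right answer is known in advance. The comparison is between a coin-flip assignment and one in which healthier individuals are more likely to take the treatment.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
true_effect = 2.0  # assumed Delta Y = Y_1 - Y_0 for every individual

# Hypothetical confounder that raises the outcome regardless of treatment.
health = rng.normal(0, 1, n)

# Potential outcomes: y0 without treatment, y1 with treatment.
y0 = 5.0 + 3.0 * health + rng.normal(0, 1, n)
y1 = y0 + true_effect

# Randomized experiment: treatment is assigned by coin flip.
treated_rand = rng.random(n) < 0.5
observed_rand = np.where(treated_rand, y1, y0)
est_rand = observed_rand[treated_rand].mean() - observed_rand[~treated_rand].mean()

# Observational setting: healthier individuals tend to choose treatment.
treated_self = (health + rng.normal(0, 1, n)) > 0
observed_self = np.where(treated_self, y1, y0)
est_self = observed_self[treated_self].mean() - observed_self[~treated_self].mean()

print(f"true effect:           {true_effect:.2f}")
print(f"randomized estimate:   {est_rand:.2f}")
print(f"self-selected estimate: {est_self:.2f}")
```

In this simulation the randomized difference in means lands near the true effect, while the self-selected comparison overstates it because the treated group was healthier to begin with.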