Understanding Correlation: Pearson, Spearman, and Kendall’s Tau


A hands-on guide to the three most used correlation methods in R — what they measure, when to use each one, and how to calculate them from first principles.


What is Correlation?

Correlation is one of the most natural starting points when exploring data. It answers a deceptively simple question: do these two things tend to move together?

Think about studying and exam scores — if you study more, do you tend to score higher? Or consider ice cream sales and drowning incidents. Believe it or not, those two are positively correlated. But no one thinks ice cream causes drowning. Warmer weather drives both. That’s a classic spurious correlation — variables moving in sync for a hidden reason.

Key principle

Correlation tells you whether and how strongly two variables are associated — not why. Always be skeptical of assuming causation from correlation alone.
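To make the ice cream example concrete, here is a minimal simulation sketch. The variable names and all coefficients are made up for illustration: a hypothetical temperature variable drives both outcomes, and neither outcome affects the other. Correlating the residuals after regressing each outcome on temperature (a simple form of partial correlation) shows how the association fades once the hidden driver is accounted for.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

n <- 500
temperature <- rnorm(n, mean = 25, sd = 5)                 # hypothetical confounder
ice_cream   <- 10 + 2.0 * temperature + rnorm(n, sd = 5)   # made-up coefficients
drownings   <-  1 + 0.3 * temperature + rnorm(n, sd = 1)   # made-up coefficients

# Raw correlation: positive, although neither variable influences the other
r_raw <- cor(ice_cream, drownings)

# Remove the linear effect of temperature from both, then correlate what is left
res_ice   <- resid(lm(ice_cream ~ temperature))
res_drown <- resid(lm(drownings ~ temperature))
r_partial <- cor(res_ice, res_drown)

cat("Raw correlation:    ", round(r_raw, 3), "\n")
cat("Partial correlation:", round(r_partial, 3), "\n")
```

In real data the confounder is rarely known this cleanly, so treat this as an illustration of the idea rather than a general recipe.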

The correlation coefficient captures both the direction and strength of a relationship in a single number, always ranging from −1 to +1:

Value   Meaning
+1      Perfect positive relationship
 0      No linear (Pearson) or monotonic (Spearman, Kendall) association
−1      Perfect negative relationship

In this article, we walk through three methods for calculating correlation — Pearson, Spearman, and Kendall’s tau — each suited to different types of data and relationships. We’ll also build each one from first principles so the mechanics become clear.

Packages used
library(ggplot2)    # Plotting
library(dplyr)      # Data wrangling
library(tidyr)      # Data reshaping
library(patchwork)  # Combining ggplot2 panels side by side

Pearson Correlation

The Pearson correlation coefficient — denoted r — is the workhorse of correlation. It measures the strength and direction of a linear relationship between two continuous variables.

When to use it

Before calculating Pearson correlation, three assumptions should be met:

  1. Both variables are continuous
  2. The relationship is approximately linear
  3. There are no extreme outliers

Tip

For non-normal data, some researchers recommend transforming the variables first. Others argue Pearson is robust to mild normality violations (Havlicek & Peterson, 1976). When in doubt, visualize with a scatter plot — your eyes are often the best check.

These assumptions are ideal conditions, but Pearson is often reasonably robust to mild violations.
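The third assumption is the one that tends to bite hardest. As a rough sketch, the code below builds a clean linear data set and then appends a single made-up extreme point (the values 30 and -20 are arbitrary choices for illustration). Comparing the two coefficients shows how far one observation can pull Pearson's r.

```r
set.seed(2)  # arbitrary seed, used only for reproducibility

x_clean <- rnorm(50, mean = 8, sd = 2)
y_clean <- 0.5 * x_clean + rnorm(50, sd = 1)

# Append one extreme, made-up observation far from the rest of the cloud
x_out <- c(x_clean, 30)
y_out <- c(y_clean, -20)

r_clean <- cor(x_clean, y_clean)
r_out   <- cor(x_out, y_out)

cat("Pearson r without outlier:", round(r_clean, 3), "\n")
cat("Pearson r with outlier:   ", round(r_out, 3), "\n")
```

A scatter plot would reveal such a point immediately, which is one more reason to plot before computing anything.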

Step 1 — Generate and visualize the data

We generate two variables, x and y, where y is linearly related to x plus random noise. A scatter plot first confirms linearity.

set.seed(404)

x <- rnorm(100, mean = 8, sd = 2)
y <- 0.5 * x + rnorm(100, sd = 1)

plot(x, y,
     main = "Scatterplot of x vs y",
     xlab = "x", ylab = "y",
     pch = 19, col = "steelblue")

Scatter plot of x vs y. The upward trend suggests a positive linear relationship.

Good. The upward trend is clear — as x increases, y tends to increase too.

Step 2 — Mean-centering

The core idea behind Pearson correlation is mean-centering: for each value, subtract the variable’s mean. This shifts the data so the center is zero. What remains tells us how each observation sits above or below average.

# Find the means
mean_x <- mean(x)
mean_y <- mean(y)

cat("Mean of x:", round(mean_x, 4), "\n")
Mean of x: 8.3025 
cat("Mean of y:", round(mean_y, 4), "\n")
Mean of y: 4.098 
# Subtract the mean from each value
dev_x <- x - mean_x
dev_y <- y - mean_y

After centering, positive deviations mean “above average” and negative deviations mean “below average.” Now imagine multiplying each pair of deviations together — this reveals whether both variables move in the same direction at the same time.

The quadrant intuition

When we plot the mean-centered deviations, the plane divides into four quadrants. The sign of the product tells us the direction of co-movement:

  • Quadrant I (+, +): Both above average → positive product ✓
  • Quadrant II (−, +): x below avg, y above → negative product
  • Quadrant III (−, −): Both below average → positive product ✓
  • Quadrant IV (+, −): x above avg, y below → negative product

If most points fall in Quadrants I and III, the products sum to a large positive value — indicating positive correlation.
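The quadrant idea can also be checked numerically. The sketch below regenerates the same x and y (same seed as above) so that it runs on its own, labels every point by quadrant, and sums the deviation products per quadrant. The helper names quadrant and quad_sums are introduced here for illustration only.

```r
set.seed(404)
x <- rnorm(100, mean = 8, sd = 2)
y <- 0.5 * x + rnorm(100, sd = 1)

dev_x    <- x - mean(x)
dev_y    <- y - mean(y)
prod_dev <- dev_x * dev_y

# Label each point by the quadrant of its centered coordinates.
# (A deviation of exactly zero would fall into "IV" here; with
# continuous simulated data that practically never happens.)
quadrant <- ifelse(dev_x > 0 & dev_y > 0, "I",
            ifelse(dev_x < 0 & dev_y > 0, "II",
            ifelse(dev_x < 0 & dev_y < 0, "III", "IV")))

table(quadrant)                                 # number of points per quadrant

quad_sums <- tapply(prod_dev, quadrant, sum)    # each quadrant's share of the numerator
quad_sums
sum(quad_sums)                                  # equals sum(dev_x * dev_y)
```

The per-quadrant sums add up to the numerator of Pearson's r, so the balance between Quadrants I and III on one side and II and IV on the other decides the sign of the coefficient.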

par(mfrow = c(1, 2))

# Left: original data with mean lines
plot(x, y,
     main = "Raw data with mean lines",
     xlab = "x", ylab = "y",
     pch = 19, col = "steelblue")
abline(h = mean_y, col = "black", lty = 3, lwd = 2)
abline(v = mean_x, col = "black", lty = 6, lwd = 2)
legend("topright",
       legend = c("Mean of x", "Mean of y"),
       col = "black", lty = c(6, 3), bty = "n")

# Right: centered deviations coloured by product sign
prod_dev <- dev_x * dev_y
colors   <- ifelse(prod_dev > 0, "salmon", "cornflowerblue")

plot(dev_x, dev_y,
     main = "Mean-centered deviations",
     xlab = "Deviations x", ylab = "Deviations y",
     pch = 19, col = colors)
abline(h = 0, col = "black", lty = 3, lwd = 2)
abline(v = 0, col = "black", lty = 6, lwd = 2)

# Draw vectors from the origin
for (i in seq_along(dev_x)) {
  lines(c(0, dev_x[i]), c(0, dev_y[i]), col = "gray80", lty = 3)
}

# Quadrant labels
x_off <- max(abs(dev_x)) * 0.65
y_off <- max(abs(dev_y)) * 0.65
text( x_off,  y_off, "Quadrant I (+,+)",   cex = 0.8)
text(-x_off,  y_off, "Quadrant II (-,+)",  cex = 0.8)
text(-x_off, -y_off, "Quadrant III (-,-)", cex = 0.8)
text( x_off, -y_off, "Quadrant IV (+,-)",  cex = 0.8)

par(mfrow = c(1, 1))

Left: raw data with mean lines. Right: mean-centered deviations coloured by product sign. Salmon = positive product (same direction), blue = negative product (opposite direction).

Step 3 — Calculate by hand

\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \cdot \sum (y_i - \bar{y})^2}} \]

# Numerator: sum of products of deviations
sum_product_dev <- sum(dev_x * dev_y)

# Denominator: square root of product of summed squared deviations
sum_sq_x <- sum(dev_x^2)
sum_sq_y <- sum(dev_y^2)

# Pearson's r
r_manual <- sum_product_dev / sqrt(sum_sq_x * sum_sq_y)
cat("Pearson r (manual):", round(r_manual, 7), "\n")
Pearson r (manual): 0.6791683 

We can verify this instantly with cor():

cor(x, y, method = "pearson")
[1] 0.6791683

Interpretation

A value of 0.68 indicates a moderate positive linear relationship. Both variables tend to rise together, though there is meaningful scatter around the trend line.
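Two equivalent ways of writing the same quantity may help connect the formula to more familiar tools: Pearson's r is the covariance divided by both standard deviations, and it is also the average product of z-scores. The sketch below regenerates the data with the same seed so it runs on its own; the names r_cov and r_z are illustrative.

```r
set.seed(404)
x <- rnorm(100, mean = 8, sd = 2)
y <- 0.5 * x + rnorm(100, sd = 1)

# (a) Covariance scaled by both standard deviations
r_cov <- cov(x, y) / (sd(x) * sd(y))

# (b) Average product of z-scores, dividing by n - 1 to match sd() and cov()
z_x <- (x - mean(x)) / sd(x)
z_y <- (y - mean(y)) / sd(y)
r_z <- sum(z_x * z_y) / (length(x) - 1)

c(cov_form = r_cov, z_form = r_z, builtin = cor(x, y))
```

All three entries should agree up to floating-point error, which is why r carries no units: the scale of each variable is divided out.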


Spearman Correlation

What if the relationship between your variables isn’t a straight line? Maybe it’s exponential: always increasing, but not at a constant rate. Pearson may still pick up part of the trend, but Spearman is designed for exactly this case. It measures the strength of monotonic association, not the exact functional form of the relationship.

A monotonic relationship is one where, as one variable increases, the other consistently moves in a single direction: it either keeps increasing or keeps decreasing. The rate of change doesn’t have to be constant; it just can’t reverse direction.

Types of monotonic relationships

# Increasing
x1      <- seq(1, 100, by = 1)
y1_base <- exp(x1 / 20)
y1      <- y1_base + rnorm(100, 0, 10)

# Decreasing
x2      <- seq(0, 10, length.out = 100)
y2_base <- 10 * exp(-0.5 * x2)
y2      <- y2_base + rnorm(100, 0, 1)

# Non-monotonic (sine wave on top of a linear trend)
x3      <- seq(1, 100, by = 1)
y3_base <- sin(x3 / 20) * 10 + x3 * 0.2
y3      <- y3_base + rnorm(100, 0, 2)

p_inc <- ggplot(data.frame(x = x1, y = y1, base = y1_base), aes(x, y)) +
  geom_point(color = "gray60", alpha = 0.5, size = 1.5) +
  geom_line(aes(y = base), color = "steelblue", linewidth = 1.2) +
  labs(title = "Monotonically increasing", x = "X", y = "Y") +
  theme_classic()

p_dec <- ggplot(data.frame(x = x2, y = y2, base = y2_base), aes(x, y)) +
  geom_point(color = "gray60", alpha = 0.5, size = 1.5) +
  geom_line(aes(y = base), color = "steelblue", linewidth = 1.2) +
  labs(title = "Monotonically decreasing", x = "X", y = "Y") +
  theme_classic()

p_non <- ggplot(data.frame(x = x3, y = y3, base = y3_base), aes(x, y)) +
  geom_point(color = "gray60", alpha = 0.5, size = 1.5) +
  geom_line(aes(y = base), color = "#c0622a", linewidth = 1.2) +
  labs(title = "Not monotonic", x = "X", y = "Y") +
  theme_classic()

library(patchwork)
p_inc | p_dec | p_non

Examples of monotonically increasing (left), monotonically decreasing (centre), and non-monotonic (right) relationships.

The key idea: ranks, not raw values

Spearman doesn’t use the actual values of your data — it converts them to ranks first. The smallest value gets rank 1, the next gets rank 2, and so on. This makes the method resistant to outliers and valid for any monotonic shape.

x_example <- c(3, 8, 5, 9, 7)
rank(x_example)
[1] 1 4 2 5 3

The values 3, 8, 5, 9, 7 map to ranks 1, 4, 2, 5, 3.
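The example above has no repeated values. When values tie, rank() has to decide what to do, and its ties.method argument controls that. A small sketch with a made-up vector containing two 8s:

```r
x_ties <- c(3, 8, 5, 8, 7)   # made-up vector with a tie

r_avg   <- rank(x_ties)                         # default: ties.method = "average"
r_min   <- rank(x_ties, ties.method = "min")    # tied values share the lower rank
r_first <- rank(x_ties, ties.method = "first")  # ties broken by order of appearance

r_avg    # 1.0 4.5 2.0 4.5 3.0
r_min    # 1 4 2 4 3
r_first  # 1 4 2 5 3
```

The default average ranks are what cor(..., method = "spearman") works with, so tied observations receive the same rank in both the manual and the built-in calculation.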

Step-by-step calculation

\[ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \]

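where \(d_i\) is the difference between the two ranks of observation \(i\) (its rank on x minus its rank on y) and \(n\) is the number of observations. This shortcut is exact only when there are no tied ranks.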

df1 <- data.frame(
  x = seq(1, 100, by = 1),
  y = exp(seq(1, 100, by = 1) / 20) + rnorm(100, 0, 10)
)

# Rank each variable
df1$rank_x <- rank(df1$x)
df1$rank_y <- rank(df1$y)

# Difference in ranks, then square
df1$d  <- df1$rank_x - df1$rank_y
df1$d2 <- df1$d^2

# Spearman's rho
n   <- nrow(df1)
rho <- 1 - (6 * sum(df1$d2)) / (n * (n^2 - 1))
cat("Spearman rho (manual):", round(rho, 7), "\n")
Spearman rho (manual): 0.8692829 
cor(df1$x, df1$y, method = "spearman")
[1] 0.8692829

Both approaches return 0.87 — a strong positive monotonic relationship, even though the underlying trend is exponential rather than linear.
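Another way to look at Spearman's rho is as Pearson's r computed on the ranks. That view also covers data with ties, where the squared-difference shortcut is no longer exact. The sketch below uses made-up ordinal-style data with many repeated values; all names ending in _t are illustrative.

```r
set.seed(3)  # arbitrary seed, used only for reproducibility

# Hypothetical ordinal-style data on a 1-5 scale, with many ties
x_t <- sample(1:5, 30, replace = TRUE)
y_t <- pmin(5, pmax(1, x_t + sample(-1:1, 30, replace = TRUE)))

rx  <- rank(x_t)   # average ranks for tied values
ry  <- rank(y_t)
n_t <- length(x_t)

rho_shortcut <- 1 - 6 * sum((rx - ry)^2) / (n_t * (n_t^2 - 1))  # assumes no ties
rho_ranks    <- cor(rx, ry)                                     # Pearson on the ranks
rho_builtin  <- cor(x_t, y_t, method = "spearman")

c(shortcut = rho_shortcut, pearson_on_ranks = rho_ranks, builtin = rho_builtin)
```

The last two values should match exactly, while the shortcut will generally drift away from them when ties are present. With tie-free data, such as df1 above, all three coincide.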

Visualising the ranks

df1$highlight <- ifelse(row.names(df1) %in% c("10", "51"), "highlight", "normal")

ggplot(df1, aes(x = rank_x, y = rank_y)) +
  geom_point(aes(color = highlight), size = 2, alpha = 0.8) +
  geom_abline(slope = 1, intercept = 0,
              linetype = "dashed", color = "gray50") +
  scale_color_manual(values = c("highlight" = "#c0622a", "normal" = "steelblue")) +
  labs(x = "Rank of x", y = "Rank of y",
       title = "Rank agreement between x and y") +
  theme_classic() +
  theme(legend.position = "none")

Ranks of x vs ranks of y. Points near the 1:1 dashed line have similar ranks in both variables. Points far from it show greater rank disagreement.

Watch out: non-monotonic data

For a U-shaped (quadratic) relationship, Spearman returns something close to zero. That is not because there’s no pattern, but because the pattern changes direction, so the rank association on one side cancels the association on the other.

df_quad <- data.frame(x = seq(-3, 3, length.out = 100))
df_quad$y <- df_quad$x^2 + rnorm(100, 0, 1)

ggplot(df_quad, aes(x, y)) +
  geom_point(color = "gray60", alpha = 0.6) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2),
              se = FALSE, color = "#c0622a", linewidth = 1.2) +
  labs(title = "Quadratic relationship — Spearman near zero",
       subtitle = paste("Spearman rho =",
                        round(cor(df_quad$x, df_quad$y, method = "spearman"), 3)),
       x = "x", y = "y") +
  theme_classic()

A clear quadratic relationship yields a Spearman rho near zero — the method misses the pattern entirely.
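One way to see the cancellation is to split the data at the turning point and compute Spearman's rho on each half. The sketch below regenerates similar quadratic data (with an arbitrary seed, so the numbers will not match the figure exactly) and assumes the turning point is known to be at x = 0.

```r
set.seed(4)  # arbitrary seed, used only for reproducibility

x_q <- seq(-3, 3, length.out = 100)
y_q <- x_q^2 + rnorm(100, 0, 1)

left  <- x_q < 0   # falling arm of the U
right <- x_q > 0   # rising arm of the U

rho_all   <- cor(x_q, y_q, method = "spearman")
rho_left  <- cor(x_q[left],  y_q[left],  method = "spearman")
rho_right <- cor(x_q[right], y_q[right], method = "spearman")

round(c(all = rho_all, left = rho_left, right = rho_right), 3)
```

Each half on its own is strongly monotonic, one negative and one positive, and the overall coefficient is much weaker than either. In practice the turning point is usually unknown, so a scatter plot remains the first line of defence.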

Kendall’s Tau

Kendall’s tau (τ) takes a different approach from Pearson and Spearman. Instead of using raw values or rank differences, it compares pairs of observations and evaluates whether their relative ordering is consistent across both variables.

In simple terms, it asks: for any two observations, do both variables move in the same direction?

It is especially useful when:

  • Working with small datasets
  • Variables are ordinal (e.g., Likert scale ratings, satisfaction scores)
  • The data contains many tied ranks

Concordant vs discordant pairs

Take any two observations. Look at their ranks on both variables. If both ranks go in the same direction (observation A ranks higher on both variables), the pair is concordant. If the ranks go in opposite directions, it’s discordant.

Example — using ranked pairs (2,7), (3,5), (10,8):

Pair               rank_x direction   rank_y direction   Result
(2,7) vs (3,5)     2 → 3 ↑            7 → 5 ↓            ✗ Discordant
(2,7) vs (10,8)    2 → 10 ↑           7 → 8 ↑            ✓ Concordant
(3,5) vs (10,8)    3 → 10 ↑           5 → 8 ↑            ✓ Concordant
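The same classification can be reproduced in a few lines of R. This is only a sketch of the idea; the names pts, pairs_idx and result are introduced here for illustration.

```r
pts <- data.frame(rank_x = c(2, 3, 10), rank_y = c(7, 5, 8))

pairs_idx <- combn(nrow(pts), 2)   # each column holds the row indices of one pair

dx <- pts$rank_x[pairs_idx[2, ]] - pts$rank_x[pairs_idx[1, ]]
dy <- pts$rank_y[pairs_idx[2, ]] - pts$rank_y[pairs_idx[1, ]]

# Same sign of change on both variables: concordant; opposite signs: discordant
result <- ifelse(dx * dy > 0, "concordant",
          ifelse(dx * dy < 0, "discordant", "tied"))

data.frame(first = pairs_idx[1, ], second = pairs_idx[2, ], dx, dy, result)
# result column: discordant, concordant, concordant
```

With two concordant pairs and one discordant pair out of three, these points lean towards positive agreement.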

The formula

\[ \tau = \frac{C - D}{C + D} \]

where \(C\) = number of concordant pairs and \(D\) = number of discordant pairs.

The total number of unique pairs from \(n\) observations is:

\[ \binom{n}{2} = \frac{n(n-1)}{2} \]

Step-by-step calculation

x <- sample(1:100, 10)
y <- sample(1:100, 10)

df <- data.frame(x, y)
df$rank_x <- rank(df$x)
df$rank_y <- rank(df$y)

head(df, n = 5)
 x  y rank_x rank_y
98 29     10      5
92 11      9      1
24 16      5      2
 4 59      2      6
 2 64      1      7
# Count concordant (C) and discordant (D) pairs
C <- 0
D <- 0
n <- nrow(df)

for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    dx <- df$rank_x[j] - df$rank_x[i]
    dy <- df$rank_y[j] - df$rank_y[i]

    if (dx * dy > 0) {
      C <- C + 1
    } else if (dx * dy < 0) {
      D <- D + 1
    }
  }
}

cat("Concordant pairs:", C, "\n")
Concordant pairs: 20 
cat("Discordant pairs:", D, "\n")
Discordant pairs: 25 
cat("Total pairs:",      C + D, "\n")
Total pairs: 45 
# Kendall's tau-a
tau <- (C - D) / (C + D)
cat("Kendall's tau (manual):", round(tau, 7), "\n")
Kendall's tau (manual): -0.1111111 
cor(x, y, method = "kendall")
[1] -0.1111111

Tau variants: τ-a, τ-b, τ-c

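The formula above matches tau-a only when there are no ties: in that case C + D equals the total number of pairs, n(n − 1)/2, which is the denominator of tau-a. In R, cor(..., method = "kendall") returns tau-b, which adjusts the denominator for tied ranks and reduces to tau-a when there are none. Tau-c (Stuart’s tau-c) is a further variant intended for rectangular contingency tables, where the two variables have different numbers of categories. For most use cases, tau-b is the right choice.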
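To see the two variants side by side, here is a small sketch on made-up data with ties. It uses the usual textbook definition of tau-b, where the denominator is the geometric mean of the number of pairs not tied on x and the number not tied on y. All names ending in _k, plus n0, n1 and n2, are illustrative.

```r
x_k <- c(1, 2, 2, 3, 4, 4, 5)   # made-up data containing ties
y_k <- c(1, 3, 2, 3, 5, 4, 4)

n_k <- length(x_k)
idx <- combn(n_k, 2)            # all unique pairs of observations

# +1 for concordant, -1 for discordant, 0 if tied on x or y
s <- sign(x_k[idx[2, ]] - x_k[idx[1, ]]) * sign(y_k[idx[2, ]] - y_k[idx[1, ]])

C_k <- sum(s > 0)
D_k <- sum(s < 0)
n0  <- n_k * (n_k - 1) / 2      # total number of pairs

# Pairs tied on x (n1) and on y (n2), counted from the size of each tie group
n1 <- sum(choose(as.vector(table(x_k)), 2))
n2 <- sum(choose(as.vector(table(y_k)), 2))

tau_a <- (C_k - D_k) / n0
tau_b <- (C_k - D_k) / sqrt((n0 - n1) * (n0 - n2))

c(tau_a = tau_a, tau_b = tau_b, builtin = cor(x_k, y_k, method = "kendall"))
```

The built-in value should agree with tau_b rather than tau_a, and the two variants coincide whenever n1 and n2 are both zero.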


Correlation Is Not Slope

This is a common misconception worth clearing up. Correlation and slope both describe relationships between variables — but they measure fundamentally different things.

Concept       Symbol    What it measures
Correlation   r, ρ, τ   Strength of association (scale-free)
Slope         m         Rate of change (depends on scale and units)

Two datasets can have the same correlation but very different slopes:

x <- 1:100

# y1: gentle slope, moderate correlation
y1 <- x + rnorm(100, mean = 0, sd = 25)

# y2: steep slope, same correlation
y2 <- 5 * x + rnorm(100, mean = 0, sd = 130)

cor1    <- cor(x, y1)
cor2    <- cor(x, y2)
slope1  <- coef(lm(y1 ~ x))[[2]]
slope2  <- coef(lm(y2 ~ x))[[2]]

label_text <- paste0(
  "y1: r = ", round(cor1, 2), ", slope = ", round(slope1, 2), "\n",
  "y2: r = ", round(cor2, 2), ", slope = ", round(slope2, 2)
)

df4 <- data.frame(x = x, y1 = y1, y2 = y2) |>
  pivot_longer(cols = c(y1, y2), names_to = "Response", values_to = "y")

ggplot(df4, aes(x = x, y = y, color = Response)) +
  geom_point(alpha = 0.5, size = 1.5) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
  annotate("text", x = 15, y = max(df4$y) * 0.92,
           label = label_text, hjust = 0, vjust = 1,
           size = 3.5, color = "gray20", fontface = "italic") +
  scale_color_manual(values = c("y1" = "steelblue", "y2" = "#c0622a")) +
  labs(title = "Same correlation, different slopes",
       x = "x", y = "Response variable") +
  theme_classic()

Both lines have a Pearson r ≈ 0.70, but y2 rises five times faster than y1. Correlation measures how consistently the points follow a linear trend; slope captures the rate of change.
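The link between the two quantities can be written down directly: for simple linear regression, the least-squares slope equals r multiplied by sd(y) / sd(x). The sketch below checks this on freshly simulated data (arbitrary seed, names ending in _s are illustrative) and then changes the units of y to show that only the slope responds.

```r
set.seed(5)  # arbitrary seed, used only for reproducibility

x_s <- 1:100
y_s <- x_s + rnorm(100, mean = 0, sd = 25)

r_s     <- cor(x_s, y_s)
slope_s <- coef(lm(y_s ~ x_s))[[2]]

# The least-squares slope is the correlation rescaled by the two spreads
slope_from_r <- r_s * sd(y_s) / sd(x_s)

# Changing the units of y (here: multiplying by 1000) rescales the slope only
y_big     <- 1000 * y_s
r_big     <- cor(x_s, y_big)
slope_big <- coef(lm(y_big ~ x_s))[[2]]

c(r = r_s, slope = slope_s, slope_from_r = slope_from_r)
c(r = r_big, slope = slope_big)
```

The correlation is identical before and after the unit change, while the slope grows by the same factor of 1000.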


Summary & Practical Guide

Which method should you use?

Method     Data type               Relationship         Outlier sensitivity
Pearson    Continuous              Linear               Sensitive
Spearman   Continuous or ordinal   Monotonic            Robust
Kendall    Ordinal or ranked       Pairwise agreement   Robust

Decision guide

  • Use Pearson when variables are continuous, the relationship looks linear, and there are no extreme outliers.
  • Use Spearman when the relationship is non-linear but consistently increasing or decreasing, or when outliers are present.
  • Use Kendall when working with ordinal data, small samples, or when ties are common.
  • For two categorical variables, consider Cramér’s V. For ordered categorical variables, look into polychoric correlation.
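
When in doubt, it costs little to compute all three coefficients and compare them. A minimal sketch on made-up monotonic but non-linear data (the data and the names x_c, y_c and coefs are illustrative):

```r
set.seed(6)  # arbitrary seed, used only for reproducibility

# Hypothetical monotonic but non-linear data
x_c <- runif(60, min = 0, max = 5)
y_c <- exp(x_c) + rnorm(60, sd = 5)

methods <- c("pearson", "spearman", "kendall")
coefs   <- sapply(methods, function(m) cor(x_c, y_c, method = m))

round(coefs, 3)
```

A large gap between Pearson and the rank-based coefficients is a hint that the relationship is not linear or that a few extreme points dominate. Kendall's tau is usually smaller in magnitude than Spearman's rho on the same data, so the two are not directly comparable.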

Final reminder

A strong correlation is a starting point, not a conclusion. Always visualize the relationship first. Check assumptions. And remember — correlation never, on its own, tells you why two variables move together.


Session Info

sessionInfo()
R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8   
[3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
[5] LC_TIME=English_India.utf8    

time zone: Asia/Calcutta
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] patchwork_1.3.2 tidyr_1.3.1     dplyr_1.1.4     ggplot2_4.0.1  

loaded via a namespace (and not attached):
 [1] Matrix_1.7-4       gtable_0.3.6       jsonlite_2.0.0     compiler_4.5.2    
 [5] tidyselect_1.2.1   splines_4.5.2      scales_1.4.0       yaml_2.3.10       
 [9] fastmap_1.2.0      lattice_0.22-7     R6_2.6.1           labeling_0.4.3    
[13] generics_0.1.4     knitr_1.50         htmlwidgets_1.6.4  tibble_3.3.0      
[17] pillar_1.11.1      RColorBrewer_1.1-3 rlang_1.1.6        xfun_0.54         
[21] S7_0.2.1           cli_3.6.5          withr_3.0.2        magrittr_2.0.4    
[25] mgcv_1.9-3         digest_0.6.38      grid_4.5.2         rstudioapi_0.17.1 
[29] lifecycle_1.0.4    nlme_3.1-168       vctrs_0.6.5        evaluate_1.0.5    
[33] glue_1.8.0         farver_2.1.2       rmarkdown_2.30     purrr_1.2.0       
[37] tools_4.5.2        pkgconfig_2.0.3    htmltools_0.5.8.1 

References

Akoglu, H. (2018). User’s guide to correlation coefficients. Turkish Journal of Emergency Medicine. https://doi.org/10.1016/j.tjem.2018.08.001

Bishara, A. J., & Hittner, J. B. (2012). Testing the significance of a correlation with nonnormal data: Comparison of Pearson, Spearman, transformation, and resampling approaches. Psychological Methods, 17(3), 399–417. https://doi.org/10.1037/a0028087

Havlicek, L., & Peterson, N. (1976). Robustness of the Pearson correlation against violations of assumptions. Perceptual and Motor Skills, 43(3). https://doi.org/10.2466/pms.1976.43.3f.1319

Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag.

Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.3.