library(ggplot2) # Plotting
library(dplyr) # Data wrangling
library(tidyr) # Data reshaping
Understanding Correlation: Pearson, Spearman, and Kendall’s Tau
A hands-on guide to the three most used correlation methods in R — what they measure, when to use each one, and how to calculate them from first principles.
What is Correlation?
Correlation is one of the most natural starting points when exploring data. It answers a deceptively simple question: do these two things tend to move together?
Think about studying and exam scores — if you study more, do you tend to score higher? Or consider ice cream sales and drowning incidents. Believe it or not, those two are positively correlated. But no one thinks ice cream causes drowning. Warmer weather drives both. That’s a classic spurious correlation — variables moving in sync for a hidden reason.
Correlation tells you whether and how strongly two variables are associated — not why. Always be skeptical of assuming causation from correlation alone.
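The ice cream example can be sketched in a few lines. The numbers below are hypothetical (they are not real sales or incident data), and deterministic wiggles stand in for random noise so the sketch does not touch the random seed used later. Both outcomes depend only on temperature, yet they correlate strongly with each other until temperature is accounted for.

```r
# Hypothetical data: temperature drives both outcomes.
# sin()/cos() wiggles stand in for noise, so no random numbers are drawn.
day         <- 1:50
temperature <- seq(10, 35, length.out = 50)
ice_cream   <- 20 + 3 * temperature + 5 * sin(day)
drownings   <- 1 + 0.2 * temperature + 0.5 * cos(1.7 * day)

# Raw correlation between the two outcomes
raw_cor <- cor(ice_cream, drownings)

# Remove the part of each outcome explained by temperature,
# then correlate what is left (a partial correlation)
ice_resid   <- resid(lm(ice_cream ~ temperature))
drown_resid <- resid(lm(drownings ~ temperature))
partial_cor <- cor(ice_resid, drown_resid)

round(c(raw = raw_cor, partial = partial_cor), 2)
```

The raw correlation is high, while the partial correlation is close to zero: once the shared driver is removed, little association remains.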
The correlation coefficient captures both the direction and strength of a relationship in a single number, always ranging from −1 to +1:
| Value | Meaning |
|---|---|
| +1 | Perfect positive relationship |
| 0 | No linear or monotonic association |
| −1 | Perfect negative relationship |
In this article, we walk through three methods for calculating correlation — Pearson, Spearman, and Kendall’s tau — each suited to different types of data and relationships. We’ll also build each one from first principles so the mechanics become clear.
Pearson Correlation
The Pearson correlation coefficient — denoted r — is the workhorse of correlation. It measures the strength and direction of a linear relationship between two continuous variables.
When to use it
Before calculating Pearson correlation, three assumptions should be met:
- Both variables are continuous
- The relationship is approximately linear
- There are no extreme outliers
These assumptions describe ideal conditions. For markedly non-normal data, some researchers recommend transforming the variables first; others argue that Pearson is robust to mild violations of normality (Havlicek & Peterson, 1976). When in doubt, draw a scatter plot: your eyes are often the best check.
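One way to see why the scatter plot matters is Anscombe's quartet, which ships with base R as the `anscombe` data frame. It is an illustration added here, not part of this article's simulated data. The four x/y pairs have nearly identical Pearson coefficients, yet only one of them looks like a clean linear relationship.

```r
# anscombe is in the datasets package: columns x1..x4 and y1..y4
anscombe_r <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})
names(anscombe_r) <- paste0("set", 1:4)
round(anscombe_r, 3)  # all four are about 0.82

# The scatter plots tell four different stories
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i),
       pch = 19, col = "steelblue")
}
par(op)
```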
Step 1 — Generate and visualize the data
We generate two variables, x and y, where y is linearly related to x plus random noise. A scatter plot first confirms linearity.
set.seed(404)
x <- rnorm(100, mean = 8, sd = 2)
y <- 0.5 * x + rnorm(100, sd = 1)
plot(x, y,
main = "Scatterplot of x vs y",
xlab = "x", ylab = "y",
pch = 19, col = "steelblue")
Good. The upward trend is clear: as x increases, y tends to increase too.
Step 2 — Mean-centering
The core idea behind Pearson correlation is mean-centering: for each value, subtract the variable’s mean. This shifts the data so the center is zero. What remains tells us how each observation sits above or below average.
# Step 1: Find the means
mean_x <- mean(x)
mean_y <- mean(y)
cat("Mean of x:", round(mean_x, 4), "\n")
Mean of x: 8.3025
cat("Mean of y:", round(mean_y, 4), "\n")
Mean of y: 4.098
# Step 2: Subtract the mean from each value
dev_x <- x - mean_x
dev_y <- y - mean_y
After centering, positive deviations mean “above average” and negative deviations mean “below average.” Now imagine multiplying each pair of deviations together. The sign of each product reveals whether both variables move in the same direction at the same time.
The quadrant intuition
When we plot the mean-centered deviations, the plane divides into four quadrants. The sign of the product tells us the direction of co-movement:
- Quadrant I (+, +): Both above average → positive product ✓
- Quadrant II (−, +): x below avg, y above → negative product
- Quadrant III (−, −): Both below average → positive product ✓
- Quadrant IV (+, −): x above avg, y below → negative product
If most points fall in Quadrants I and III, the products sum to a large positive value — indicating positive correlation.
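The quadrant bookkeeping can be sketched on a handful of points. The vectors `qx` and `qy` are hypothetical values chosen for illustration, not the simulated data used in this article.

```r
# Small hypothetical vectors (not the article's simulated data)
qx <- c(1, 2, 3, 4, 5, 6)
qy <- c(2, 1, 4, 3, 6, 5)

# Mean-center, then multiply the paired deviations
q_dev_x <- qx - mean(qx)
q_dev_y <- qy - mean(qy)
q_prod  <- q_dev_x * q_dev_y

# Positive products sit in Quadrants I and III, negative ones in II and IV
table(ifelse(q_prod > 0, "I or III (+)", "II or IV (-)"))

# The signed sum is the numerator of Pearson's r
sum(q_prod)
# [1] 14.5
```

Four of the six points land in Quadrants I and III, so the positive products outweigh the negative ones and the sum is positive.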
par(mfrow = c(1, 2))
# Left: original data with mean lines
plot(x, y,
main = "Raw data with mean lines",
xlab = "x", ylab = "y",
pch = 19, col = "steelblue")
abline(h = mean_y, col = "black", lty = 3, lwd = 2)
abline(v = mean_x, col = "black", lty = 6, lwd = 2)
legend("topright",
legend = c("Mean of x", "Mean of y"),
col = "black", lty = c(6, 3), bty = "n")
# Right: centered deviations coloured by product sign
prod_dev <- dev_x * dev_y
colors <- ifelse(prod_dev > 0, "salmon", "cornflowerblue")
plot(dev_x, dev_y,
main = "Mean-centered deviations",
xlab = "Deviations x", ylab = "Deviations y",
pch = 19, col = colors)
abline(h = 0, col = "black", lty = 3, lwd = 2)
abline(v = 0, col = "black", lty = 6, lwd = 2)
# Draw vectors from the origin
for (i in seq_along(dev_x)) {
lines(c(0, dev_x[i]), c(0, dev_y[i]), col = "gray80", lty = 3)
}
# Quadrant labels
x_off <- max(abs(dev_x)) * 0.65
y_off <- max(abs(dev_y)) * 0.65
text( x_off, y_off, "Quadrant I (+,+)", cex = 0.8)
text(-x_off, y_off, "Quadrant II (-,+)", cex = 0.8)
text(-x_off, -y_off, "Quadrant III (-,-)", cex = 0.8)
text( x_off, -y_off, "Quadrant IV (+,-)", cex = 0.8)
par(mfrow = c(1, 1))
Step 3 — Calculate by hand
\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \cdot \sum (y_i - \bar{y})^2}} \]
# Numerator: sum of products of deviations
sum_product_dev <- sum(dev_x * dev_y)
# Denominator: square root of product of summed squared deviations
sum_sq_x <- sum(dev_x^2)
sum_sq_y <- sum(dev_y^2)
# Pearson's r
r_manual <- sum_product_dev / sqrt(sum_sq_x * sum_sq_y)
cat("Pearson r (manual):", round(r_manual, 7), "\n")
Pearson r (manual): 0.6791683
We can verify this instantly with cor():
cor(x, y, method = "pearson")
[1] 0.6791683
A value of 0.68 indicates a moderate positive linear relationship. Both variables tend to rise together, though there is meaningful scatter around the trend line.
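The same coefficient can be written in two other equivalent forms, which may help connect the formula to familiar functions. The sketch below uses small hypothetical vectors (`px`, `py`) and not the article's data. Because `cov()` and `sd()` both divide by n − 1, those factors cancel and the ratio matches the formula above.

```r
# Small hypothetical vectors (not the article's simulated data)
px <- c(1, 2, 3, 4, 5, 6)
py <- c(2, 1, 4, 3, 6, 5)

# Covariance divided by the product of the standard deviations
r_cov <- cov(px, py) / (sd(px) * sd(py))

# Sum of products of z-scores, divided by n - 1 (matching sd())
r_z <- sum(scale(px) * scale(py)) / (length(px) - 1)

# Changing units of x (scale and offset) leaves r unchanged
r_rescaled <- cor(100 * px + 7, py)

c(cov_form = r_cov, z_form = r_z, rescaled = r_rescaled, builtin = cor(px, py))
```

All four values agree, which is why r is described as scale-free.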
Spearman Correlation
What if the relationship between your variables isn’t a straight line? Maybe it’s exponential: always increasing, but not at a constant rate. Pearson may still pick up part of the trend, but Spearman is designed for exactly this case. It measures the strength of a monotonic association, not the exact functional form of the relationship.
A monotonic relationship is one where the variables consistently move in the same direction. The rate of change doesn’t have to be constant — it just can’t reverse direction.
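For a noise-free curve, this definition can be checked directly. The helper `is_monotonic()` below is a hypothetical name introduced for illustration. It only makes sense for exact curves: real, noisy data almost always contain small reversals, which is why Spearman reports a strength rather than a yes/no answer.

```r
# Hypothetical helper: TRUE if y never reverses direction when ordered by x
is_monotonic <- function(x, y) {
  steps <- diff(y[order(x)])
  all(steps >= 0) || all(steps <= 0)
}

x_grid <- seq(0, 10, length.out = 100)

mono_up   <- is_monotonic(x_grid, exp(x_grid / 2))         # always increasing
mono_down <- is_monotonic(x_grid, 10 * exp(-0.5 * x_grid)) # always decreasing
mono_sine <- is_monotonic(x_grid, sin(x_grid))             # rises and falls

c(increasing = mono_up, decreasing = mono_down, sine = mono_sine)
```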
Types of monotonic relationships
# Increasing
x1 <- seq(1, 100, by = 1)
y1_base <- exp(x1 / 20)
y1 <- y1_base + rnorm(100, 0, 10)
# Decreasing
x2 <- seq(0, 10, length.out = 100)
y2_base <- 10 * exp(-0.5 * x2)
y2 <- y2_base + rnorm(100, 0, 1)
# Non-monotonic (sine wave)
x3 <- seq(1, 100, by = 1)
y3_base <- sin(x3 / 20) * 10 + x3 * 0.2
y3 <- y3_base + rnorm(100, 0, 2)
p_inc <- ggplot(data.frame(x = x1, y = y1, base = y1_base), aes(x, y)) +
geom_point(color = "gray60", alpha = 0.5, size = 1.5) +
geom_line(aes(y = base), color = "steelblue", linewidth = 1.2) +
labs(title = "Monotonically increasing", x = "X", y = "Y") +
theme_classic()
p_dec <- ggplot(data.frame(x = x2, y = y2, base = y2_base), aes(x, y)) +
geom_point(color = "gray60", alpha = 0.5, size = 1.5) +
geom_line(aes(y = base), color = "steelblue", linewidth = 1.2) +
labs(title = "Monotonically decreasing", x = "X", y = "Y") +
theme_classic()
p_non <- ggplot(data.frame(x = x3, y = y3, base = y3_base), aes(x, y)) +
geom_point(color = "gray60", alpha = 0.5, size = 1.5) +
geom_line(aes(y = base), color = "#c0622a", linewidth = 1.2) +
labs(title = "Not monotonic", x = "X", y = "Y") +
theme_classic()
library(patchwork)
p_inc | p_dec | p_non
The key idea: ranks, not raw values
Spearman doesn’t use the actual values of your data — it converts them to ranks first. The smallest value gets rank 1, the next gets rank 2, and so on. This makes the method resistant to outliers and valid for any monotonic shape.
x_example <- c(3, 8, 5, 9, 7)
rank(x_example)
[1] 1 4 2 5 3
The values 3, 8, 5, 9, 7 map to ranks 1, 4, 2, 5, 3.
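Two properties of `rank()` are worth seeing in action: how tied values are handled, and why ranking blunts outliers. The vectors below are made-up values for illustration.

```r
# Ties: by default, tied values share the average of the ranks they occupy
avg_ranks <- rank(c(10, 20, 20, 30))
avg_ranks
# [1] 1.0 2.5 2.5 4.0

# Other tie rules are available through ties.method
min_ranks <- rank(c(10, 20, 20, 30), ties.method = "min")
min_ranks
# [1] 1 2 2 4

# Outliers: replacing the largest value with an extreme one
# leaves the ranks unchanged
ranks_plain   <- rank(c(3, 8, 5, 9, 7))
ranks_outlier <- rank(c(3, 8, 5, 9000, 7))
identical(ranks_plain, ranks_outlier)
# [1] TRUE
```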
Step-by-step calculation
\[ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \]
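where \(d_i\) is the difference between the rank of \(x_i\) and the rank of \(y_i\) for observation \(i\), and \(n\) is the number of observations. This shortcut formula is exact only when there are no tied ranks.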
df1 <- data.frame(
x = seq(1, 100, by = 1),
y = exp(seq(1, 100, by = 1) / 20) + rnorm(100, 0, 10)
)
# Rank each variable
df1$rank_x <- rank(df1$x)
df1$rank_y <- rank(df1$y)
# Difference in ranks, then square
df1$d <- df1$rank_x - df1$rank_y
df1$d2 <- df1$d^2
# Spearman's rho
n <- nrow(df1)
rho <- 1 - (6 * sum(df1$d2)) / (n * (n^2 - 1))
cat("Spearman rho (manual):", round(rho, 7), "\n")
Spearman rho (manual): 0.8692829
cor(df1$x, df1$y, method = "spearman")
[1] 0.8692829
Both approaches return 0.87 — a strong positive monotonic relationship, even though the underlying trend is exponential rather than linear.
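Another way to read Spearman's rho is as Pearson's r computed on the ranks. The sketch below uses small hypothetical vectors and a helper named `spearman_shortcut()` (introduced here for illustration) to show that the rank-difference formula matches `cor()` when there are no ties, and drifts slightly once a tie appears.

```r
# Hypothetical helper implementing the rank-difference shortcut
spearman_shortcut <- function(x, y) {
  d <- rank(x) - rank(y)
  n <- length(x)
  1 - (6 * sum(d^2)) / (n * (n^2 - 1))
}

# No ties: shortcut, Pearson-on-ranks, and cor() all agree
sp_x <- c(1, 2, 3, 4, 5)
sp_y <- c(2, 1, 4, 3, 5)
no_ties <- c(shortcut         = spearman_shortcut(sp_x, sp_y),
             pearson_on_ranks = cor(rank(sp_x), rank(sp_y)),
             builtin          = cor(sp_x, sp_y, method = "spearman"))
no_ties

# One tie in x: the shortcut drifts, Pearson-on-ranks still matches cor()
sp_x_tied <- c(1, 2, 2, 4, 5)
with_ties <- c(shortcut         = spearman_shortcut(sp_x_tied, sp_y),
               pearson_on_ranks = cor(rank(sp_x_tied), rank(sp_y)),
               builtin          = cor(sp_x_tied, sp_y, method = "spearman"))
with_ties
```

With tied data, computing Pearson's r on the ranks (which is what `cor(..., method = "spearman")` reports) is the safer route.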
Visualising the ranks
df1$highlight <- ifelse(row.names(df1) %in% c("10", "51"), "highlight", "normal")
ggplot(df1, aes(x = rank_x, y = rank_y)) +
geom_point(aes(color = highlight), size = 2, alpha = 0.8) +
geom_abline(slope = 1, intercept = 0,
linetype = "dashed", color = "gray50") +
scale_color_manual(values = c("highlight" = "#c0622a", "normal" = "steelblue")) +
labs(x = "Rank of x", y = "Rank of y",
title = "Rank agreement between x and y") +
theme_classic() +
theme(legend.position = "none")
For a U-shaped (quadratic) relationship, Spearman returns something close to zero. That is not because there is no pattern, but because the pattern reverses direction: the negative rank association on one side of the curve cancels the positive rank association on the other.
df_quad <- data.frame(x = seq(-3, 3, length.out = 100))
df_quad$y <- df_quad$x^2 + rnorm(100, 0, 1)
ggplot(df_quad, aes(x, y)) +
geom_point(color = "gray60", alpha = 0.6) +
geom_smooth(method = "lm", formula = y ~ poly(x, 2),
se = FALSE, color = "#c0622a", linewidth = 1.2) +
labs(title = "Quadratic relationship — Spearman near zero",
subtitle = paste("Spearman rho =",
round(cor(df_quad$x, df_quad$y, method = "spearman"), 3)),
x = "x", y = "y") +
theme_classic()
Kendall’s Tau
Kendall’s tau (τ) takes a different approach from Pearson and Spearman. Instead of using raw values or rank differences, it compares pairs of observations and evaluates whether their relative ordering is consistent across both variables.
In simple terms, it asks: for any two observations, do both variables move in the same direction?
It is especially useful when:
- Working with small datasets
- Variables are ordinal (e.g., Likert scale ratings, satisfaction scores)
- The data contains many tied ranks
Concordant vs discordant pairs
Take any two observations. Look at their ranks on both variables. If both ranks go in the same direction (observation A ranks higher on both variables), the pair is concordant. If the ranks go in opposite directions, it’s discordant.
Example — using ranked pairs (2,7), (3,5), (10,8):
| Pair | rank_x direction | rank_y direction | Result |
|---|---|---|---|
| (2,7) vs (3,5) | 2 → 3 ↑ | 7 → 5 ↓ | ✗ Discordant |
| (2,7) vs (10,8) | 2 → 10 ↑ | 7 → 8 ↑ | ✓ Concordant |
| (3,5) vs (10,8) | 3 → 10 ↑ | 5 → 8 ↑ | ✓ Concordant |
The formula
\[ \tau = \frac{C - D}{C + D} \]
where \(C\) = number of concordant pairs and \(D\) = number of discordant pairs.
The total number of unique pairs from \(n\) observations is:
\[ \binom{n}{2} = \frac{n(n-1)}{2} \]
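As a quick check, the three ranked pairs from the table above can be run through the formula. This is a minimal sketch; the variable names (`kx`, `ky`, `pairs_idx`) are introduced here for illustration.

```r
# The three ranked pairs from the table: (2,7), (3,5), (10,8)
kx <- c(2, 3, 10)
ky <- c(7, 5, 8)

# Number of unique pairs: choose(3, 2) = 3
n_pairs_small <- choose(length(kx), 2)

# Every unique pair of observations, one column per pair
pairs_idx <- combn(length(kx), 2)

# Sign of the product of differences: +1 concordant, -1 discordant
pair_sign <- sign((kx[pairs_idx[2, ]] - kx[pairs_idx[1, ]]) *
                  (ky[pairs_idx[2, ]] - ky[pairs_idx[1, ]]))

n_concordant <- sum(pair_sign > 0)
n_discordant <- sum(pair_sign < 0)
tau_small <- (n_concordant - n_discordant) / (n_concordant + n_discordant)

c(pairs = n_pairs_small, C = n_concordant, D = n_discordant, tau = tau_small)
cor(kx, ky, method = "kendall")
```

Two concordant pairs and one discordant pair give a tau of 1/3, which matches the value from `cor()`.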
Step-by-step calculation
x <- sample(1:100, 10)
y <- sample(1:100, 10)
df <- data.frame(x, y)
df$rank_x <- rank(df$x)
df$rank_y <- rank(df$y)
head(df, n = 5)
| x | y | rank_x | rank_y |
|---|---|---|---|
| 98 | 29 | 10 | 5 |
| 92 | 11 | 9 | 1 |
| 24 | 16 | 5 | 2 |
| 4 | 59 | 2 | 6 |
| 2 | 64 | 1 | 7 |
# Count concordant (C) and discordant (D) pairs
C <- 0
D <- 0
n <- nrow(df)
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
dx <- df$rank_x[j] - df$rank_x[i]
dy <- df$rank_y[j] - df$rank_y[i]
if (dx * dy > 0) {
C <- C + 1
} else if (dx * dy < 0) {
D <- D + 1
}
}
}
cat("Concordant pairs:", C, "\n")
Concordant pairs: 20
cat("Discordant pairs:", D, "\n")
Discordant pairs: 25
cat("Total pairs:", C + D, "\n")
Total pairs: 45
# Kendall's tau-a
tau <- (C - D) / (C + D)
cat("Kendall's tau (manual):", round(tau, 7), "\n")
Kendall's tau (manual): -0.1111111
cor(x, y, method = "kendall")
[1] -0.1111111
When there are no ties, every pair is either concordant or discordant, so \(C + D = \binom{n}{2}\) and the formula above is tau-a. When ties are present, cor(..., method = "kendall") in R computes tau-b, which adjusts the denominator for tied pairs. For most use cases, tau-b is the right choice.
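The effect of ties is easiest to see on a tiny example. The sketch below uses hypothetical data with one tied pair in each variable and computes three quantities: the plain ratio (C − D)/(C + D), tau-a (dividing by all pairs), and tau-b (using the standard tie-adjusted denominator). Variable names are introduced here for illustration.

```r
# Hypothetical data with one tie in each variable
tx <- c(1, 2, 2, 3)
ty <- c(1, 2, 3, 3)

idx    <- combn(length(tx), 2)
sign_x <- sign(tx[idx[2, ]] - tx[idx[1, ]])  # 0 marks a pair tied on x
sign_y <- sign(ty[idx[2, ]] - ty[idx[1, ]])  # 0 marks a pair tied on y

n_c     <- sum(sign_x * sign_y > 0)
n_d     <- sum(sign_x * sign_y < 0)
n_pairs <- choose(length(tx), 2)

ratio_cd <- (n_c - n_d) / (n_c + n_d)  # ignores tied pairs entirely
tau_a    <- (n_c - n_d) / n_pairs      # divides by all pairs
# tau-b: geometric mean of the pairs not tied on x and not tied on y
tau_b    <- (n_c - n_d) / sqrt(sum(sign_x != 0) * sum(sign_y != 0))

c(ratio = ratio_cd, tau_a = tau_a, tau_b = tau_b,
  builtin = cor(tx, ty, method = "kendall"))
```

Here the three quantities differ (1, about 0.67, and 0.8), and the value from `cor()` matches tau-b.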
Correlation Is Not Slope
This is a common misconception worth clearing up. Correlation and slope both describe relationships between variables — but they measure fundamentally different things.
| Concept | Symbol | What it measures |
|---|---|---|
| Correlation | r, ρ, τ | Strength of association (scale-free) |
| Slope | m | Rate of change (depends on scale and units) |
Two datasets can have the same correlation but very different slopes:
x <- 1:100
# y1: gentle slope, moderate correlation
y1 <- x + rnorm(100, mean = 0, sd = 25)
# y2: steep slope, same correlation
y2 <- 5 * x + rnorm(100, mean = 0, sd = 130)
cor1 <- cor(x, y1)
cor2 <- cor(x, y2)
slope1 <- coef(lm(y1 ~ x))[[2]]
slope2 <- coef(lm(y2 ~ x))[[2]]
label_text <- paste0(
"y1: r = ", round(cor1, 2), ", slope = ", round(slope1, 2), "\n",
"y2: r = ", round(cor2, 2), ", slope = ", round(slope2, 2)
)
df4 <- data.frame(x = x, y1 = y1, y2 = y2) |>
pivot_longer(cols = c(y1, y2), names_to = "Response", values_to = "y")
ggplot(df4, aes(x = x, y = y, color = Response)) +
geom_point(alpha = 0.5, size = 1.5) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
annotate("text", x = 15, y = max(df4$y) * 0.92,
label = label_text, hjust = 0, vjust = 1,
size = 3.5, color = "gray20", fontface = "italic") +
scale_color_manual(values = c("y1" = "steelblue", "y2" = "#c0622a")) +
labs(title = "Same correlation, different slopes",
x = "x", y = "Response variable") +
theme_classic()
Both correlations are ~0.70, but one line rises five times faster than the other. Correlation captures consistency of the relationship; slope captures magnitude.
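The link between the two quantities can be made explicit: for simple linear regression, the least-squares slope equals r multiplied by the ratio of the standard deviations. The sketch below uses small hypothetical vectors (`mx`, `my`) to show this, and to show that rescaling y changes the slope but not r.

```r
# Small hypothetical vectors (not the article's simulated data)
mx <- c(1, 2, 3, 4, 5, 6)
my <- c(4, 2, 8, 6, 12, 10)

# Least-squares slope equals r times sd(y) / sd(x)
slope_lm     <- coef(lm(my ~ mx))[["mx"]]
slope_from_r <- cor(mx, my) * sd(my) / sd(mx)

# Multiplying y by 10 multiplies the slope by 10 but leaves r unchanged
my_scaled    <- 10 * my
slope_scaled <- coef(lm(my_scaled ~ mx))[["mx"]]

c(slope = slope_lm, slope_from_r = slope_from_r, slope_scaled = slope_scaled)
c(r = cor(mx, my), r_scaled = cor(mx, my_scaled))
```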
Summary & Practical Guide
Which method should you use?
| Method | Data type | Relationship | Outlier sensitivity |
|---|---|---|---|
| Pearson | Continuous | Linear | Sensitive |
| Spearman | Continuous or ordinal | Monotonic | Robust |
| Kendall | Ordinal or ranked | Pairwise agreement | Robust |
Decision guide
- Use Pearson when variables are continuous, the relationship looks linear, and there are no extreme outliers.
- Use Spearman when the relationship is non-linear but consistently increasing or decreasing, or when outliers are present.
- Use Kendall when working with ordinal data, small samples, or when ties are common.
- For two categorical variables, consider Cramér’s V. For ordered categorical variables, look into polychoric correlation.
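When it is unclear which method fits, running all three side by side can be informative. The helper `compare_cor()` below is a hypothetical convenience wrapper around `cor()`, and the two test cases use made-up data.

```r
# Hypothetical helper: run all three methods on the same pair of vectors
compare_cor <- function(x, y) {
  methods <- c("pearson", "spearman", "kendall")
  vapply(methods, function(m) cor(x, y, method = m), numeric(1))
}

cx <- 1:10

# Perfectly monotonic but curved: rank-based methods give 1, Pearson does not
res_curved <- compare_cor(cx, exp(cx))
round(res_curved, 3)

# One extreme value in otherwise linear data pulls Pearson below 1,
# while the ordering (and so the rank-based methods) is unchanged
res_outlier <- compare_cor(cx, c(1:9, 1000))
round(res_outlier, 3)
```

A large gap between Pearson and the rank-based coefficients is a hint to look at the scatter plot for curvature or outliers.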
A strong correlation is a starting point, not a conclusion. Always visualize the relationship first. Check assumptions. And remember — correlation never, on its own, tells you why two variables move together.
Session Info
sessionInfo()
R version 4.5.2 (2025-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)
Matrix products: default
LAPACK version 3.12.1
locale:
[1] LC_COLLATE=English_India.utf8 LC_CTYPE=English_India.utf8
[3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C
[5] LC_TIME=English_India.utf8
time zone: Asia/Calcutta
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] patchwork_1.3.2 tidyr_1.3.1 dplyr_1.1.4 ggplot2_4.0.1
loaded via a namespace (and not attached):
[1] Matrix_1.7-4 gtable_0.3.6 jsonlite_2.0.0 compiler_4.5.2
[5] tidyselect_1.2.1 splines_4.5.2 scales_1.4.0 yaml_2.3.10
[9] fastmap_1.2.0 lattice_0.22-7 R6_2.6.1 labeling_0.4.3
[13] generics_0.1.4 knitr_1.50 htmlwidgets_1.6.4 tibble_3.3.0
[17] pillar_1.11.1 RColorBrewer_1.1-3 rlang_1.1.6 xfun_0.54
[21] S7_0.2.1 cli_3.6.5 withr_3.0.2 magrittr_2.0.4
[25] mgcv_1.9-3 digest_0.6.38 grid_4.5.2 rstudioapi_0.17.1
[29] lifecycle_1.0.4 nlme_3.1-168 vctrs_0.6.5 evaluate_1.0.5
[33] glue_1.8.0 farver_2.1.2 rmarkdown_2.30 purrr_1.2.0
[37] tools_4.5.2 pkgconfig_2.0.3 htmltools_0.5.8.1
References
Akoglu, H. (2018). User’s guide to correlation coefficients. Turkish Journal of Emergency Medicine. https://doi.org/10.1016/j.tjem.2018.08.001
Bishara, A. J., & Hittner, J. B. (2012). Testing the significance of a correlation with nonnormal data: Comparison of Pearson, Spearman, transformation, and resampling approaches. Psychological Methods, 17(3), 399–417. https://doi.org/10.1037/a0028087
Havlicek, L., & Peterson, N. (1976). Robustness of the Pearson correlation against violations of assumptions. Perceptual and Motor Skills, 43(3). https://doi.org/10.2466/pms.1976.43.3f.1319
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag.
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.3.