In this demonstration, we will explore Spearman’s Rank Correlation, a powerful statistical tool used to assess the monotonic relationship between two variables. Unlike Pearson correlation, which measures linear relationships and assumes normally distributed data, Spearman’s correlation is nonparametric: it makes no assumptions about the underlying distribution of your data. That makes it incredibly versatile, especially when your data do not meet the assumptions of parametric tests.
The Spearman Rank Correlation Test is a non-parametric measure of statistical dependence between two variables. It assesses how well the relationship between two variables can be described using a monotonic function.
The Spearman correlation coefficient (\(\rho\)) is calculated as:
\[ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \]
where \(d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i)\) is the difference in ranks for the \(i\)-th pair of observations and \(n\) is the number of observations.
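For example, with the \(n = 10\) observations used in the demonstration below, the squared rank differences sum to \(\sum d_i^2 = 2\), giving
\[ \rho = 1 - \frac{6 \times 2}{10(10^2 - 1)} = 1 - \frac{12}{990} \approx 0.988 \]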
Under the null hypothesis (\(H_0\): no monotonic association), the test statistic \(t = \rho \sqrt{\frac{n-2}{1-\rho^2}}\) approximately follows a t-distribution with \(n - 2\) degrees of freedom, which is what significance testing relies on.
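In practice, cor.test() handles this automatically (and uses an exact permutation method for small, tie-free samples). Purely to illustrate the approximation, here is a minimal sketch with made-up values for \(\rho\) and \(n\):
# Illustration only: t-approximation for testing H0: rho = 0 (assumes no ties)
rho_hat <- 0.70                              # hypothetical sample estimate
n_obs   <- 30                                # hypothetical sample size
t_stat  <- rho_hat * sqrt((n_obs - 2) / (1 - rho_hat^2))
2 * pt(-abs(t_stat), df = n_obs - 2)         # two-sided p-value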
In real-world scenarios, data is often messy! It might not be normally distributed, or the relationship between variables might not be perfectly linear. Spearman’s correlation steps in to help us understand associations even in these situations.
A monotonic relationship means that as one variable increases, the other variable tends to either increase or decrease, but not necessarily at a constant rate. It’s about the direction of the relationship, not the linearity.
Spearman’s correlation quantifies the strength and direction of this monotonic relationship.
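A quick illustration of the difference from Pearson: for a perfectly monotonic but non-linear relationship, Pearson’s coefficient falls below 1 while Spearman’s equals exactly 1.
# Monotonic but non-linear: y always increases with x, just not at a constant rate
x_demo <- 1:10
y_demo <- exp(x_demo)
cor(x_demo, y_demo, method = "pearson")   # noticeably less than 1
cor(x_demo, y_demo, method = "spearman")  # exactly 1, the rank orderings agree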
Spearman’s correlation works by:

- ranking each variable separately,
- computing the difference in ranks \(d_i\) for each pair of observations,
- squaring and summing those differences, and
- plugging \(\sum d_i^2\) into the formula above.

The R code below walks through each of these steps, starting from simulated data.
# Simulate some data (non-normally distributed for demonstration)
set.seed(123) # For reproducibility
x <- runif(10, 0, 10) ^ 2 # Squared to introduce non-linearity
y <- x + rnorm(10, 0, 5) # Add noise
# Create a data frame
df <- data.frame(x = x, y = y)
# Step 1: Rank the data
df$rank_x <- rank(df$x)
df$rank_y <- rank(df$y)
print("Step 1: Data with Ranks")
## [1] "Step 1: Data with Ranks"
print(df)
## x y rank_x rank_y
## 1 8.2700830 16.845408 2 3
## 2 62.1424987 64.447080 7 7
## 3 16.7262123 10.400906 3 2
## 4 77.9719736 74.537709 8 8
## 5 88.4478713 86.219561 10 10
## 6 0.2075395 6.327948 1 1
## 7 27.8895407 29.688610 5 5
## 8 79.6411751 81.645032 9 9
## 9 30.4080575 30.961471 6 6
## 10 20.8497016 18.070496 4 4
# Step 2: Calculate the difference in ranks (di)
df$diff_rank <- df$rank_x - df$rank_y
df$diff_rank_sq <- df$diff_rank^2
print("Step 2: Differences and Squared Differences in Ranks")
## [1] "Step 2: Differences and Squared Differences in Ranks"
print(df)
## x y rank_x rank_y diff_rank diff_rank_sq
## 1 8.2700830 16.845408 2 3 -1 1
## 2 62.1424987 64.447080 7 7 0 0
## 3 16.7262123 10.400906 3 2 1 1
## 4 77.9719736 74.537709 8 8 0 0
## 5 88.4478713 86.219561 10 10 0 0
## 6 0.2075395 6.327948 1 1 0 0
## 7 27.8895407 29.688610 5 5 0 0
## 8 79.6411751 81.645032 9 9 0 0
## 9 30.4080575 30.961471 6 6 0 0
## 10 20.8497016 18.070496 4 4 0 0
# Step 3: Calculate the sum of squared differences
sum_diff_sq <- sum(df$diff_rank_sq)
print("Step 3: Sum of Squared Differences")
## [1] "Step 3: Sum of Squared Differences"
print(sum_diff_sq)
## [1] 2
# Step 4: Calculate Spearman's correlation (rho)
n <- nrow(df)
rho <- 1 - (6 * sum_diff_sq) / (n * (n^2 - 1))
print("Step 4: Spearman's Correlation (rho)")
## [1] "Step 4: Spearman's Correlation (rho)"
print(rho)
## [1] 0.9878788
# Step 5: Verify with cor.test (for comparison)
cor_test_result <- cor.test(df$x, df$y, method = "spearman")
print("Step 5: cor.test Result")
## [1] "Step 5: cor.test Result"
print(cor_test_result)
##
## Spearman's rank correlation rho
##
## data: df$x and df$y
## S = 2, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9878788
# Displaying the steps in a more table-like format for the final output
final_table <- data.frame(
x = df$x,
y = df$y,
Rank_x = df$rank_x,
Rank_y = df$rank_y,
d_i = df$diff_rank,
d_i_sq = df$diff_rank_sq
)
print("Final Table with all Calculations")
## [1] "Final Table with all Calculations"
print(final_table)
## x y Rank_x Rank_y d_i d_i_sq
## 1 8.2700830 16.845408 2 3 -1 1
## 2 62.1424987 64.447080 7 7 0 0
## 3 16.7262123 10.400906 3 2 1 1
## 4 77.9719736 74.537709 8 8 0 0
## 5 88.4478713 86.219561 10 10 0 0
## 6 0.2075395 6.327948 1 1 0 0
## 7 27.8895407 29.688610 5 5 0 0
## 8 79.6411751 81.645032 9 9 0 0
## 9 30.4080575 30.961471 6 6 0 0
## 10 20.8497016 18.070496 4 4 0 0
cat("\nSpearman's correlation (calculated manually):", rho, "\n")
##
## Spearman's correlation (calculated manually): 0.9878788
cat("Spearman's correlation (from cor.test):", cor_test_result$estimate, "\n")
## Spearman's correlation (from cor.test): 0.9878788
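One practical caveat before moving on: the data above contained no tied values. When ties are present, tied observations receive the average of the ranks they span, the simple \(\sum d_i^2\) formula becomes only approximate, and cor.test() warns that an exact p-value cannot be computed. A small sketch with made-up tied values:
# Tied values receive average ranks
x_ties <- c(1, 2, 2, 3, 4)
rank(x_ties)                               # 1.0 2.5 2.5 4.0 5.0
y_ties <- c(5, 7, 7, 8, 9)
cor(x_ties, y_ties, method = "spearman")   # 1 here, because the rank orderings match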
Let’s analyze the correlation between Exam Scores and Hours Studied for 20 students.
# Generate Simulated Data
set.seed(123)
hours_studied <- runif(20, min = 1, max = 10) # Random study hours between 1 and 10
exam_scores <- hours_studied * 8 + rnorm(20, mean = 0, sd = 5) # Approx linear relationship + noise
# Create dataframe
df_spearman <- data.frame(Hours_Studied = hours_studied, Exam_Scores = exam_scores)
# Display the first few rows (kable() comes from the knitr package)
library(knitr)
kable(head(df_spearman), caption = "First Few Rows of Simulated Data")
| Hours_Studied | Exam_Scores |
|---|---|
| 3.588198 | 34.82599 |
| 8.094746 | 66.55704 |
| 4.680792 | 39.45020 |
| 8.947157 | 72.13067 |
| 9.464206 | 72.93444 |
| 1.410008 | 20.21463 |
# Perform Spearman Rank Correlation Test
spearman_test <- cor.test(df_spearman$Hours_Studied, df_spearman$Exam_Scores, method = "spearman")
# Print test result
spearman_test
##
## Spearman's rank correlation rho
##
## data: df_spearman$Hours_Studied and df_spearman$Exam_Scores
## S = 52, p-value = 6.473e-06
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9609023
The interpretation of the Spearman Rank Correlation Test depends on the p-value:

- If \(p < 0.05\), we reject \(H_0\): there is a statistically significant monotonic relationship between hours studied and exam scores.
- If \(p \geq 0.05\), we fail to reject \(H_0\): the data do not provide sufficient evidence of a monotonic relationship.

Here \(p \approx 6.5 \times 10^{-6}\), so we have strong evidence against \(H_0\), suggesting a meaningful monotonic association between the variables.
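The components of the test can also be pulled out of the result object programmatically, which is handy when the decision needs to be automated:
# Extract the estimate and p-value from the htest object
spearman_test$estimate    # rho
spearman_test$p.value     # p-value
ifelse(spearman_test$p.value < 0.05,
       "Reject H0: significant monotonic association",
       "Fail to reject H0")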
# Visualize the relationship (ggplot() comes from the ggplot2 package)
library(ggplot2)
ggplot(df_spearman, aes(x = Hours_Studied, y = Exam_Scores)) +
geom_point(color = "blue", size = 3) +
geom_smooth(method = "loess", se = FALSE, color = "red") +
labs(title = "Scatter Plot: Hours Studied vs Exam Scores",
x = "Hours Studied",
y = "Exam Score") +
theme_minimal()
We will be using the `mtcars` dataset, which is a standard dataset available in R. We’ll examine the relationship between `gear` (number of forward gears) and `cyl` (number of cylinders). Both of these variables can be treated as ordinal. Since `gear` and `cyl` are currently numeric, we’ll convert them to factors and build a contingency table.
# Load dataset
data(mtcars)
# Convert relevant variables to factors
mtcars$gear <- as.factor(mtcars$gear)
mtcars$cyl <- as.factor(mtcars$cyl)
# Perform Spearman Rank Correlation Test
spearman_real_test <- cor.test(as.numeric(mtcars$gear), as.numeric(mtcars$cyl), method = "spearman")
# Display table of gear and cylinder count
kable(table(mtcars$gear, mtcars$cyl), caption = "Contingency Table: Gear vs Cylinders")
| Gear / Cyl | 4 | 6 | 8 |
|---|---|---|---|
| 3 | 1 | 2 | 12 |
| 4 | 8 | 4 | 0 |
| 5 | 2 | 1 | 2 |
# Print test results
spearman_real_test
##
## Spearman's rank correlation rho
##
## data: as.numeric(mtcars$gear) and as.numeric(mtcars$cyl)
## S = 8534.9, p-value = 0.0007678
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.5643105
- If \(p < 0.05\), we reject \(H_0\): the number of gears and the number of cylinders are monotonically associated.
- If \(p \geq 0.05\), we fail to reject \(H_0\): the data do not provide sufficient evidence of an association.

Here \(p \approx 0.00077 < 0.05\), so we reject \(H_0\); the negative estimate (\(\rho \approx -0.56\)) indicates that cars with more cylinders tend to have fewer forward gears.
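Because gear and cyl are heavily tied, cor.test() falls back on an asymptotic approximation for the p-value (which is also why S above is not an integer) and, by default, warns that an exact p-value cannot be computed. One way to make that choice explicit, as a sketch:
# Request the asymptotic p-value explicitly to acknowledge the ties
cor.test(as.numeric(mtcars$gear), as.numeric(mtcars$cyl),
         method = "spearman", exact = FALSE)
Beyond a single pair of variables, we can also look at a full Spearman correlation matrix across several mtcars columns.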
# Convert factor columns back to numeric for the correlation matrix
mtcars_numeric <- mtcars # Create a copy of mtcars
# as.numeric() on a factor returns its level codes (1, 2, 3); because the levels
# are in increasing order, the ranking (and therefore Spearman's rho) is unchanged
mtcars_numeric$gear <- as.numeric(mtcars_numeric$gear)
mtcars_numeric$cyl <- as.numeric(mtcars_numeric$cyl)
# Compute the Spearman correlation matrix
corr_matrix <- cor(mtcars_numeric[, c("mpg", "hp", "wt", "gear", "cyl")], method = "spearman")
# Load necessary library
library(corrplot)
# Plot the Spearman correlation matrix
corrplot(corr_matrix, method = "circle", type = "lower",
tl.col = "black", title = "Spearman Correlation Matrix (mtcars)", mar = c(0,0,2,0))
The following dataset contains employee years of experience and their corresponding performance scores.
\[ \begin{array}{|c|c|c|} \hline \textbf{Employee ID} & \textbf{Years of Experience} & \textbf{Performance Score} \\ \hline 1 & 2 & 70 \\ 2 & 5 & 80 \\ 3 & 3 & 75 \\ 4 & 7 & 90 \\ 5 & 4 & 78 \\ \vdots & \vdots & \vdots \\ \hline \end{array} \]
Follow these steps to analyze the dataset:

- Enter the data into R with `data.frame()` or by importing a CSV file.
- Run `cor.test()` with `method = "spearman"` to obtain \(\rho\) and the p-value.
- Visualize the relationship with a scatter plot and interpret the result.

# Example dataset: Employee Performance vs Experience
experience <- c(2, 5, 3, 7, 4, 6, 8, 10, 1, 9)
performance <- c(70, 80, 75, 90, 78, 85, 88, 95, 65, 92)
# Convert to dataframe
df_employee <- data.frame(Experience = experience, Performance = performance)
# Perform Spearman Rank Correlation Test
spearman_emp_test <- cor.test(df_employee$Experience, df_employee$Performance, method = "spearman")
# Display data
kable(df_employee, caption = "Employee Performance Data")
| Experience | Performance |
|---|---|
| 2 | 70 |
| 5 | 80 |
| 3 | 75 |
| 7 | 90 |
| 4 | 78 |
| 6 | 85 |
| 8 | 88 |
| 10 | 95 |
| 1 | 65 |
| 9 | 92 |
# Print test results
spearman_emp_test
##
## Spearman's rank correlation rho
##
## data: df_employee$Experience and df_employee$Performance
## S = 2, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9878788
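As a cross-check, the same \(\rho\) can be reproduced with the manual rank formula from the first demonstration:
# Manual verification of rho for the employee data
d_emp <- rank(df_employee$Experience) - rank(df_employee$Performance)
n_emp <- nrow(df_employee)
1 - (6 * sum(d_emp^2)) / (n_emp * (n_emp^2 - 1))   # matches cor.test's estimate, 0.9878788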
ggplot(df_employee, aes(x = Experience, y = Performance)) +
geom_point(color = "blue", size = 3) +
geom_smooth(method = "loess", se = FALSE, color = "red") +
labs(title = "Scatter Plot: Experience vs Performance",
x = "Years of Experience",
y = "Performance Score") +
theme_minimal()