In this demonstration, we will explore Spearman’s Rank Correlation, a powerful statistical tool used to assess the monotonic relationship between two variables. Unlike Pearson correlation, which measures linear relationships and assumes normally distributed data, Spearman’s correlation is nonparametric: it makes no assumptions about the underlying distribution of your data. That makes it incredibly versatile, especially when your data do not meet the assumptions of parametric tests.
The Spearman Rank Correlation Test is a non-parametric measure of statistical dependence between two variables. It assesses how well the relationship between two variables can be described using a monotonic function.
The Spearman correlation coefficient (\(\rho\)) is calculated as:
\[ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \]
where \(d_i = \operatorname{rank}(x_i) - \operatorname{rank}(y_i)\) is the difference in ranks for the \(i\)-th pair of observations and \(n\) is the number of observations.
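For example, with the \(n = 10\) observations used in the demonstration below, the squared rank differences sum to \(\sum d_i^2 = 2\), giving
\[ \rho = 1 - \frac{6 \times 2}{10(10^2 - 1)} = 1 - \frac{12}{990} \approx 0.988 \]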
Under the null hypothesis (\(H_0\): no monotonic association), the test statistic \(t = \rho \sqrt{\frac{n-2}{1-\rho^2}}\) approximately follows a t-distribution with \(n - 2\) degrees of freedom, which is what significance testing relies on.
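In practice, cor.test() handles this automatically (and uses an exact permutation method for small, tie-free samples). Purely to illustrate the approximation, here is a minimal sketch with made-up values for \(\rho\) and \(n\):
# Illustration only: t-approximation for testing H0: rho = 0 (assumes no ties)
rho_hat <- 0.70                              # hypothetical sample estimate
n_obs   <- 30                                # hypothetical sample size
t_stat  <- rho_hat * sqrt((n_obs - 2) / (1 - rho_hat^2))
2 * pt(-abs(t_stat), df = n_obs - 2)         # two-sided p-value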
In real-world scenarios, data is often messy! It might not be normally distributed, or the relationship between variables might not be perfectly linear. Spearman’s correlation steps in to help us understand associations even in these situations.
A monotonic relationship means that as one variable increases, the other variable tends to either increase or decrease, but not necessarily at a constant rate. It’s about the direction of the relationship, not the linearity.
Spearman’s correlation quantifies the strength and direction of this monotonic relationship.
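A quick illustration of the difference from Pearson: for a perfectly monotonic but non-linear relationship, Pearson’s coefficient falls below 1 while Spearman’s equals exactly 1.
# Monotonic but non-linear: y always increases with x, just not at a constant rate
x_demo <- 1:10
y_demo <- exp(x_demo)
cor(x_demo, y_demo, method = "pearson")   # noticeably less than 1
cor(x_demo, y_demo, method = "spearman")  # exactly 1, the rank orderings agree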
Spearman’s correlation works by:

- ranking each variable separately,
- computing the difference in ranks \(d_i\) for each pair of observations,
- squaring and summing those differences, and
- plugging \(\sum d_i^2\) into the formula above.

The R code below walks through each of these steps, starting from simulated data.
# Simulate some data (non-normally distributed for demonstration)
set.seed(123) # For reproducibility
x <- runif(10, 0, 10) ^ 2 # Squared to introduce non-linearity
y <- x + rnorm(10, 0, 5) # Add noise
# Create a data frame
df <- data.frame(x = x, y = y)
# Step 1: Rank the data
df$rank_x <- rank(df$x)
df$rank_y <- rank(df$y)
print("Step 1: Data with Ranks")
## [1] "Step 1: Data with Ranks"
print(df)
## x y rank_x rank_y
## 1 8.2700830 16.845408 2 3
## 2 62.1424987 64.447080 7 7
## 3 16.7262123 10.400906 3 2
## 4 77.9719736 74.537709 8 8
## 5 88.4478713 86.219561 10 10
## 6 0.2075395 6.327948 1 1
## 7 27.8895407 29.688610 5 5
## 8 79.6411751 81.645032 9 9
## 9 30.4080575 30.961471 6 6
## 10 20.8497016 18.070496 4 4
# Step 2: Calculate the difference in ranks (di)
df$diff_rank <- df$rank_x - df$rank_y
df$diff_rank_sq <- df$diff_rank^2
print("Step 2: Differences and Squared Differences in Ranks")
## [1] "Step 2: Differences and Squared Differences in Ranks"
print(df)
## x y rank_x rank_y diff_rank diff_rank_sq
## 1 8.2700830 16.845408 2 3 -1 1
## 2 62.1424987 64.447080 7 7 0 0
## 3 16.7262123 10.400906 3 2 1 1
## 4 77.9719736 74.537709 8 8 0 0
## 5 88.4478713 86.219561 10 10 0 0
## 6 0.2075395 6.327948 1 1 0 0
## 7 27.8895407 29.688610 5 5 0 0
## 8 79.6411751 81.645032 9 9 0 0
## 9 30.4080575 30.961471 6 6 0 0
## 10 20.8497016 18.070496 4 4 0 0
# Step 3: Calculate the sum of squared differences
sum_diff_sq <- sum(df$diff_rank_sq)
print("Step 3: Sum of Squared Differences")
## [1] "Step 3: Sum of Squared Differences"
print(sum_diff_sq)
## [1] 2
# Step 4: Calculate Spearman's correlation (rho)
n <- nrow(df)
rho <- 1 - (6 * sum_diff_sq) / (n * (n^2 - 1))
print("Step 4: Spearman's Correlation (rho)")
## [1] "Step 4: Spearman's Correlation (rho)"
print(rho)
## [1] 0.9878788
# Step 5: Verify with cor.test (for comparison)
cor_test_result <- cor.test(df$x, df$y, method = "spearman")
print("Step 5: cor.test Result")
## [1] "Step 5: cor.test Result"
print(cor_test_result)
##
## Spearman's rank correlation rho
##
## data: df$x and df$y
## S = 2, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9878788
# Displaying the steps in a more table-like format for the final output
final_table <- data.frame(
x = df$x,
y = df$y,
Rank_x = df$rank_x,
Rank_y = df$rank_y,
d_i = df$diff_rank,
d_i_sq = df$diff_rank_sq
)
print("Final Table with all Calculations")
## [1] "Final Table with all Calculations"
print(final_table)
## x y Rank_x Rank_y d_i d_i_sq
## 1 8.2700830 16.845408 2 3 -1 1
## 2 62.1424987 64.447080 7 7 0 0
## 3 16.7262123 10.400906 3 2 1 1
## 4 77.9719736 74.537709 8 8 0 0
## 5 88.4478713 86.219561 10 10 0 0
## 6 0.2075395 6.327948 1 1 0 0
## 7 27.8895407 29.688610 5 5 0 0
## 8 79.6411751 81.645032 9 9 0 0
## 9 30.4080575 30.961471 6 6 0 0
## 10 20.8497016 18.070496 4 4 0 0
cat("\nSpearman's correlation (calculated manually):", rho, "\n")
##
## Spearman's correlation (calculated manually): 0.9878788
cat("Spearman's correlation (from cor.test):", cor_test_result$estimate, "\n")
## Spearman's correlation (from cor.test): 0.9878788
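One practical caveat before moving on: the data above contained no tied values. When ties are present, tied observations receive the average of the ranks they span, the simple \(\sum d_i^2\) formula becomes only approximate, and cor.test() warns that an exact p-value cannot be computed. A small sketch with made-up tied values:
# Tied values receive average ranks
x_ties <- c(1, 2, 2, 3, 4)
rank(x_ties)                               # 1.0 2.5 2.5 4.0 5.0
y_ties <- c(5, 7, 7, 8, 9)
cor(x_ties, y_ties, method = "spearman")   # 1 here, because the rank orderings match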
Let’s analyze the correlation between Exam Scores and Hours Studied for 20 students.
# Generate Simulated Data
set.seed(123)
hours_studied <- runif(20, min = 1, max = 10) # Random study hours between 1 and 10
exam_scores <- hours_studied * 8 + rnorm(20, mean = 0, sd = 5) # Approx linear relationship + noise
# Create dataframe
df_spearman <- data.frame(Hours_Studied = hours_studied, Exam_Scores = exam_scores)
# Display the first few rows (kable() comes from the knitr package)
library(knitr)
kable(head(df_spearman), caption = "First Few Rows of Simulated Data")
| Hours_Studied | Exam_Scores |
|---|---|
| 3.588198 | 34.82599 |
| 8.094746 | 66.55704 |
| 4.680792 | 39.45020 |
| 8.947157 | 72.13067 |
| 9.464206 | 72.93444 |
| 1.410008 | 20.21463 |
# Perform Spearman Rank Correlation Test
spearman_test <- cor.test(df_spearman$Hours_Studied, df_spearman$Exam_Scores, method = "spearman")
# Print test result
spearman_test
##
## Spearman's rank correlation rho
##
## data: df_spearman$Hours_Studied and df_spearman$Exam_Scores
## S = 52, p-value = 6.473e-06
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9609023
The interpretation of the Spearman Rank Correlation Test depends on the p-value:

- If \(p < 0.05\), we reject \(H_0\): there is a statistically significant monotonic relationship between hours studied and exam scores.
- If \(p \geq 0.05\), we fail to reject \(H_0\): the data do not provide sufficient evidence of a monotonic relationship.

Here \(p \approx 6.5 \times 10^{-6}\), so we have strong evidence against \(H_0\), suggesting a meaningful monotonic association between the variables.
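The components of the test can also be pulled out of the result object programmatically, which is handy when the decision needs to be automated:
# Extract the estimate and p-value from the htest object
spearman_test$estimate    # rho
spearman_test$p.value     # p-value
ifelse(spearman_test$p.value < 0.05,
       "Reject H0: significant monotonic association",
       "Fail to reject H0")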
# Visualize the relationship (ggplot() comes from the ggplot2 package)
library(ggplot2)
ggplot(df_spearman, aes(x = Hours_Studied, y = Exam_Scores)) +
geom_point(color = "blue", size = 3) +
geom_smooth(method = "loess", se = FALSE, color = "red") +
labs(title = "Scatter Plot: Hours Studied vs Exam Scores",
x = "Hours Studied",
y = "Exam Score") +
theme_minimal()
We will be using the `mtcars` dataset, which is a standard dataset available in R. We’ll examine the relationship between `gear` (number of forward gears) and `cyl` (number of cylinders). Both of these variables can be treated as ordinal. Since `gear` and `cyl` are currently numeric, we’ll convert them to factors and build a contingency table.
# Load dataset
data(mtcars)
# Convert relevant variables to factors
mtcars$gear <- as.factor(mtcars$gear)
mtcars$cyl <- as.factor(mtcars$cyl)
# Perform Spearman Rank Correlation Test
spearman_real_test <- cor.test(as.numeric(mtcars$gear), as.numeric(mtcars$cyl), method = "spearman")
# Display table of gear and cylinder count
kable(table(mtcars$gear, mtcars$cyl), caption = "Contingency Table: Gear vs Cylinders")
| Gear / Cyl | 4 | 6 | 8 |
|---|---|---|---|
| 3 | 1 | 2 | 12 |
| 4 | 8 | 4 | 0 |
| 5 | 2 | 1 | 2 |
# Print test results
spearman_real_test
##
## Spearman's rank correlation rho
##
## data: as.numeric(mtcars$gear) and as.numeric(mtcars$cyl)
## S = 8534.9, p-value = 0.0007678
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.5643105
- If \(p < 0.05\), we reject \(H_0\): the number of gears and the number of cylinders are monotonically associated.
- If \(p \geq 0.05\), we fail to reject \(H_0\): the data do not provide sufficient evidence of an association.

Here \(p \approx 0.00077 < 0.05\), so we reject \(H_0\); the negative estimate (\(\rho \approx -0.56\)) indicates that cars with more cylinders tend to have fewer forward gears.
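Because gear and cyl are heavily tied, cor.test() falls back on an asymptotic approximation for the p-value (which is also why S above is not an integer) and, by default, warns that an exact p-value cannot be computed. One way to make that choice explicit, as a sketch:
# Request the asymptotic p-value explicitly to acknowledge the ties
cor.test(as.numeric(mtcars$gear), as.numeric(mtcars$cyl),
         method = "spearman", exact = FALSE)
Beyond a single pair of variables, we can also look at a full Spearman correlation matrix across several mtcars columns.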
# Convert factor columns back to numeric for the correlation matrix
mtcars_numeric <- mtcars # Create a copy of mtcars
# as.numeric() on a factor returns its level codes (1, 2, 3); because the levels
# are in increasing order, the ranking (and therefore Spearman's rho) is unchanged
mtcars_numeric$gear <- as.numeric(mtcars_numeric$gear)
mtcars_numeric$cyl <- as.numeric(mtcars_numeric$cyl)
# Compute the Spearman correlation matrix
corr_matrix <- cor(mtcars_numeric[, c("mpg", "hp", "wt", "gear", "cyl")], method = "spearman")
# Load necessary library
library(corrplot)
# Plot the Spearman correlation matrix
corrplot(corr_matrix, method = "circle", type = "lower",
tl.col = "black", title = "Spearman Correlation Matrix (mtcars)", mar = c(0,0,2,0))
The following dataset contains employee years of experience and their corresponding performance scores.
\[ \begin{array}{|c|c|c|} \hline \textbf{Employee ID} & \textbf{Years of Experience} & \textbf{Performance Score} \\ \hline 1 & 2 & 70 \\ 2 & 5 & 80 \\ 3 & 3 & 75 \\ 4 & 7 & 90 \\ 5 & 4 & 78 \\ \vdots & \vdots & \vdots \\ \hline \end{array} \]
Follow these steps to analyze the dataset:

- Enter the data into R with `data.frame()` or by importing a CSV file.
- Run `cor.test()` with `method = "spearman"` to obtain \(\rho\) and the p-value.
- Visualize the relationship with a scatter plot and interpret the result.

# Example dataset: Employee Performance vs Experience
experience <- c(2, 5, 3, 7, 4, 6, 8, 10, 1, 9)
performance <- c(70, 80, 75, 90, 78, 85, 88, 95, 65, 92)
# Convert to dataframe
df_employee <- data.frame(Experience = experience, Performance = performance)
# Perform Spearman Rank Correlation Test
spearman_emp_test <- cor.test(df_employee$Experience, df_employee$Performance, method = "spearman")
# Display data
kable(df_employee, caption = "Employee Performance Data")
| Experience | Performance |
|---|---|
| 2 | 70 |
| 5 | 80 |
| 3 | 75 |
| 7 | 90 |
| 4 | 78 |
| 6 | 85 |
| 8 | 88 |
| 10 | 95 |
| 1 | 65 |
| 9 | 92 |
# Print test results
spearman_emp_test
##
## Spearman's rank correlation rho
##
## data: df_employee$Experience and df_employee$Performance
## S = 2, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9878788
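As a cross-check, the same \(\rho\) can be reproduced with the manual rank formula from the first demonstration:
# Manual verification of rho for the employee data
d_emp <- rank(df_employee$Experience) - rank(df_employee$Performance)
n_emp <- nrow(df_employee)
1 - (6 * sum(d_emp^2)) / (n_emp * (n_emp^2 - 1))   # matches cor.test's estimate, 0.9878788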
ggplot(df_employee, aes(x = Experience, y = Performance)) +
geom_point(color = "blue", size = 3) +
geom_smooth(method = "loess", se = FALSE, color = "red") +
labs(title = "Scatter Plot: Experience vs Performance",
x = "Years of Experience",
y = "Performance Score") +
theme_minimal()