Correlation Analysis of Continuous Data

Pearson’s Correlation

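Pearson’s Correlation measures the linear association between two continuous variables whose data meet the normality assumption (or for which the central limit theorem applies). For a population, it is defined as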

\[ \rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X\sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X\sigma_Y} \]

where \(cov\) is the covariance, \(\sigma_X\) is the standard deviation of \(X\), \(\mu_X\) is the mean of \(X\), and \(E\) denotes the expected value.

For a sample, it is defined as

\[ r = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}} \]

The same definition can also be written as

\[ r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i-\bar{X}}{S_X}\right)\left(\frac{Y_i-\bar{Y}}{S_Y}\right) \]

\[ \bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_i, \qquad S_X = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^{2}} \]

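\(\bar{X}\) is the arithmetic mean of \(X\) and \(S_X\) is the sample standard deviation of \(X\); \(\bar{Y}\) and \(S_Y\) are defined in the same way for \(Y\).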
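As a quick check that the two sample formulas agree, the sketch below computes both by hand and compares them with the built-in `cor`. The vectors `x` and `y` are hypothetical toy data chosen only for illustration.

```r
# Hypothetical toy data (an assumption of this sketch, not taken from the text)
x <- c(2.1, 3.4, 4.0, 5.8, 7.2, 8.9)
y <- c(1.5, 2.9, 4.4, 5.1, 7.9, 8.2)
n <- length(x)

# First form: sum of centered cross-products divided by the
# product of the root sums of squares
r_sums <- sum((x - mean(x)) * (y - mean(y))) /
  (sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2)))

# Second form: sum of products of standardized scores divided by n - 1.
# sd() uses the n - 1 divisor, which matches the definition of S_X.
r_z <- sum(((x - mean(x)) / sd(x)) * ((y - mean(y)) / sd(y))) / (n - 1)

# Both forms should match each other and cor() up to floating-point error
all.equal(r_sums, r_z)
all.equal(r_sums, cor(x, y))
```

The `n - 1` factors cancel between the numerator and the two standard deviations, which is why the two forms give the same value.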

Spearman’s Rank Correlation

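Spearman’s Rank Correlation is a method for analyzing the correlation of continuous data that do not meet the normality assumption (or for which the central limit theorem does not apply). It is defined as Pearson’s Correlation between the rank variables: the data are first converted to ranks, and Pearson’s Correlation is then computed on those ranks.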

For example, the data are first ranked as follows. Tied values receive the average of the positions they occupy.

Value \(x_i\)	Descending position	Rank \(rg_{x_i}\)
0.8	5	5
1.0	4	\(\frac{3+4}{2} = 3.5\)
1.0	3	\(\frac{3+4}{2} = 3.5\)
2.5	2	2
3.0	1	1
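The ranking in the table can be reproduced with the base function `rank`. This is a minimal sketch: `rank` assigns ranks in ascending order, so the values are negated here to obtain descending ranks, and `ties.method = "average"` (the default) gives tied values the mean of their positions.

```r
# The five values from the table
x <- c(0.8, 1.0, 1.0, 2.5, 3.0)

# Negate to rank in descending order; ties share the average position
rg_x <- rank(-x, ties.method = "average")
rg_x
## [1] 5.0 3.5 3.5 2.0 1.0
```

Whether ranks are assigned in ascending or descending order does not change the resulting correlation, as long as both variables are ranked in the same direction.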

Spearman’s Rank Correlation is therefore defined as

\[ r_s = \rho_{rg_x, rg_y} = \frac{cov(rg_x, rg_y)}{\sigma_{rg_x}\sigma_{rg_y}} \]

where \(\rho\) is the Pearson’s Correlation defined earlier, applied to the rank variables. \(cov(rg_x, rg_y)\) is the covariance of the ranks, and \(\sigma_{rg_x}\) and \(\sigma_{rg_y}\) are the standard deviations of the ranks.
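The definition can be checked numerically: computing Pearson's correlation on the ranks should give the same value as asking `cor` for the Spearman coefficient directly. The simulated data and the seed below are assumptions made for this sketch only.

```r
set.seed(1)  # arbitrary seed, chosen only to make the sketch reproducible

# Hypothetical data with a monotone but nonlinear relationship plus noise
x <- rnorm(30)
y <- exp(x) + rnorm(30, sd = 0.5)

# Pearson's correlation applied to the ranks
r_s_manual <- cor(rank(x), rank(y), method = "pearson")

# The same quantity written as covariance over the product of
# the standard deviations of the ranks
r_s_cov <- cov(rank(x), rank(y)) / (sd(rank(x)) * sd(rank(y)))

# Built-in Spearman coefficient
r_s_builtin <- cor(x, y, method = "spearman")

all.equal(r_s_manual, r_s_builtin)
all.equal(r_s_cov, r_s_builtin)
```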

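In practice, if all \(n\) ranks are distinct integers (that is, there are no ties), the correlation can be computed directly from a derived shortcut formula.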

\[ r_s = 1 - \frac{6\sum d_i^2}{n(n^2-1)} \]

\(d_i = rg(x_i) - rg(y_i)\) is the difference between the ranks of the two variables for observation \(i\), and \(n\) is the number of observations.
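The standard shortcut for distinct ranks, \(r_s = 1 - 6\sum d_i^2 / (n(n^2-1))\), can be verified against `cor` on a small example. The two vectors are hypothetical and contain no ties, which is the condition under which the shortcut is exact.

```r
# Hypothetical tie-free data (an assumption of this sketch)
x <- c(10.2, 3.1, 7.7, 5.4, 8.9, 1.6)
y <- c(9.5, 2.0, 8.1, 6.6, 7.0, 3.3)
n <- length(x)

# Rank differences per observation
d <- rank(x) - rank(y)

# Shortcut formula, valid when all ranks are distinct
r_s_short <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Should match the built-in Spearman coefficient
all.equal(r_s_short, cor(x, y, method = "spearman"))
```

Here the squared rank differences sum to 4, so the shortcut gives \(1 - 24/210 \approx 0.886\). With tied values the shortcut is only approximate, and the rank-based Pearson definition should be used instead.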

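The correlation lies between -1 and 1, where 1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no correlation (no linear association for Pearson, no monotonic association for Spearman).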

Implementation in R

Correlation coefficient

The base R package stats provides the built-in function cor for computing correlations. Its method argument switches between the Pearson and Spearman calculations.

# the prototype of correlation
cor(x, y = NULL, use = "everything",
    method = c("pearson", "kendall", "spearman"))

set.seed(123)

# Prepare the data
# Each vector holds 111 points: a noisy linear trend (100 points)
# followed by an 11-point cluster.
# The cluster has 11 positions but only 10 noise draws, so rep_len()
# recycles the noise explicitly and the 11th point reuses the first draw.
data1 <- c(
  c(1:100) + rnorm(100, mean = 10, sd = 5) - rnorm(100, mean = 5, sd = 5),
  c(140:150) + rep_len(rnorm(10, mean = 5, sd = 5), 11)
)

data2 <- c(
  c(1:100) + rnorm(100, mean = 20, sd = 10) - rnorm(100, mean = 10, sd = 10),
  c(80:90) + rep_len(rnorm(10, mean = 5, sd = 5), 11)
)

# Plot the distribution of the data
plot(x=data1, y=data2, col="orange", pch=19)

# Pearson's Correlation
pearsonRes <- cor(data1, data2, use = "everything", method = "pearson")
pearsonRes

## [1] 0.7842095

# Spearman's Rank Correlation
spearmanRes <- cor(data1, data2, use = "everything", method = "spearman")
spearmanRes

## [1] 0.8814584

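The results show a clear difference between the Pearson’s and Spearman’s correlations (0.784 versus 0.881). A likely reason is that the 11 points appended at the high end of data1 fall well off the linear trend of the first 100 points, which lowers Pearson’s coefficient more than the rank-based Spearman’s coefficient.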

Correlation test

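The base R package stats also provides the built-in function cor.test, which tests whether a correlation differs from zero. Its method argument switches between the Pearson and Spearman calculations.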

# the prototype of cor.test
# alternative: two-sided or one-sided (less, greater) test
cor.test(x, y,
         alternative = c("two.sided", "less", "greater"),
         method = c("pearson", "kendall", "spearman"),
         exact = NULL, conf.level = 0.95, continuity = FALSE, ...)

Pearson’s Correlation

# Correlation test
pearsonTest <- cor.test(data1, data2, alternative = "two.sided", method = "pearson")

# Extract the correlation from the estimate element
pearsonTest$estimate

##       cor 
## 0.7842095

# Extract the p-value of the test from the p.value element
pearsonTest$p.value

## [1] 2.47712e-24

Spearman’s Rank Correlation

# Correlation test
spearmanTest <- cor.test(data1, data2, alternative = "two.sided", method = "spearman")

# Extract the correlation from the estimate element
spearmanTest$estimate

##       rho 
## 0.8814584

# Extract the p-value of the test from the p.value element
spearmanTest$p.value

## [1] 0
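Beyond `estimate` and `p.value`, the object returned by `cor.test` carries a few other elements that may be useful. The sketch below uses its own simulated data (the seed, sample size, and coefficients are assumptions) to show the confidence interval that is reported for the Pearson method, and the `exact = FALSE` option, which requests the asymptotic p-value for Spearman and avoids the warning R raises when an exact p-value cannot be computed because of ties.

```r
set.seed(42)  # arbitrary seed, chosen only to make the sketch reproducible

# Hypothetical data with a moderate linear relationship
x <- rnorm(50)
y <- 0.6 * x + rnorm(50, sd = 0.8)

# Pearson: the result includes a confidence interval for the correlation
pt <- cor.test(x, y, method = "pearson", conf.level = 0.95)
pt$estimate   # sample correlation, named "cor"
pt$conf.int   # lower and upper bounds of the 95% interval
pt$p.value

# Spearman on data that may contain ties (values rounded to one decimal).
# exact = FALSE uses the asymptotic approximation for the p-value.
x_tied <- round(x, 1)
st <- cor.test(x_tied, y, method = "spearman", exact = FALSE)
st$estimate   # rank correlation, named "rho"
st$p.value
```

A p-value printed as 0 means the value is smaller than what the floating-point representation can distinguish from zero, not that it is exactly zero.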