Correlation Coefficient Part 2

Methods for computing the correlation coefficient

Pearson correlation

This example uses the states dataset on income in the 50 U.S. states (U.S. Department of Commerce, Bureau of the Census (1977), Statistical Abstract of the United States).

##            Population    Income
## Population  1.0000000 0.2082276
## Income      0.2082276 1.0000000
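
The call that produced this matrix is not shown. A minimal sketch, assuming the data come from R's built-in state.x77 matrix (the object assigned to states later in this document), computes Pearson's r from its definition and compares it with cor(). The names x, y, r_manual and r_builtin are illustrative only.

```r
# Assumption: the data are the built-in state.x77 matrix (datasets package).
x <- state.x77[, "Population"]
y <- state.x77[, "Income"]

# Pearson's r: co-deviation of x and y scaled by the spread of each variable.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in equivalent; method = "pearson" is the default.
r_builtin <- cor(x, y, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree with the off-diagonal entry of the matrix above.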

Spearman rank correlation

library(DT)
library(dplyr)
library(corrplot)
# Keep the first two columns (Population, Income) of the built-in state.x77 matrix
states <- as.data.frame(state.x77[, 1:2])
states %>% datatable()

# Build it from the formula
n <- nrow(states)
states %>% 
  mutate(rgX = rank(Population, ties.method = "first"),
         rgY = rank(Income, ties.method = "first"),
         d = rgX - rgY,
         d2 = d^2) %>% 
  mutate(Spearman_cor = 1 - 6 * sum(d2) / (n * (n^2 - 1))) %>% 
  datatable()

# Use R's built-in function
M <- cor(states[, c("Population", "Income")], method = "spearman")
M

##            Population    Income
## Population  1.0000000 0.1246098
## Income      0.1246098 1.0000000
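
The shortcut formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)) is exact only when neither variable contains tied values. A small sketch with made-up vectors (x and y are hypothetical, not taken from the states data) shows that cor(..., method = "spearman") equals Pearson's r computed on average ranks, and that the shortcut drifts away from it once a tie appears.

```r
# Hypothetical toy data with one tie in x (two values equal to 2).
x <- c(1, 2, 2, 4, 5)
y <- c(3, 1, 4, 2, 5)
n <- length(x)

# Spearman's rho as cor() computes it: Pearson's r on average ranks.
rho_ranks   <- cor(rank(x), rank(y))
rho_builtin <- cor(x, y, method = "spearman")

# Shortcut formula with ties broken by order of appearance (ties.method = "first").
d <- rank(x, ties.method = "first") - rank(y, ties.method = "first")
rho_shortcut <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

print(c(ranks = rho_ranks, builtin = rho_builtin, shortcut = rho_shortcut))
```

For Population and Income the two approaches agree, which suggests those columns have no ties; for data with ties, the rank-based computation is the safer choice.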

Kendall rank correlation

Kendall's coefficient measures the strength of association between the ranks of two variables (rank-ordered variables). It is used in the same way as Spearman's coefficient, and its value is usually smaller in magnitude than Spearman's.
Kendall's coefficient is used less often than the two correlation coefficients above.
In R: cor(df, method = "kendall")

M <- cor(states, method = "kendall")
M

##            Population     Income
## Population 1.00000000 0.08408163
## Income     0.08408163 1.00000000
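
Kendall's tau can be read as a comparison of concordant and discordant pairs. The sketch below, assuming the built-in state.x77 data, counts those pairs directly and compares the result with cor(). The names sx, sy, concordant, discordant and tau_manual are illustrative.

```r
# Assumption: same built-in data as the rest of this section.
x <- state.x77[, "Population"]
y <- state.x77[, "Income"]

# Sign of every pairwise difference: +1, -1, or 0 for a tie.
sx <- sign(outer(x, x, "-"))
sy <- sign(outer(y, y, "-"))

# A pair is concordant when both signs agree (product +1)
# and discordant when they disagree (product -1).
concordant <- sum(sx * sy == 1) / 2   # each pair appears twice in the matrix
discordant <- sum(sx * sy == -1) / 2

# Normalising by the non-tied pairs of each variable gives tau-b, which
# reduces to (C - D) / (n * (n - 1) / 2) when there are no ties.
tau_manual  <- sum(sx * sy) / sqrt(sum(sx^2) * sum(sy^2))
tau_builtin <- cor(x, y, method = "kendall")

print(c(concordant = concordant, discordant = discordant,
        manual = tau_manual, builtin = tau_builtin))
```

The pairwise comparison costs on the order of n^2 operations, which is fine for 50 states but worth keeping in mind for large samples.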

Statistical tests for the correlation between two variables

As with the computation itself, the test comes in three methods: Pearson, Spearman, and Kendall.
Hypotheses:
- \(H_{0}\): No correlation (correlation coefficient = 0)
- \(H_{a}\): There is a correlation (correlation coefficient ≠ 0)
Example: use the cor.test function from the stats package. The same dataset is used, with two other variables: the crime rate (the Murder column) and the illiteracy rate (the Illiteracy column) of each state.

library(dplyr)
states <- state.x77
states[,c(3,5)] %>% cor()

##            Illiteracy    Murder
## Illiteracy  1.0000000 0.7029752
## Murder      0.7029752 1.0000000

cor.test(states[,3], states[,5])

## 
##  Pearson's product-moment correlation
## 
## data:  states[, 3] and states[, 5]
## t = 6.8479, df = 48, p-value = 1.258e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5279280 0.8207295
## sample estimates:
##       cor 
## 0.7029752
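
The t statistic, p-value and confidence interval in this output follow from r and the sample size alone. A minimal sketch, assuming the built-in state.x77 data, reproduces them by hand; r, n, t_stat, p_value and ci are illustrative names.

```r
# Assumption: same columns as the cor.test() call above (Illiteracy, Murder).
x <- state.x77[, "Illiteracy"]
y <- state.x77[, "Murder"]

r <- cor(x, y)
n <- length(x)

# Under H0 (true correlation = 0), t = r * sqrt(n - 2) / sqrt(1 - r^2)
# follows a t distribution with n - 2 degrees of freedom.
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value.
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval via Fisher's z transformation.
z  <- atanh(r)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) / sqrt(n - 3))

print(c(t = t_stat, df = n - 2, p = p_value))
print(ci)
```

Because the p-value is far below 0.05, H0 is rejected: illiteracy and murder rate are positively correlated across states.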

However, cor.test only handles two variables at a time. To test many variables at once, use the corr.test function from the psych package.

library(psych)
corr.test(states[, 1:5], use = "complete", method = "pearson")

## Call:corr.test(x = states[, 1:5], use = "complete", method = "pearson")
## Correlation matrix 
##            Population Income Illiteracy Life Exp Murder
## Population       1.00   0.21       0.11    -0.07   0.34
## Income           0.21   1.00      -0.44     0.34  -0.23
## Illiteracy       0.11  -0.44       1.00    -0.59   0.70
## Life Exp        -0.07   0.34      -0.59     1.00  -0.78
## Murder           0.34  -0.23       0.70    -0.78   1.00
## Sample Size 
## [1] 50
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##            Population Income Illiteracy Life Exp Murder
## Population       0.00   0.44       0.91     0.91   0.09
## Income           0.15   0.00       0.01     0.09   0.43
## Illiteracy       0.46   0.00       0.00     0.00   0.00
## Life Exp         0.64   0.02       0.00     0.00   0.00
## Murder           0.01   0.11       0.00     0.00   0.00
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option

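This function returns both the correlation coefficients and the p-values of the tests.

When a p-value is greater than the significance level alpha (= 0.05), H0 is not rejected: at that level there is not enough evidence that the correlation coefficient differs from 0.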
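
To make the decision rule concrete, the sketch below rebuilds the p-values with base R only. It runs cor.test on every pair among the first five columns of the built-in state.x77 matrix, then applies Holm's adjustment. That Holm is the adjustment behind the "adjusted for multiple tests" entries above is an assumption here, as are the names vars, pairs, p_raw, p_holm and result.

```r
# Assumption: same five columns as in the corr.test() output above.
vars  <- colnames(state.x77)[1:5]
pairs <- t(combn(vars, 2))          # all 10 unordered pairs, one per row

# Raw two-sided p-value of the Pearson test for each pair.
p_raw <- apply(pairs, 1, function(p) {
  cor.test(state.x77[, p[1]], state.x77[, p[2]])$p.value
})

# Holm adjustment for testing 10 correlations at once (assumed method).
p_holm <- p.adjust(p_raw, method = "holm")

alpha  <- 0.05
result <- data.frame(
  var1 = pairs[, 1], var2 = pairs[, 2],
  p_raw = round(p_raw, 4), p_holm = round(p_holm, 4),
  reject_H0 = p_holm < alpha        # TRUE: evidence of a nonzero correlation
)
print(result)
```

Rows with reject_H0 equal to FALSE are pairs for which the data do not provide enough evidence of a correlation at the 0.05 level.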

Tran Quang Quy - Department of Computer Sciences & Technology

2021-May-15

Plots and matrix displays of correlation in RStudio