## corrplot 0.84 loaded

关联矩阵分析

散点图-双变量关系的图形化表示

相关性 correlation

  • 两个变量间的关系到底有多强?

  • 取值范围[-1,1]

  • 符号正负表示方向

  • 绝对值大小表示强度

接近完美相关:

强相关:

弱相关相关:

不相关:

负相关:

非线性:

相关系数计算公式(标准化的协方差)

\[correlation = r(x, y)=\frac{Cov(x, y)}{\delta_x*\delta_y}\]

\[Cov(x, y) = \frac{\sum ^n _{i=1} (x_i-\bar{x})* (y_i-\bar{y})}{n}\]

\[\delta _x = \sqrt{\frac{\sum ^n _{i=1} (x_i-\bar{x})^2}{n}}\]

相关性的理解

尼古拉斯凯奇的电影与在游泳池溺亡的人数有关系吗?

摇滚乐与美国石油产量有关系吗?

高速路死亡率与鲜柠檬进口量有关系吗?

用R实现关联分析:

记得在自己本地电脑上安装install.packages("corrplot")

查看数据结构和内容:

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

所有变量间的相关系数:

cor(mtcars)
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000

保存所有相关系数为M,保留两位小数:

library(corrplot)
M<-cor(mtcars)
round(M,2)
##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00

corrplot主要有数种展现相关性的方法:“circle”,“pie”, “color”, “number”。 一般表达式为:

corrplot(corr, method="circle")
library(corrplot)
M<-cor(mtcars)
corrplot(M, method="circle")

可以根据需要改变相关性的表现形式:

library(corrplot)
M<-cor(mtcars)
corrplot(M, method="pie")

library(corrplot)
M<-cor(mtcars)
corrplot(M, method="color")

library(corrplot)
M<-cor(mtcars)
corrplot(M, method="number")

只保留一半的图形,另一半与之完全对称:

library(corrplot)
M<-cor(mtcars)
corrplot(M, type="upper")

library(corrplot)
M<-cor(mtcars)
corrplot(M, type="lower")

改字体颜色以及字体方向:

library(corrplot)
M<-cor(mtcars)
corrplot(M, type="upper", tl.col="black", tl.srt=45)

改变背景颜色:

library(corrplot)
M<-cor(mtcars)
# Change background color to lightblue
corrplot(M, type="upper", col=c("black", "white"),
         bg="lightblue")

使用下面的公式,计算相关性的P值,是否显著相关:

# mat : is a matrix of data
# ... : further arguments to pass to the native R cor.test function
cor.mtest <- function(mat, ...) {
    mat <- as.matrix(mat)
    n <- ncol(mat)
    p.mat<- matrix(NA, n, n)
    diag(p.mat) <- 0
    for (i in 1:(n - 1)) {
        for (j in (i + 1):n) {
            tmp <- cor.test(mat[, i], mat[, j], ...)
            p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
        }
    }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
# matrix of the p-value of the correlation
p.mat <- cor.mtest(mtcars)
head(p.mat[, 1:5])
##               mpg          cyl         disp           hp         drat
## mpg  0.000000e+00 6.112687e-10 9.380327e-10 1.787835e-07 1.776240e-05
## cyl  6.112687e-10 0.000000e+00 1.802838e-12 3.477861e-09 8.244636e-06
## disp 9.380327e-10 1.802838e-12 0.000000e+00 7.142679e-08 5.282022e-06
## hp   1.787835e-07 3.477861e-09 7.142679e-08 0.000000e+00 9.988772e-03
## drat 1.776240e-05 8.244636e-06 5.282022e-06 9.988772e-03 0.000000e+00
## wt   1.293959e-10 1.217567e-07 1.222320e-11 4.145827e-05 4.784260e-06
library(corrplot)
M<-cor(mtcars)

# Specialized the insignificant value according to the significant level
corrplot(M, type="upper", order="hclust", 
         p.mat = p.mat, sig.level = 0.01)

让不显著的关联消失:

library(corrplot)
M<-cor(mtcars)

# Leave blank on no significant coefficient
corrplot(M, type="upper", 
         p.mat = p.mat, sig.level = 0.01, insig = "blank")

还可以调整其它参数,作出自己想要的结果:

library(corrplot)
M<-cor(mtcars)

col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method="color", col=col(200),  
         type="upper", 
         addCoef.col = "black", # Add coefficient of correlation
         tl.col="black", tl.srt=45, #Text label color and rotation
         # Combine with significance
         p.mat = p.mat, sig.level = 0.01, insig = "blank", 
         # hide correlation coefficient on the principal diagonal
         diag=FALSE 
         )

使用 corrplot()函数 作出炫酷的相关性矩阵 correlation matrix,你学会了吗?

再教你一招,来补一刀!

install.packages("PerformanceAnalytics")

library("PerformanceAnalytics")
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
my_data <- mtcars[, c(1,3,4,5,6,7)]
chart.Correlation(my_data, histogram=TRUE, pch=19)