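Visualizing correlation with a scatter plot

You can plot the relationship between two variables in R with a scatter plot, as shown below. The last line of the following code block adds the correlation coefficient to the plot.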

# Data generation
set.seed(1)
x <- 1:100
y <- x + rnorm(100, mean = 0, sd = 15)

# Creating the plot
plot(x, y, pch = 19, col = "lightblue")

# Regression line
abline(lm(y ~ x), col = "red", lwd = 3)

# Pearson correlation
text(paste("Correlation:", round(cor(x, y), 2)), x = 80, y = 60)
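
The red regression line and the printed coefficient are closely related. As a minimal sketch (the names r, slope and ct are illustrative and not part of the original example), the slope of a simple linear regression equals the Pearson correlation scaled by the ratio of the standard deviations, and cor.test returns a confidence interval for the coefficient:

```r
# Same simulated data as in the example above
set.seed(1)
x <- 1:100
y <- x + rnorm(100, mean = 0, sd = 15)

r <- cor(x, y)                       # Pearson correlation coefficient
slope <- unname(coef(lm(y ~ x))[2])  # Slope of the fitted regression line

# For simple linear regression: slope = r * sd(y) / sd(x)
all.equal(slope, r * sd(y) / sd(x))

# cor.test also reports a p-value and a confidence interval
ct <- cor.test(x, y)
ct$estimate
ct$conf.int
```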

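Visualizing correlation with the pairs function

The most common function for creating a scatter plot matrix is pairs. For illustration we will use the well-known iris data set. Passing a data frame to pairs draws a scatter plot for every pair of variables. If you prefer, you can also specify a formula.

# Sample data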
data <- iris[, 1:4] # Numerical variables
groups <- iris[, 5] # Factor variable (groups)

# Plot correlation matrix
pairs(data)

# Equivalent with a formula
pairs(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = iris)

# Equivalent but using the plot function
plot(data) 
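
The scatter plot matrix shows the relationships graphically. If you also want the numbers behind it, a short sketch like the following (cor_mat is an illustrative name) computes the Pearson correlation for every pair of columns:

```r
data <- iris[, 1:4]    # Numerical variables, as above

cor_mat <- cor(data)   # Pearson correlation for every pair of columns
round(cor_mat, 2)      # Symmetric matrix with ones on the diagonal
```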

The function can be customized with several arguments. The following example shows how to fully customize the scatter plot matrix, coloring the data points by group.

pairs(data,                     # Data frame of variables
      labels = colnames(data),  # Variable names
      pch = 21,                 # Pch symbol
      bg = rainbow(3)[groups],  # Background color of the symbol (pch 21 to 25)
      col = rainbow(3)[groups], # Border color of the symbol
      main = "Iris dataset",    # Title of the plot
      row1attop = TRUE,         # If FALSE, changes the direction of the diagonal
      gap = 0.5,                # Distance between subplots
      cex.labels = NULL,        # Size of the diagonal text
      font.labels = 1)          # Font style of the diagonal text
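
The expression rainbow(3)[groups] works because indexing a vector with a factor uses the factor's integer codes. The sketch below makes that mapping explicit (palette3 and point_colors are illustrative names, not part of the original code):

```r
groups <- iris[, 5]               # Factor with three levels (species)
palette3 <- rainbow(3)            # One color per factor level

# Indexing by a factor uses its integer codes (1, 2, 3)
point_colors <- palette3[groups]

# Each species maps to exactly one color
table(groups, point_colors)
```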

# Function to add histograms
panel.hist <- function(x, ...) {
    usr <- par("usr")
    on.exit(par(usr = usr)) # Restore the user coordinates on exit
    par(usr = c(usr[1:2], 0, 1.5))
    his <- hist(x, plot = FALSE)
    breaks <- his$breaks
    nB <- length(breaks)
    y <- his$counts
    y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col = rgb(0, 1, 1, alpha = 0.5), ...)
    # lines(density(x), col = 2, lwd = 2) # Uncomment to add density lines
}

# Creating the scatter plot matrix
pairs(data,
      upper.panel = NULL,         # Disabling the upper panel
      diag.panel = panel.hist)    # Adding the histograms
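
To see what the histogram panel draws, it can help to look at the values it computes. The following sketch repeats the same steps outside of a panel function, using Sepal.Length as an arbitrary example (heights is an illustrative name): the bins are computed without plotting and the counts are rescaled so that the tallest bar has height 1, which leaves headroom under the panel's upper limit of 1.5.

```r
x <- iris$Sepal.Length

his <- hist(x, plot = FALSE)             # Compute the bins without drawing
heights <- his$counts / max(his$counts)  # Tallest bar gets height 1

# n bars are delimited by n + 1 breaks
length(his$breaks) == length(his$counts) + 1
range(heights)
```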

# Function to add correlation coefficients
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
    usr <- par("usr")
    on.exit(par(usr = usr)) # Restore the user coordinates on exit
    par(usr = c(0, 1, 0, 1))
    Cor <- abs(cor(x, y)) # Remove abs function if desired
    txt <- paste0(prefix, format(c(Cor, 0.123456789), digits = digits)[1])
    if(missing(cex.cor)) {
        cex.cor <- 0.4 / strwidth(txt)
    }
    text(0.5, 0.5, txt,
         cex = 1 + cex.cor * Cor) # Resize the text by level of correlation
}

# Plotting the correlation matrix
pairs(data,
      upper.panel = panel.cor,    # Correlation panel
      lower.panel = panel.smooth) # Smoothed regression lines
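
The call format(c(Cor, 0.123456789), digits = digits)[1] in the correlation panel may look odd. A plausible reading is that the extra constant forces a common number of decimals, so every panel prints its coefficient with the same width. A small sketch with an arbitrary value:

```r
Cor <- 0.9

# Formatted alone, the trailing zero is dropped
alone <- format(Cor, digits = 2)

# Formatted together with a value that needs two decimals,
# both elements share the same number of decimals
padded <- format(c(Cor, 0.123456789), digits = 2)[1]

alone
padded
```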

# install.packages("gclus")
library(gclus)
## Loading required package: cluster
# Correlation in absolute terms
corr <- abs(cor(data)) 

colors <- dmat.color(corr)
order <- order.single(corr)

cpairs(data,                    # Data frame of variables
       order,                   # Order of the variables
       panel.colors = colors,   # Matrix of panel colors
       border.color = "grey70", # Borders color
       gap = 0.45,              # Distance between subplots
       main = "Ordered variables colored by correlation", # Main title
       show.points = TRUE,      # If FALSE, removes all the points
       pch = 21,                # pch symbol
       bg = rainbow(3)[iris$Species]) # Colors by group
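
The plot orders and colors the variables by the strength of their correlations. To read the same information as a table, a base R sketch like the one below lists each pair of variables once, sorted by absolute correlation. It does not use gclus, and the names idx, pair_table and ranked are illustrative.

```r
data <- iris[, 1:4]
corr <- abs(cor(data))   # Correlation in absolute terms

# Keep each pair once by taking the upper triangle of the matrix
idx <- which(upper.tri(corr), arr.ind = TRUE)
pair_table <- data.frame(
  var1    = rownames(corr)[idx[, "row"]],
  var2    = colnames(corr)[idx[, "col"]],
  abs_cor = corr[idx]
)

# Strongest relationships first
ranked <- pair_table[order(-pair_table$abs_cor), ]
ranked
```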

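The chart.Correlation function of the PerformanceAnalytics package is a shortcut for creating a correlation plot in R with histograms, density functions, smoothed regression lines and correlation coefficients with their significance levels. If there is no symbol, the correlation is not statistically significant at the 10% level; a dot indicates significance at the 10% level, and one, two and three stars indicate significance at the 5%, 1% and 0.1% levels, respectively.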

# install.packages("PerformanceAnalytics")
library(PerformanceAnalytics)
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
## 
##     legend
chart.Correlation(data, histogram = TRUE, method = "pearson")
## Warning in par(usr): argument 1 does not name a graphical parameter
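
The significance symbols in the chart summarize the p-value of a correlation test for each pair of variables. If you want to inspect those p-values directly, a sketch with base R's cor.test could look like this (the two pairs are chosen only for illustration, and weak and strong are illustrative names):

```r
data <- iris[, 1:4]

# A weakly correlated pair
weak <- cor.test(data$Sepal.Length, data$Sepal.Width, method = "pearson")
weak$estimate
weak$p.value

# A strongly correlated pair
strong <- cor.test(data$Petal.Length, data$Petal.Width, method = "pearson")
strong$estimate
strong$p.value
```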

The psych package provides two interesting functions to create correlation plots in R. The pairs.panels function is an extension of the pairs function that lets you easily add regression lines, histograms, confidence intervals and more, and customize several additional arguments.

# install.packages("psych")
library(psych)

pairs.panels(data,
             smooth = TRUE,      # If TRUE, draws loess smooths
             scale = FALSE,      # If TRUE, scales the correlation text font
             density = TRUE,     # If TRUE, adds density plots and histograms
             ellipses = TRUE,    # If TRUE, draws ellipses
             method = "pearson", # Correlation method (also "spearman" or "kendall")
             pch = 21,           # pch symbol
             lm = FALSE,         # If TRUE, plots linear fit rather than the LOESS (smoothed) fit
             cor = TRUE,         # If TRUE, reports correlations
             jiggle = FALSE,     # If TRUE, data points are jittered
             factor = 2,         # Jittering factor
             hist.col = 4,       # Histograms color
             stars = TRUE,       # If TRUE, adds significance level with stars
             ci = TRUE)          # If TRUE, adds confidence intervals
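
The method argument switches between Pearson, Spearman and Kendall correlations. To get a feeling for how much the choice matters on this data, the three coefficients can be compared with base R's cor, as in the following sketch (it does not call psych, and the object names are illustrative). It also checks that the Spearman coefficient is the Pearson coefficient computed on ranks.

```r
data <- iris[, 1:4]

pearson  <- cor(data, method = "pearson")
spearman <- cor(data, method = "spearman")
kendall  <- cor(data, method = "kendall")

# Compare the three coefficients for one pair of variables
round(c(pearson  = pearson["Sepal.Length", "Petal.Length"],
        spearman = spearman["Sepal.Length", "Petal.Length"],
        kendall  = kendall["Sepal.Length", "Petal.Length"]), 2)

# Spearman's coefficient is Pearson's coefficient computed on the ranks
ranks <- sapply(data, rank)
all.equal(spearman, cor(ranks))
```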

corrgram function

On the one hand, the corrgram package calculates the correlation of the data and draws correlograms. The function of the same name allows customization via panel functions. As an example, you can create a correlogram in R where the upper panel shows pie charts and the lower panel shows shaded boxes with the following code:

# install.packages("corrgram")
library(corrgram)
## 
## Attaching package: 'corrgram'
## The following object is masked _by_ '.GlobalEnv':
## 
##     panel.cor
corrgram(data,
         order = TRUE,              # If TRUE, PCA-based re-ordering
         upper.panel = panel.pie,   # Panel function above diagonal
         lower.panel = panel.shade, # Panel function below diagonal
         text.panel = panel.txt,    # Panel function of the diagonal
         main = "Correlogram")      # Main title
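
The order = TRUE argument reorders the variables based on a principal component analysis. One commonly described way of doing this is to sort the variables by the angle formed by their loadings on the first two eigenvectors of the correlation matrix. The sketch below implements that idea in base R. Whether corrgram computes the order in exactly this way is an assumption, and all names are illustrative.

```r
R <- cor(iris[, 1:4])

# Loadings of each variable on the first two eigenvectors
ev <- eigen(R)$vectors[, 1:2]
e1 <- ev[, 1]
e2 <- ev[, 2]

# Angle of each variable in the plane of the two eigenvectors
alpha <- ifelse(e1 > 0, atan(e2 / e1), atan(e2 / e1) + pi)

# Variables sorted by angle; similar variables end up next to each other
ord <- order(alpha)
colnames(R)[ord]
```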

There are several panel functions that you can use. Using the apropos function you can list all of them:

apropos("panel.")
##  [1] "pairs.panels"       "panel.bar"          "panel.conf"        
##  [4] "panel.cor"          "panel.cor"          "panel.density"     
##  [7] "panel.ellipse"      "panel.fill"         "panel.hist"        
## [10] "panel.lines.its"    "panel.lines.tis"    "panel.lines.ts"    
## [13] "panel.lines.zoo"    "panel.minmax"       "panel.pie"         
## [16] "panel.plot.custom"  "panel.plot.default" "panel.points.its"  
## [19] "panel.points.tis"   "panel.points.ts"    "panel.points.zoo"  
## [22] "panel.polygon.its"  "panel.polygon.tis"  "panel.polygon.ts"  
## [25] "panel.polygon.zoo"  "panel.pts"          "panel.rect.its"    
## [28] "panel.rect.tis"     "panel.rect.ts"      "panel.rect.zoo"    
## [31] "panel.segments.its" "panel.segments.tis" "panel.segments.ts" 
## [34] "panel.segments.zoo" "panel.shade"        "panel.smooth"      
## [37] "panel.text.its"     "panel.text.tis"     "panel.text.ts"     
## [40] "panel.text.zoo"     "panel.txt"
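
Note that apropos interprets its first argument as a regular expression, so the dot in "panel." matches any character and the list depends on which packages are attached. A slightly stricter search, sketched below, anchors the pattern and escapes the dot (panel_funs is an illustrative name):

```r
# Names that start with the literal prefix "panel."
panel_funs <- apropos("^panel\\.", mode = "function")
panel_funs

# panel.smooth ships with the graphics package, so it is always listed
"panel.smooth" %in% panel_funs
```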

corrplot and corrplot.mixed functions

On the other hand, the corrplot package is very flexible and can create a wide variety of correlograms with a single function. The most common arguments of the main function are described below, but we recommend calling ?corrplot for additional details. Note that this function takes the correlation matrix as input, not the original variables.

# install.packages("corrplot")
library(corrplot)
## corrplot 0.92 loaded
corrplot(cor(data),        # Correlation matrix
         method = "shade", # Correlation plot method
         type = "full",    # Correlation plot style (also "upper" and "lower")
         diag = TRUE,      # If TRUE (default), adds the diagonal
         tl.col = "black", # Labels color
         bg = "white",     # Background color
         title = "",       # Main title
         col = NULL)       # Color palette
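
Because the input is a correlation matrix rather than the raw data, it is worth checking that matrix before plotting, especially when the data contain missing values. The sketch below introduces a few NA values purely for illustration (data_na and M are hypothetical names) and uses the use argument of cor to obtain a complete matrix:

```r
data <- iris[, 1:4]

# Introduce missing values for illustration only
data_na <- data
data_na[1:5, 1] <- NA

# By default (use = "everything") missing values propagate to the result
anyNA(cor(data_na))

# Pairwise complete observations give a matrix without missing entries
M <- cor(data_na, use = "pairwise.complete.obs")
anyNA(M)
round(M, 2)
```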

The method argument lets you select between "circle" (default), "square", "ellipse", "number", "shade", "pie" and "color". As we previously used the shaded method, the remaining methods are shown in the following plot:

par(mfrow = c(2, 3))

# Circles
corrplot(cor(data), method = "circle",
         title = "method = 'circle'",
         tl.pos = "n", mar = c(2, 1, 3, 1)) 
# Squares 
corrplot(cor(data), method = "square",
         title = "method = 'square'",
         tl.pos = "n", mar = c(2, 1, 3, 1)) 
# Ellipses
corrplot(cor(data), method = "ellipse",
         title = "method = 'ellipse'",
         tl.pos = "n", mar = c(2, 1, 3, 1)) 
# Correlations
corrplot(cor(data), method = "number",
         title = "method = 'number'",
         tl.pos = "n", mar = c(2, 1, 3, 1)) 
# Pie charts
corrplot(cor(data), method = "pie",
         title = "method = 'pie'",
         tl.pos = "n", mar = c(2, 1, 3, 1)) 
# Colors
corrplot(cor(data), method = "color",
         title = "method = 'color'",
         tl.pos = "n", mar = c(2, 1, 3, 1)) 

par(mfrow = c(1, 1))

This function also allows reordering the variables. The ordering methods according to the documentation are: "original" (default order), "AOE" (angular order of eigenvectors), "FPC" (first principal component order), "hclust" (hierarchical clustering order) and "alphabet" (alphabetical order).

If you choose hierarchical clustering, you can select between the following methods: "ward", "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median" and "centroid". In this case, you can also draw rectangles around the clusters. An example is shown in the following block of code:

corrplot(cor(data),
         method = "circle",       
         order = "hclust",         # Ordering method of the matrix
         hclust.method = "ward.D", # If order = "hclust", is the cluster method to be used
         addrect = 2,              # If order = "hclust", number of cluster rectangles
         rect.col = 3,             # Color of the rectangles
         rect.lwd = 3)             # Line width of the rectangles
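
To see which variables end up inside each rectangle, the clustering can be reproduced approximately in base R. The sketch below assumes that 1 minus the correlation is used as the dissimilarity, which may not match the internal computation of corrplot exactly. The names dissim, hc and clusters are illustrative.

```r
corr <- cor(iris[, 1:4])

# Assumption: highly correlated variables are "close" to each other
dissim <- as.dist(1 - corr)

hc <- hclust(dissim, method = "ward.D")  # Same method as in the plot above
colnames(corr)[hc$order]                 # Variable order from the dendrogram

clusters <- cutree(hc, k = 2)            # Two groups, as with addrect = 2
clusters
```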

Finally, the corrplot.mixed function of the package allows drawing correlograms with mixed methods. You can combine the correlation plot methods by passing the desired method to the lower (below the diagonal) and upper (above the diagonal) arguments.

corrplot.mixed(cor(data),
               lower = "number", 
               upper = "circle",
               tl.col = "black")