[14-03-10]

15. Histogram (hist)

A histogram is a useful graph for examining the distribution of data. It shows a single variable, that is, one-dimensional data.

hist(iris$Sepal.Width)


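Choosing the bar width, that is, the width of each bin, is the key factor that determines the shape of a histogram.
The default is breaks = "Sturges", which sets the number of bins to ceiling(log2(n) + 1). The bin width then follows from the range of the data, and hist() may adjust the breakpoints to round values.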
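
As a rough check of this rule, the bin count suggested by Sturges' formula can be computed by hand and compared with what hist() actually uses. The variable names below are illustrative, and plot = FALSE only suppresses the drawing.

```r
# Number of observations in the sample
n <- length(iris$Sepal.Width)

# Sturges' formula: suggested number of bins
k <- ceiling(log2(n) + 1)

# The same value from the helper function in grDevices
k_builtin <- nclass.Sturges(iris$Sepal.Width)

# hist() treats k as a suggestion and rounds the breakpoints to "pretty" values
h <- hist(iris$Sepal.Width, plot = FALSE)
bins_used <- length(h$counts)
bin_width <- unique(round(diff(h$breaks), 10))
```

For the 150 rows of iris the formula suggests 9 bins, while the histogram ends up with 12 bins of width 0.2, because the breakpoints are rounded to convenient values.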

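Another parameter of hist() is freq. Its default is NULL, in which case the bars show the number of observations in each bin (as long as the bins are equally wide).
With freq = FALSE, the probability density of each bin is drawn instead.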

hist(iris$Sepal.Width, freq = FALSE)


In a probability density plot, the total area of the bars is 1.

x <- hist(iris$Sepal.Width, freq = FALSE)
x

## $breaks
##  [1] 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.4 3.6 3.8 4.0 4.2 4.4
## 
## $counts
##  [1]  4  7 13 23 36 24 18 10  9  3  2  1
## 
## $density
##  [1] 0.13333 0.23333 0.43333 0.76667 1.20000 0.80000 0.60000 0.33333
##  [9] 0.30000 0.10000 0.06667 0.03333
## 
## $mids
##  [1] 2.1 2.3 2.5 2.7 2.9 3.1 3.3 3.5 3.7 3.9 4.1 4.3
## 
## $xname
## [1] "iris$Sepal.Width"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

sum(x$density) * 0.2

## [1] 1

As the code above shows, multiplying the sum of density by the bin width 0.2 gives 1.
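
The hard-coded 0.2 only works because every bin here has the same width. A slightly more general sketch takes the widths from the breaks themselves and also rebuilds density from counts. The names widths, area and density_by_hand are illustrative.

```r
# Compute the histogram without drawing it
h <- hist(iris$Sepal.Width, plot = FALSE)

# Width of each bin, taken from consecutive breakpoints
widths <- diff(h$breaks)

# Total area of the bars: sum of height * width
area <- sum(h$density * widths)

# density is the count divided by (total count * bin width)
density_by_hand <- h$counts / (sum(h$counts) * widths)
```

For example, the first bin holds 4 of 150 observations, so its density is 4 / (150 * 0.2) = 0.13333, matching the $density output above.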

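16. Density plot (density)

A density plot is drawn by kernel density estimation, so, unlike a histogram, its shape does not change abruptly at bin boundaries.
It is computed with the density() function.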

plot(density(iris$Sepal.Width))
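
It can help to look at what density() returns before plotting it. The sketch below inspects the object and approximates the area under the curve with a simple Riemann sum. The variable names are illustrative, and the area is only approximately 1.

```r
# Kernel density estimate with the default Gaussian kernel
d <- density(iris$Sepal.Width)

# Bandwidth chosen by the default rule (bw.nrd0)
bandwidth <- d$bw

# The estimate is evaluated on an evenly spaced grid (512 points by default)
n_points <- length(d$x)
grid_step <- diff(d$x[1:2])

# Approximate area under the curve; close to 1 but not exact
area <- sum(d$y) * grid_step
```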


A density plot and a histogram can be overlaid as follows.

hist(iris$Sepal.Width, freq = FALSE)
lines(density(iris$Sepal.Width))


The rug() function marks the positions of the actual data points along the axis of a density plot.

plot(density(iris$Sepal.Width))
rug(iris$Sepal.Width)


Because many data points share the same value, jitter() can be used to add a small amount of noise so that the marks do not overlap.

plot(density(iris$Sepal.Width))
rug(jitter(iris$Sepal.Width))
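
To see why jittering matters here, the following sketch counts how many distinct values the variable has before and after jitter(). The seed and variable names are arbitrary choices for illustration.

```r
w <- iris$Sepal.Width

# Measurements are recorded to one decimal place, so many values repeat
n_distinct <- length(unique(w))

# jitter() adds a small amount of uniform noise to each value
set.seed(1)  # arbitrary seed so the sketch is reproducible
w_jittered <- jitter(w)
n_distinct_jittered <- length(unique(w_jittered))

# Largest shift applied to any single observation
max_shift <- max(abs(w_jittered - w))
```

The shifts are small compared with the 0.1 measurement step, so the rug marks spread out without misrepresenting where the data lie.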


17. Bar plot (barplot)

Bar plots are drawn with the barplot() function.

barplot(tapply(iris$Sepal.Width, iris$Species, mean))


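This computes the mean of Sepal.Width for each species and draws the result as a bar plot.
tapply() takes the data, the grouping variable, and the function to apply to each group as its arguments.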
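
The values behind the bars can be inspected by storing the tapply() result first. This sketch also recomputes one group by hand for comparison. The names means and by_hand are illustrative.

```r
# Mean sepal width per species; the result is named by factor level
means <- tapply(iris$Sepal.Width, iris$Species, mean)
print(means)
##     setosa versicolor  virginica
##      3.428      2.770      2.974

# The same number for one group, computed by subsetting
by_hand <- mean(iris$Sepal.Width[iris$Species == "setosa"])
```

barplot() uses the names of this vector as the bar labels, which is why the species names appear under the bars.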

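18. Pie chart (pie)

Pie charts are drawn with the pie() function.
They are suited to showing what share of the data each category takes up.
To chart a numeric variable, it first has to be divided into intervals, which is what cut() does.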

cut(1:10, breaks = c(0, 5, 10))

##  [1] (0,5]  (0,5]  (0,5]  (0,5]  (0,5]  (5,10] (5,10] (5,10] (5,10] (5,10]
## Levels: (0,5] (5,10]

This splits the range into (0, 5] (0 < x ≤ 5) and (5, 10] (5 < x ≤ 10) and shows which interval each of the numbers 1 through 10 falls into.

cut(1:10, breaks = 3)

##  [1] (0.991,4] (0.991,4] (0.991,4] (4,7]     (4,7]     (4,7]     (4,7]    
##  [8] (7,10]    (7,10]    (7,10]   
## Levels: (0.991,4] (4,7] (7,10]

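This example divides 1 through 10 into three intervals. More precisely, cut() returns a factor that indicates which interval each value belongs to. The lowest boundary is 0.991 rather than 1 because cut() extends the range slightly so that the minimum value falls inside the first interval.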
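
Because the intervals are open on the left by default, a value equal to the lowest break is left out. The sketch below shows that edge case along with the labels and include.lowest arguments of cut(). The variable names and label text are illustrative.

```r
# Custom labels instead of interval notation
labelled <- cut(1:10, breaks = c(0, 5, 10), labels = c("low", "high"))

# The first interval is (1,5], so the value 1 matches no interval and becomes NA
open_left <- cut(1:10, breaks = c(1, 5, 10))

# include.lowest = TRUE closes the first interval: [1,5]
closed_left <- cut(1:10, breaks = c(1, 5, 10), include.lowest = TRUE)
```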

Now let's divide Sepal.Width into 10 intervals.

cut(iris$Sepal.Width, breaks = 10)

##   [1] (3.44,3.68] (2.96,3.2]  (2.96,3.2]  (2.96,3.2]  (3.44,3.68]
##   [6] (3.68,3.92] (3.2,3.44]  (3.2,3.44]  (2.72,2.96] (2.96,3.2] 
##  [11] (3.68,3.92] (3.2,3.44]  (2.96,3.2]  (2.96,3.2]  (3.92,4.16]
##  [16] (4.16,4.4]  (3.68,3.92] (3.44,3.68] (3.68,3.92] (3.68,3.92]
##  [21] (3.2,3.44]  (3.68,3.92] (3.44,3.68] (3.2,3.44]  (3.2,3.44] 
##  [26] (2.96,3.2]  (3.2,3.44]  (3.44,3.68] (3.2,3.44]  (2.96,3.2] 
##  [31] (2.96,3.2]  (3.2,3.44]  (3.92,4.16] (4.16,4.4]  (2.96,3.2] 
##  [36] (2.96,3.2]  (3.44,3.68] (3.44,3.68] (2.96,3.2]  (3.2,3.44] 
##  [41] (3.44,3.68] (2.24,2.48] (2.96,3.2]  (3.44,3.68] (3.68,3.92]
##  [46] (2.96,3.2]  (3.68,3.92] (2.96,3.2]  (3.68,3.92] (3.2,3.44] 
##  [51] (2.96,3.2]  (2.96,3.2]  (2.96,3.2]  (2.24,2.48] (2.72,2.96]
##  [56] (2.72,2.96] (3.2,3.44]  (2.24,2.48] (2.72,2.96] (2.48,2.72]
##  [61] (2,2.24]    (2.96,3.2]  (2,2.24]    (2.72,2.96] (2.72,2.96]
##  [66] (2.96,3.2]  (2.96,3.2]  (2.48,2.72] (2,2.24]    (2.48,2.72]
##  [71] (2.96,3.2]  (2.72,2.96] (2.48,2.72] (2.72,2.96] (2.72,2.96]
##  [76] (2.96,3.2]  (2.72,2.96] (2.96,3.2]  (2.72,2.96] (2.48,2.72]
##  [81] (2.24,2.48] (2.24,2.48] (2.48,2.72] (2.48,2.72] (2.96,3.2] 
##  [86] (3.2,3.44]  (2.96,3.2]  (2.24,2.48] (2.96,3.2]  (2.48,2.72]
##  [91] (2.48,2.72] (2.96,3.2]  (2.48,2.72] (2.24,2.48] (2.48,2.72]
##  [96] (2.96,3.2]  (2.72,2.96] (2.72,2.96] (2.48,2.72] (2.72,2.96]
## [101] (3.2,3.44]  (2.48,2.72] (2.96,3.2]  (2.72,2.96] (2.96,3.2] 
## [106] (2.96,3.2]  (2.48,2.72] (2.72,2.96] (2.48,2.72] (3.44,3.68]
## [111] (2.96,3.2]  (2.48,2.72] (2.96,3.2]  (2.48,2.72] (2.72,2.96]
## [116] (2.96,3.2]  (2.96,3.2]  (3.68,3.92] (2.48,2.72] (2,2.24]   
## [121] (2.96,3.2]  (2.72,2.96] (2.72,2.96] (2.48,2.72] (3.2,3.44] 
## [126] (2.96,3.2]  (2.72,2.96] (2.96,3.2]  (2.72,2.96] (2.96,3.2] 
## [131] (2.72,2.96] (3.68,3.92] (2.72,2.96] (2.72,2.96] (2.48,2.72]
## [136] (2.96,3.2]  (3.2,3.44]  (2.96,3.2]  (2.96,3.2]  (2.96,3.2] 
## [141] (2.96,3.2]  (2.96,3.2]  (2.48,2.72] (2.96,3.2]  (3.2,3.44] 
## [146] (2.96,3.2]  (2.48,2.72] (2.96,3.2]  (3.2,3.44]  (2.96,3.2] 
## 10 Levels: (2,2.24] (2.24,2.48] (2.48,2.72] (2.72,2.96] ... (4.16,4.4]

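The resulting factor cannot be passed to pie() directly; we first need to count how many observations fall into each interval. The table() function serves this purpose: it takes factor values and builds a contingency table.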

table(rep(c("a", "b", "c"), 1:3))

## 
## a b c 
## 1 2 3

table(cut(iris$Sepal.Width, breaks = 10))

## 
##    (2,2.24] (2.24,2.48] (2.48,2.72] (2.72,2.96]  (2.96,3.2]  (3.2,3.44] 
##           4           7          22          24          50          18 
## (3.44,3.68] (3.68,3.92] (3.92,4.16]  (4.16,4.4] 
##          10          11           2           2

Let's draw a pie chart from this table.

pie(table(cut(iris$Sepal.Width, breaks = 10)))


The chart is useful for getting a rough visual impression of how much of the data falls into each interval.
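
When the exact shares matter, they can be computed with prop.table() and written into the slice labels. This is one possible sketch; the label format and the variable names are arbitrary choices.

```r
# Counts per interval, as in the text above
counts <- table(cut(iris$Sepal.Width, breaks = 10))

# Convert counts to proportions that sum to 1
shares <- prop.table(counts)

# Percentages rounded to one decimal place, as a plain numeric vector
percent <- as.vector(round(100 * shares, 1))

# Pie chart with "interval percent%" labels
pie(counts, labels = paste0(names(counts), " ", percent, "%"))
```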

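19. Mosaic plot (mosaicplot)

A mosaic plot is suited to displaying multivariate categorical data.
It is drawn with the mosaicplot() function.
Rectangles are laid out in the plot, and the area of each rectangle is proportional to the number of observations in the corresponding category.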

str(Titanic)

##  table [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
##  - attr(*, "dimnames")=List of 4
##   ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
##   ..$ Sex     : chr [1:2] "Male" "Female"
##   ..$ Age     : chr [1:2] "Child" "Adult"
##   ..$ Survived: chr [1:2] "No" "Yes"

Titanic is an instance of the table class.

The simplest way to draw a mosaic plot is to pass the data directly to mosaicplot().

Titanic

## , , Age = Child, Survived = No
## 
##       Sex
## Class  Male Female
##   1st     0      0
##   2nd     0      0
##   3rd    35     17
##   Crew    0      0
## 
## , , Age = Adult, Survived = No
## 
##       Sex
## Class  Male Female
##   1st   118      4
##   2nd   154     13
##   3rd   387     89
##   Crew  670      3
## 
## , , Age = Child, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st     5      1
##   2nd    11     13
##   3rd    13     14
##   Crew    0      0
## 
## , , Age = Adult, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st    57    140
##   2nd    14     80
##   3rd    75     76
##   Crew  192     20


mosaicplot(Titanic, color = TRUE)


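color = TRUE shades the rectangles so that they are easier to tell apart.
However, plotting every combination of variables at once makes it harder to examine the distribution within individual groups. To look at only some of the variables, specify them in a formula as follows.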

mosaicplot(~Class + Survived, data = Titanic, color = TRUE)
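
The numbers behind this reduced plot can be obtained by summing the 4-dimensional table over the variables that are left out. The sketch below does this with apply(); the names class_by_survival and survival_rate are illustrative.

```r
# Sum over Sex and Age, keeping dimension 1 (Class) and dimension 4 (Survived)
class_by_survival <- apply(Titanic, c(1, 4), sum)
print(class_by_survival)
##       Survived
## Class   No Yes
##   1st  122 203
##   2nd  167 118
##   3rd  528 178
##   Crew 673 212

# Share of survivors within each class
survival_rate <- class_by_survival[, "Yes"] / rowSums(class_by_survival)
```

The row totals correspond to the widths of the class columns in the mosaic plot, and the survival rates to how each column is split.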


20. Scatterplot matrix

A scatterplot matrix shows a scatterplot for every pair of variables in multivariate data.
It is drawn with the pairs() function.

pairs(~Sepal.Width + Sepal.Length + Petal.Width + Petal.Length, data = iris, 
    col = c("red", "green", "blue")[iris$Species])


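How does c("red", "green", "blue")[iris$Species] work? A factor is stored internally as integer codes (here 1, 2 and 3 for setosa, versicolor and virginica), so indexing the color vector with the factor picks, for each row, the color at the position of that row's species.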

c("red", "green", "blue")[iris$Species]

##   [1] "red"   "red"   "red"   "red"   "red"   "red"   "red"   "red"  
##   [9] "red"   "red"   "red"   "red"   "red"   "red"   "red"   "red"  
##  [17] "red"   "red"   "red"   "red"   "red"   "red"   "red"   "red"  
##  [25] "red"   "red"   "red"   "red"   "red"   "red"   "red"   "red"  
##  [33] "red"   "red"   "red"   "red"   "red"   "red"   "red"   "red"  
##  [41] "red"   "red"   "red"   "red"   "red"   "red"   "red"   "red"  
##  [49] "red"   "red"   "green" "green" "green" "green" "green" "green"
##  [57] "green" "green" "green" "green" "green" "green" "green" "green"
##  [65] "green" "green" "green" "green" "green" "green" "green" "green"
##  [73] "green" "green" "green" "green" "green" "green" "green" "green"
##  [81] "green" "green" "green" "green" "green" "green" "green" "green"
##  [89] "green" "green" "green" "green" "green" "green" "green" "green"
##  [97] "green" "green" "green" "green" "blue"  "blue"  "blue"  "blue" 
## [105] "blue"  "blue"  "blue"  "blue"  "blue"  "blue"  "blue"  "blue" 
## [113] "blue"  "blue"  "blue"  "blue"  "blue"  "blue"  "blue"  "blue" 
## [121] "blue"  "blue"  "blue"  "blue"  "blue"  "blue"  "blue"  "blue" 
## [129] "blue"  "blue"  "blue"  "blue"  "blue"  "blue"  "blue"  "blue" 
## [137] "blue"  "blue"  "blue"  "blue"  "blue"  "blue"  "blue"  "blue" 
## [145] "blue"  "blue"  "blue"  "blue"  "blue"  "blue"
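
The indexing trick can be checked by converting the factor to its integer codes explicitly. The sketch also shows a name-based lookup, which does not depend on the order of the factor levels. All variable names here are illustrative.

```r
palette3 <- c("red", "green", "blue")

# Integer codes behind the factor: 1 = setosa, 2 = versicolor, 3 = virginica
codes <- as.integer(iris$Species)

# Indexing by the factor and by its codes gives the same result
by_factor <- palette3[iris$Species]
by_codes <- palette3[codes]

# Alternative: look colors up by species name
color_map <- c(setosa = "red", versicolor = "green", virginica = "blue")
by_name <- unname(color_map[as.character(iris$Species)])
```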

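21. Perspective plot (persp) and contour plot (contour)

A perspective plot draws three-dimensional data as a surface seen in perspective, using the persp() function.
The call has the form persp(x grid, y grid, z value at each grid point).
The outer() function is handy for computing those z values, because it applies a function to every combination of the elements of two vectors.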

outer(1:5, 1:3, "+")

##      [,1] [,2] [,3]
## [1,]    2    3    4
## [2,]    3    4    5
## [3,]    4    5    6
## [4,]    5    6    7
## [5,]    6    7    8

outer(1:5, 1:3, function(x, y) {
    x + y
})

##      [,1] [,2] [,3]
## [1,]    2    3    4
## [2,]    3    4    5
## [3,]    4    5    6
## [4,]    5    6    7
## [5,]    6    7    8
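
One detail worth knowing is that outer() does not call the function once per cell. It calls it a single time with two long vectors, so the function must be vectorized. The sketch below demonstrates this with a call counter and uses Vectorize() for a function that only handles scalars. The helper names are hypothetical.

```r
# Count how many times outer() invokes the function
n_calls <- 0
add_and_count <- function(x, y) {
    n_calls <<- n_calls + 1
    x + y
}
m <- outer(1:5, 1:3, add_and_count)

# A function written for single values only (if() needs a length-one condition)
larger_of <- function(x, y) if (x > y) x else y

# Vectorize() wraps it so that it can be used with outer()
m2 <- outer(1:5, 1:3, Vectorize(larger_of))
```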

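Let's plot a bivariate normal distribution. For each pair (X, Y), Z is the probability density; the density of a multivariate normal distribution is computed with dmvnorm() from the mvtnorm package.

Consider the following example.
It computes the probability density at x = 0, y = 0 when x and y both have mean 0 and the covariance matrix is the identity matrix.
The call has the form dmvnorm(c(x, y), mean, sigma).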

library(mvtnorm)
dmvnorm(c(0, 0), rep(0, 2), diag(2))

## [1] 0.1592
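
This value can be checked without mvtnorm. For a standard bivariate normal the density at the origin is 1 / (2 * pi), which is also the product of two standard normal densities at 0. The variable names are illustrative.

```r
# Closed form of the standard bivariate normal density at (0, 0)
closed_form <- 1 / (2 * pi)

# With an identity covariance matrix the two components are independent,
# so the joint density is the product of the marginal densities
product_of_marginals <- dnorm(0) * dnorm(0)

print(round(closed_form, 4))
## [1] 0.1592
```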

The following code sets up the x and y grids with seq(-3, 3, 0.1); the z values will be computed for every combination of these grid points.

x <- seq(-3, 3, 0.1)
y <- x

In dmvnorm(), the mean defaults to a zero vector and the covariance matrix defaults to the identity matrix, so the density function can be written as follows.

f <- function(x, y) {
    dmvnorm(cbind(x, y))
}
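
If mvtnorm is not installed, a base R stand-in can be used to inspect the grid of z values. Note that f_base below is a hypothetical substitute for f, not the function from the text, and it is valid only under the assumption of zero means and an identity covariance matrix, where the joint density factors into two dnorm() terms.

```r
x <- seq(-3, 3, 0.1)
y <- x

# Stand-in for f: product of two independent standard normal densities
f_base <- function(x, y) dnorm(x) * dnorm(y)

# One z value per (x, y) combination: a 61 x 61 matrix
z <- outer(x, y, f_base)
grid_dim <- dim(z)

# The surface peaks at the center of the grid, where x = y = 0
peak <- which(z == max(z), arr.ind = TRUE)

# Rough volume under the surface over the grid (a little under 1)
volume <- sum(z) * 0.1^2
```

Passing this z to persp(x, y, z) should produce the same bell-shaped surface as the dmvnorm() version under these assumptions.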

Now let's draw the perspective plot.

persp(x, y, outer(x, y, f), theta = 30, phi = 30)


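theta and phi set the viewing direction: theta is the azimuthal (horizontal rotation) angle and phi is the colatitude (vertical) angle.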

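A contour plot is similar to a perspective plot, but instead of a three-dimensional surface it is a two-dimensional graph made of contour lines that connect points of equal value.
It is drawn with contour(), which takes the same x, y and z arguments as persp().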

contour(x, y, outer(x, y, f))
