1 R and RStudio: Features, Overview, and Installation

1.1 Introduction to R and RStudio

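R is a free, open-source programming language for statistical computing and graphics.

R supports a wide range of statistical and numerical-analysis techniques.

R can be extended by adding user-contributed packages¹.

Another strength of R is its graphics: it produces publication-quality plots that can include mathematical symbols.

R serves statisticians and researchers who need an environment for statistical computing and software development, and it also works well as a tool for matrix computation.

RStudio: an integrated development environment (IDE) built to make the open-source R more convenient to use.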

1.2 Installing R, RStudio, rtools40, and KoNLP

Install R (preferably with administrator privileges) (https://ftp.harukasan.org/CRAN/)

Install RStudio (preferably with administrator privileges) (https://rstudio.com/products/rstudio/download/#download)

Install rtools40 (into c:/Rtools) (https://cran.r-project.org/bin/windows/Rtools/index.html)

Install Java and rJava
install.packages("multilinguer")
library(multilinguer)
install_jdk()

Install the dependency packages
install.packages(c("hash", "tau", "Sejong", "RSQLite", "devtools", "bit", "rex", "lazyeval", "htmlwidgets", "crosstalk", "promises", "later", "sessioninfo", "xopen", "bit64", "blob", "DBI", "memoise", "plogr", "covr", "DT", "rcmdcheck", "rversions"), type = "binary")

Install remotes, which is needed to install packages from GitHub
install.packages("remotes")

Install KoNLP (works only on 64-bit R)
remotes::install_github('haven-jeon/KoNLP', upgrade = "never", INSTALL_opts=c("--no-multiarch"))

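1.3 R data structures (data structure)

1.3.1 Vectors, the core of R

The vector is the core data type in R.

All elements of a vector must have the same type.

A vector can hold n character strings or n integers (n a natural number).

A vector cannot keep integers and character strings side by side as separate types; when they are mixed, c() coerces every element to character.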

x <- c(1,3,5,7,9) 
x

## [1] 1 3 5 7 9

length(x)

## [1] 5

mode(x)

## [1] "numeric"
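To make the single-type rule concrete, here is a minimal sketch of what c() does when types are mixed. The variable names mixed and flags are illustrative and do not come from the original text.

```r
# Mixing numbers and strings: c() coerces everything to one common type.
mixed <- c(1, "a", 3)
mixed            # every element is now a character string
mode(mixed)      # "character"

# Logical values mixed with numbers are coerced to numeric (TRUE becomes 1).
flags <- c(TRUE, 2, FALSE)
flags
mode(flags)      # "numeric"
```

No error is raised in either case, so it is worth checking mode() or str() after combining values of different kinds.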

1.3.2 Character strings

A character vector holds elements of character type rather than numeric type.

y <- "abc" ; y ; length(y) ; mode(y)

## [1] "abc"

## [1] 1

## [1] "character"

z <- c("abc","def") ; z ; length(z) ; mode(z)

## [1] "abc" "def"

## [1] 2

## [1] "character"

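1.3.3 Matrices

A matrix in R is the same concept as a matrix in mathematics.

Technically, a matrix is a vector with two additional attributes: the number of rows and the number of columns.

A matrix can be built by combining vectors with rbind() (row-wise) or cbind() (column-wise).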

m <- rbind(c(1,4), c(2,2)) ; m

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    2

m %*% c(1,1) # matrix product of m and the vector (1,1)

##      [,1]
## [1,]    5
## [2,]    4

m1 <- cbind(c(1,4), c(2,2)) ; m1

##      [,1] [,2]
## [1,]    1    2
## [2,]    4    2
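The statement that a matrix is a vector with row and column attributes can be checked directly. The sketch below uses illustrative names (v, m2) and attaches a dim attribute to a plain vector by hand.

```r
# A matrix is a vector carrying a dim attribute (rows, columns).
v <- c(1, 2, 4, 2)      # plain vector; matrices are filled column by column
m2 <- v
dim(m2) <- c(2, 2)      # attaching dim turns the vector into a 2 x 2 matrix
is.matrix(m2)           # TRUE
m2[2, 1]                # element in row 2, column 1
as.vector(m2)           # dropping dim recovers the original vector
```

Because the values are stored column by column, m2 here matches the matrix m built with rbind(c(1,4), c(2,2)).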

1.3.4 Lists

Unlike a vector, a list can hold elements of different data types.

List elements can be given names and then accessed by name: listname$elementname²

x <- list(u=2, v="abc") ; x ; x$u

## $u
## [1] 2
## 
## $v
## [1] "abc"

## [1] 2

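1.3.5 Data frames (data.frame)

A typical data set contains data of several types.
For example, a transaction data set mixes character data such as product and supplier names with numeric data such as unit price and sales volume. In such cases, use a data frame instead of a matrix.

Technically, a data frame is a list whose elements are equal-length vectors, one per column.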

d <- data.frame(list(kids=c("Jack","Jill"), ages=c(12,10))) ; d

##   kids ages
## 1 Jack   12
## 2 Jill   10
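The list nature of a data frame can be seen with a few base R calls. This is a small sketch that rebuilds the same d as above so it runs on its own.

```r
d <- data.frame(kids = c("Jack", "Jill"), ages = c(12, 10))
is.list(d)        # TRUE: a data frame is stored as a list of columns
length(d)         # 2, the number of columns
d$ages            # a column is accessed like a list element
d[["kids"]][2]    # second element of the kids column
sapply(d, class)  # each column keeps its own type
```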

1.4 Checking and changing the working directory

Check the current working directory: getwd()

Change the working directory: setwd("C:/…/")

getwd()

## [1] "C:/Users/SAMSUNG/Documents/R/R_study"

setwd("C:/"); getwd()

## [1] "C:/"

setwd("C:/Users/SAMSUNG/Documents"); getwd()

## [1] "C:/Users/SAMSUNG/Documents"

1.4.1 Finding a directory path

As in the figure below, select a file in Windows Explorer, right-click it, and open Properties > Security to see its directory.

Copy that path, but replace each '\' with '/' before using it in R.

1.5 Using help

help() or ? : shows the documentation for a topic.

example() : runs the examples from a topic's documentation.

help(seq)  # look up the seq function
?seq  # same as help(seq)
?">"  # look up the ">" operator
?"for" # look up the "for" keyword
example(seq) # run the examples for seq

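1.6 Functions in R

Function: a group of commands that takes input values, computes on them, and returns a result.

For example, oddcount(), which counts the odd numbers among the given values, can be written and used as follows.

oddcount <- function(x) {
  k <- 0 # assign 0 to k
  for (n in x) {
    if (n %% 2 == 1) k <- k + 1  # %% is the remainder (modulo) operator
  }
  return(k)
}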
oddcount(c(1,3,5)); oddcount(c(1,2,3,7,9))

## [1] 3

## [1] 4
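The same count can be obtained without an explicit loop. The sketch below is an alternative, not the author's version; oddcount_vec is a hypothetical name chosen for illustration.

```r
# Vectorized variant: x %% 2 == 1 is TRUE for each odd element,
# and sum() counts the TRUE values.
oddcount_vec <- function(x) {
  sum(x %% 2 == 1)
}
oddcount_vec(c(1, 3, 5))
oddcount_vec(c(1, 2, 3, 7, 9))
```

Both calls should agree with the loop-based oddcount() above, since each version tests every element with the same remainder condition.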

1.7 Arithmetic operators
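As a brief sketch, the arithmetic operators available in base R can be tried on a pair of numbers. The values 7 and 2 are arbitrary choices for illustration.

```r
# Arithmetic operators in base R
7 + 2     # addition: 9
7 - 2     # subtraction: 5
7 * 2     # multiplication: 14
7 / 2     # division: 3.5
7 ^ 2     # exponentiation: 49
7 %/% 2   # integer division: 3
7 %% 2    # remainder (modulo): 1
```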

1.8 Comparison operators
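A short sketch of the comparison operators in base R follows. The names p and q are placeholders; each comparison returns a logical value, and on vectors the comparison is applied element by element.

```r
p <- 3
q <- 5
p == q    # equal to: FALSE
p != q    # not equal to: TRUE
p < q     # less than: TRUE
p <= q    # less than or equal to: TRUE
p > q     # greater than: FALSE
p >= q    # greater than or equal to: FALSE
cmp <- c(1, 5, 9) > p   # element-wise on a vector
cmp                     # FALSE TRUE TRUE
```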

1.9 Logical operators
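The logical operators of base R can be sketched on two small logical vectors. The names u and w are illustrative.

```r
u <- c(TRUE, TRUE, FALSE)
w <- c(TRUE, FALSE, FALSE)
!u             # NOT: FALSE FALSE TRUE
u & w          # element-wise AND: TRUE FALSE FALSE
u | w          # element-wise OR: TRUE TRUE FALSE
xor(u, w)      # exclusive OR: FALSE TRUE FALSE
u[1] && w[1]   # scalar AND, the form normally used inside if(): TRUE
any(u)         # TRUE if at least one element is TRUE
all(u)         # TRUE only if every element is TRUE
```

The single-character forms & and | work on whole vectors, while && and || compare single values and are intended for conditions in if() and while().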

2 Basic R Syntax

2.1 Assignment and logical tests

Assignment: = or <-

Equality test: ==

A = 2 ; A <- 2  # both forms assign 2 to A
A

## [1] 2

A == 2 # test whether A equals 2

## [1] TRUE

A != 2 # test whether A differs from 2

## [1] FALSE

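2.2 Using c()

c() creates vectors.

It combines values or objects into one. The recursive argument controls how lists are handled: with the default FALSE, c() simply concatenates its arguments and a list stays a list; with TRUE, c() descends into lists and flattens all their elements into a single vector.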

A = c(1,2,3,4,5)
a <- c(6,7,8,9,10)
A ; a

## [1] 1 2 3 4 5

## [1]  6  7  8  9 10

a <- c(list(A=c(a=2014),B=c(b=01345)),recursive=T); a

##  A.a  B.b 
## 2014 1345

b <- c(list(A=c(a=2014),B=c(b=01345)),recursive=F); b

## $A
##    a 
## 2014 
## 
## $B
##    b 
## 1345

2.3 Building sequences with rep() and seq()

seq(from, to, by)

rep(from:to, times)

The sequence operator (:) is another option: from:to generates an arithmetic sequence with step +1 or -1.³

x1 = c(1:10) # sequence from 1 to 10 in steps of 1
x2 = seq(1,10,2) # sequence from 1 to 10 in steps of 2
y1 = rep(1:3,2) # repeat 1 to 3 twice
y2 = rep(1:3,each=2) # repeat each of 1 to 3 twice, in order
x1;x2;y1;y2

##  [1]  1  2  3  4  5  6  7  8  9 10

## [1] 1 3 5 7 9

## [1] 1 2 3 1 2 3

## [1] 1 1 2 2 3 3

1:10

##  [1]  1  2  3  4  5  6  7  8  9 10

options(digits=2); -pi:pi; -2.345:7

## [1] -3.14 -2.14 -1.14 -0.14  0.86  1.86  2.86

##  [1] -2.35 -1.35 -0.35  0.65  1.65  2.65  3.65  4.65  5.65  6.65

T:4; -5:F; T:F

## [1] 1 2 3 4

## [1] -5 -4 -3 -2 -1  0

## [1] 1 0

1:5 +3; 1:5 *3

## [1] 4 5 6 7 8

## [1]  3  6  9 12 15

1:3+2i; 1+2i:5

## [1] 1+2i 2+2i 3+2i

## Warning: imaginary parts discarded in coercion

## [1] 1 2 3 4 5 6

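When seq() is called with two arguments, it behaves like the : operator.

Besides from = start, to = end, by = step, a sequence can also be specified as from = start, to = end, length.out = length of the result (length is accepted as an abbreviation).

along.with = x (abbreviated along = x) makes the result the same length as the object x.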

seq(1,30,length.out = 10)

##  [1]  1.0  4.2  7.4 10.7 13.9 17.1 20.3 23.6 26.8 30.0

seq(1,30,length = 10)

##  [1]  1.0  4.2  7.4 10.7 13.9 17.1 20.3 23.6 26.8 30.0

x <- c(1,2,3,4,5)
seq(2,20,along = x)

## [1]  2.0  6.5 11.0 15.5 20.0

seq(7,along.with = x)

## [1]  7  8  9 10 11

seq(1,10,along.with=c(1,2,3))

## [1]  1.0  5.5 10.0

sequence(1:3) # for each element n, sequence() generates 1:n and concatenates the results

## [1] 1 1 2 1 2 3

sequence(c(1,3,5))

## [1] 1 1 2 3 1 2 3 4 5

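rep() repeats values: the first argument is the value or values to repeat, and the second is the number of repetitions.

each = how many times each element is repeated.

length.out (abbreviated len or length) = length of the final vector after repetition.

times = how many times the whole vector is repeated.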

rep(1:3,2)

## [1] 1 2 3 1 2 3

rep(1:3,times=2)

## [1] 1 2 3 1 2 3

rep(1:3,each=2)

## [1] 1 1 2 2 3 3

rep(1:3,each=2,times=3)

##  [1] 1 1 2 2 3 3 1 1 2 2 3 3 1 1 2 2 3 3

rep(1:3,times=3,each=2)

##  [1] 1 1 2 2 3 3 1 1 2 2 3 3 1 1 2 2 3 3

rep(1:3,times=3,each=2,length=5)

## [1] 1 1 2 2 3

rep(1:3,times=3,each=2,len=8)

## [1] 1 1 2 2 3 3 1 1

rep(1:3,times=3,each=2,length.out=3)

## [1] 1 1 2

2.4 Building a data set with data.frame()

For analysis in R, it is usually best to store the data set as a data.frame.

x1 <- c(1:9) # sequence from 1 to 9 in steps of 1
y1 <- rep(1:3,3) # repeat 1 to 3 three times
DATA_SET = data.frame( X = x1, y = y1 )
DATA_SET

##   X y
## 1 1 1
## 2 2 2
## 3 3 3
## 4 4 1
## 5 5 2
## 6 6 3
## 7 7 1
## 8 8 2
## 9 9 3

2.5 Using parentheses ( ) and braces { }

( ): used with a function name to call the function.

{ }: groups several commands so that they run together as one block.

A <- c(1,2,3,4,5) ; A

## [1] 1 2 3 4 5

{ 
  a <- 1
  b <- c(1,2,3)
  a+b
}

## [1] 2 3 4

2.6 [ ]: index

For one-dimensional data

A <- c(1,2,3,4,5)
A[2] # 2nd value

## [1] 2

A[1:2] # 1st and 2nd values

## [1] 1 2

A[-3] # everything except the 3rd value

## [1] 1 2 4 5

A[c(1,2,4,5)] # 1st, 2nd, 4th, and 5th values

## [1] 1 2 4 5

For a data.frame

x1 <- c(1:9) # sequence from 1 to 9 in steps of 1
y1 <- rep(1:3,3) # repeat 1 to 3 three times
DATA_SET = data.frame( X = x1, y = y1 )
DATA_SET[1,] # all of row 1

##   X y
## 1 1 1

DATA_SET[,1] # all of column 1

## [1] 1 2 3 4 5 6 7 8 9

DATA_SET[c(1,2,3),-2] # rows 1, 2, 3 and every column except column 2

## [1] 1 2 3
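The last result is a plain vector rather than a data frame, because selecting a single column drops the data frame structure by default. The sketch below shows the drop = FALSE argument of the [ operator, which keeps the result as a data frame; the names one_col_vec and one_col_df are illustrative.

```r
DATA_SET <- data.frame(X = 1:9, y = rep(1:3, 3))
one_col_vec <- DATA_SET[c(1, 2, 3), -2]                # collapses to a vector
one_col_df  <- DATA_SET[c(1, 2, 3), -2, drop = FALSE]  # stays a data frame
is.data.frame(one_col_vec)   # FALSE
is.data.frame(one_col_df)    # TRUE
names(one_col_df)            # "X"
```

Keeping the data frame form is useful when later code expects column names or calls functions such as nrow().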

2.7 Variable types (data type)

2.7.1 Comparing nominal, ordinal, interval, and ratio scales

2.7.2 Setting a variable's type with as.*()

x = c(1,2,3,4,5,6,7,8,9,10)
x1 = as.integer(x); str(x1)  # to integer type

##  int [1:10] 1 2 3 4 5 6 7 8 9 10

x2 = as.numeric(x); str(x2)  # to numeric type

##  num [1:10] 1 2 3 4 5 6 7 8 9 10

x3 = as.factor(x); str(x3)  # to factor type

##  Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10

x4 = as.character(x); str(x4)  # to character type

##  chr [1:10] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
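One conversion deserves care: turning a factor back into numbers. The sketch below uses a made-up factor f to show that as.numeric() on a factor returns the internal level codes, not the original values.

```r
f <- as.factor(c(10, 20, 20, 30))
as.numeric(f)                 # 1 2 2 3: the internal level codes
as.numeric(as.character(f))   # 10 20 20 30: the original values
as.numeric(levels(f))[f]      # same values, converting each level only once
```

Going through as.character() first, or indexing the converted levels, recovers the values that the factor labels represent.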

2.7.3 Checking a variable's type with is.*()

x=c(1,2,3,4,5,6,7,8,9,10) 
y=c("str",'str2',"str3","str4")
is.integer(x)

## [1] FALSE

is.numeric(x)

## [1] TRUE

is.double(x)

## [1] TRUE

is.character(y)

## [1] TRUE

is.factor(y)

## [1] FALSE

2.8 Generating sample data with sample()

Lottery draw example: draw 6 numbers from 1 to 45 without replacement. replace = FALSE means sampling without replacement.

sample(1:45, 6, replace = FALSE)

## [1] 15 24 18 10 27 36

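2.8.1 set.seed()

set.seed(): used when random results, which otherwise differ on every run, need to be reproducible.
Any integer can be passed to set.seed(). The number initializes the random number generator, so running the same code after setting the same seed yields the same random results.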

set.seed(110); sample(1:45, 6, replace = FALSE)

## [1] 20 24 38 41 16 39
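The effect of the seed can be checked by drawing twice from the same starting state. This is a small sketch with illustrative variable names; the drawn numbers themselves depend on the R version and its sampling settings, so none are shown.

```r
set.seed(110)
first_draw <- sample(1:45, 6, replace = FALSE)

set.seed(110)                       # reset the generator to the same state
second_draw <- sample(1:45, 6, replace = FALSE)

identical(first_draw, second_draw)  # TRUE: same seed, same draw

# Without resetting the seed, the generator simply continues its stream.
third_draw <- sample(1:45, 6, replace = FALSE)
```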

2.8.2 Weighted sampling

Sampling can be weighted by giving each element its own value in prob.
The prob values need not sum to 1; elements with a larger prob are more likely to be drawn.

sample(1:10, 5, replace = TRUE, prob = 1:10)

## [1]  2  6  7  6 10

2.8.3 rnorm: normal random numbers

rnorm(10) # draw 10 random numbers from the standard normal distribution

##  [1] -0.045  1.484 -1.591  0.226  1.420  0.970  1.904 -0.700  1.778  0.206

rnorm(10,mean=10,sd=2) # draw 10 random numbers from a normal distribution with mean 10 and standard deviation 2

##  [1]  6.9 11.3 13.1  6.1 10.8  7.0 11.7  8.3 11.6 13.6

2.8.4 rbinom: binomial random numbers

rbinom(10,100,0.5) # draw 10 success counts, each from 100 trials with success probability 0.5

##  [1] 46 49 51 56 52 53 50 55 49 50

2.9 The for() statement

for(i in 1:5) { # i takes the integers 1 to 5 in turn
  print(i) # print i
}

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

LIST = seq(1,30,2) # vector of numbers from 1 to 30 in steps of 2
SPACE = c() # create an empty vector named SPACE
for(i in LIST){ # i takes each value in LIST in turn
  SPACE = c(SPACE,i*2) # append i*2 to SPACE
}
SPACE

##  [1]  2  6 10 14 18 22 26 30 34 38 42 46 50 54 58
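Growing a vector with c() inside a loop works, but the same result can be produced more directly. The sketch below shows two common alternatives under illustrative names (SPACE_vec, SPACE_pre): vectorized arithmetic, and a loop that fills a preallocated vector.

```r
LIST <- seq(1, 30, 2)

# Vectorized: arithmetic applies to every element at once, no loop needed.
SPACE_vec <- LIST * 2

# If a loop is needed, preallocating avoids regrowing the vector on each pass.
SPACE_pre <- numeric(length(LIST))
for (k in seq_along(LIST)) {
  SPACE_pre[k] <- LIST[k] * 2
}

identical(SPACE_vec, SPACE_pre)   # TRUE
```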

2.9.1 The 2 times table with for, while, and repeat

2.9.1.1 for loop

for(i in 1:9){
  x <- 2*i
  print(x)
}

## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 12
## [1] 14
## [1] 16
## [1] 18

2.9.1.2 while loop

i <- 1 # the variable used in the while condition must be set beforehand
while(i < 10){
  x <- 2*i
  print(x)
  i <- i+1
}

## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 12
## [1] 14
## [1] 16
## [1] 18

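A for loop needs the values to iterate over fixed in advance, whereas a while loop repeats as long as a condition holds, which is convenient when the number of iterations is not known beforehand.

options("scipen"=100, "digits"=4) # a sufficiently large "scipen" (about 100) prints numbers in fixed notation instead of scientific notation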
i <- 1
while(1){
  i <- 2 * i
  print(i)
  if(i > 1000000) break
}

## [1] 2
## [1] 4
## [1] 8
## [1] 16
## [1] 32
## [1] 64
## [1] 128
## [1] 256
## [1] 512
## [1] 1024
## [1] 2048
## [1] 4096
## [1] 8192
## [1] 16384
## [1] 32768
## [1] 65536
## [1] 131072
## [1] 262144
## [1] 524288
## [1] 1048576

2.9.1.3 repeat loop

i <- 1
repeat{
  x <- 2 * i
  print(x)
  if(i>=9) break
  i <- i + 1
}

## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 12
## [1] 14
## [1] 16
## [1] 18

2.10 if ~ else ~, if ~ else if ~ else ~, ifelse( , , )

2.10.1 A single condition (if ~ else ~)

A = c(1,2,3,4,5)
if(7 %in% A){
  print("TRUE")
} else{ 
  print("FALSE")
}

## [1] "FALSE"

2.10.2 Two or more conditions (if ~ else if ~ else ~)

x <- 12087
if(x %% 3 == 0) {
  y <- "multiple of 3"
  print(y)
} else if(x %% 3 == 1){
  y <- "remainder 1"
  print(y)
} else {
  y <- "remainder 2"
  print(y)
}

## [1] "multiple of 3"

2.10.3 ifelse() takes the form ifelse(test, yes, no) and can be nested to handle two or more conditions.

x <- 12087
ifelse(x %% 3 == 0,"multiple of 3", ifelse(x %% 3 == 1,"remainder 1","remainder 2"))

## [1] "multiple of 3"
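Unlike if(), which tests a single condition, ifelse() evaluates its test for every element of a vector. The sketch below applies the same nested call to three made-up values; xs and labels3 are illustrative names.

```r
xs <- c(12087, 10, 5)
labels3 <- ifelse(xs %% 3 == 0, "multiple of 3",
                  ifelse(xs %% 3 == 1, "remainder 1", "remainder 2"))
labels3   # one label per element of xs
```

This element-wise behavior is what makes ifelse() suitable for creating a new column from an existing one in a data frame.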

2.11 Installing and loading R packages

install.packages("ggplot2") # install the ggplot2 package
library(ggplot2) # load the ggplot2 package

2.12 Exercises

2.12.1 Use sample() to draw 6 lottery numbers from 1 to 45.

sample(1:45, 6, replace = F)

## [1] 40 24 10 38 45  7

2.12.2 Create vectors containing the following sequences.

AV=(1,3,5,7,9,⋯,99) BV=(1,1,2,2,3,3,4,4,5,5)

AV <- c(seq(1,99,2)); AV

##  [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
## [26] 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99

BV <- c(rep(1:5,each=2)); BV

##  [1] 1 1 2 2 3 3 4 4 5 5

2.12.3 Create the following matrix.

1 2 3
4 5 6
7 8 9

m=rbind(c(1:3),c(4:6),c(7:9)); m

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

2.12.4 Use function() to write the following function.

Quadratic(x,y)=x^2+y+10

Quadratic <- function(x,y){
  result <- x^2 + y + 10
  print(result)
}
Quadratic(2,3)

## [1] 17

2.12.4.1 A function that reads values from the prompt

A function can also read its inputs from the prompt and compute with them, as below.

Quadratic2 <- function(){
  x <- as.numeric(readline(prompt = "x:"))
  y <- as.numeric(readline(prompt = "y:"))
  result <- x^2 + y + 10
  print(result)
}
Quadratic2()

2.12.5 Use for() to print the multiplication table.

for(i in 2:9){
  for(j in 1:9){
    result <- paste(i,"×",j,"=",i*j)
    print(result)
  }
}

## [1] "2 × 1 = 2"
## [1] "2 × 2 = 4"
## [1] "2 × 3 = 6"
## [1] "2 × 4 = 8"
## [1] "2 × 5 = 10"
## [1] "2 × 6 = 12"
## [1] "2 × 7 = 14"
## [1] "2 × 8 = 16"
## [1] "2 × 9 = 18"
## [1] "3 × 1 = 3"
## [1] "3 × 2 = 6"
## [1] "3 × 3 = 9"
## [1] "3 × 4 = 12"
## [1] "3 × 5 = 15"
## [1] "3 × 6 = 18"
## [1] "3 × 7 = 21"
## [1] "3 × 8 = 24"
## [1] "3 × 9 = 27"
## [1] "4 × 1 = 4"
## [1] "4 × 2 = 8"
## [1] "4 × 3 = 12"
## [1] "4 × 4 = 16"
## [1] "4 × 5 = 20"
## [1] "4 × 6 = 24"
## [1] "4 × 7 = 28"
## [1] "4 × 8 = 32"
## [1] "4 × 9 = 36"
## [1] "5 × 1 = 5"
## [1] "5 × 2 = 10"
## [1] "5 × 3 = 15"
## [1] "5 × 4 = 20"
## [1] "5 × 5 = 25"
## [1] "5 × 6 = 30"
## [1] "5 × 7 = 35"
## [1] "5 × 8 = 40"
## [1] "5 × 9 = 45"
## [1] "6 × 1 = 6"
## [1] "6 × 2 = 12"
## [1] "6 × 3 = 18"
## [1] "6 × 4 = 24"
## [1] "6 × 5 = 30"
## [1] "6 × 6 = 36"
## [1] "6 × 7 = 42"
## [1] "6 × 8 = 48"
## [1] "6 × 9 = 54"
## [1] "7 × 1 = 7"
## [1] "7 × 2 = 14"
## [1] "7 × 3 = 21"
## [1] "7 × 4 = 28"
## [1] "7 × 5 = 35"
## [1] "7 × 6 = 42"
## [1] "7 × 7 = 49"
## [1] "7 × 8 = 56"
## [1] "7 × 9 = 63"
## [1] "8 × 1 = 8"
## [1] "8 × 2 = 16"
## [1] "8 × 3 = 24"
## [1] "8 × 4 = 32"
## [1] "8 × 5 = 40"
## [1] "8 × 6 = 48"
## [1] "8 × 7 = 56"
## [1] "8 × 8 = 64"
## [1] "8 × 9 = 72"
## [1] "9 × 1 = 9"
## [1] "9 × 2 = 18"
## [1] "9 × 3 = 27"
## [1] "9 × 4 = 36"
## [1] "9 × 5 = 45"
## [1] "9 × 6 = 54"
## [1] "9 × 7 = 63"
## [1] "9 × 8 = 72"
## [1] "9 × 9 = 81"

2.12.6 Use function() to write a function that tests whether a number is prime.

is.prime <- function(num) {
  if (num == 2) {
    TRUE
  } else if (any(num %% 2:(num-1) == 0)) {
    FALSE
  } else {
    TRUE
  }
}
is.prime(112312417)

## [1] FALSE

is.prime(5531)

## [1] TRUE
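The version above tests every candidate divisor up to num - 1, which becomes slow and memory-hungry for large inputs. A common refinement, sketched here under the hypothetical name is.prime2 and not part of the original exercise, checks only odd divisors up to the square root of num.

```r
is.prime2 <- function(num) {
  if (num < 2) return(FALSE)          # 0, 1 and negatives are not prime
  if (num < 4) return(TRUE)           # 2 and 3 are prime
  if (num %% 2 == 0) return(FALSE)    # even numbers above 2 are not prime
  limit <- floor(sqrt(num))
  if (limit < 3) return(TRUE)         # 5 and 7: no odd divisor candidates to test
  divisors <- seq(3, limit, by = 2)   # odd candidates up to sqrt(num)
  all(num %% divisors != 0)
}
is.prime2(5531)
is.prime2(91)    # 91 = 7 * 13
```

Stopping at the square root is sufficient because any factor larger than the square root must pair with one smaller than it.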

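2.12.6.1 A function that finds prime numbers

find.prime <- function() {
  primenumber <- c()
  num <- as.integer(readline("number?")) # readline() returns text, so convert it
  for(x in 2:num){
    if (all(x %% 2:(x-1) != 0)) { # x = 2 fails this test (2:1 counts down), so 2 is added below
      primenumber <- c(primenumber,x)
    }
  }
  print(c(2,primenumber))
}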

find.prime()

3 Loading, Modifying, Filtering, and Summarizing Data

3.1 Loading the HR data

library(readr)
HR <- read.csv('C:/Users/SAMSUNG/Documents/R/R_study/HR_comma_sep.csv') # path separated with /
# HR <- read.csv('c:\\Users\\SAMSUNG\\Documents\\R\\R_study\\HR_comma_sep.csv') # path separated with \\

head(HR) # shows first 6 rows

##   satisfaction_level last_evaluation number_project average_montly_hours
## 1               0.38            0.53              2                  157
## 2               0.80            0.86              5                  262
## 3               0.11            0.88              7                  272
## 4               0.72            0.87              5                  223
## 5               0.37            0.52              2                  159
## 6               0.41            0.50              2                  153
##   time_spend_company Work_accident left promotion_last_5years department salary
## 1                  3             0    1                     0      sales    low
## 2                  6             0    1                     0      sales medium
## 3                  4             0    1                     0      sales medium
## 4                  5             0    1                     0      sales    low
## 5                  3             0    1                     0      sales    low
## 6                  3             0    1                     0      sales    low

head(HR,n=3) # shows first 3 rows

##   satisfaction_level last_evaluation number_project average_montly_hours
## 1               0.38            0.53              2                  157
## 2               0.80            0.86              5                  262
## 3               0.11            0.88              7                  272
##   time_spend_company Work_accident left promotion_last_5years department salary
## 1                  3             0    1                     0      sales    low
## 2                  6             0    1                     0      sales medium
## 3                  4             0    1                     0      sales medium

tail(HR) # shows last 6 rows

##       satisfaction_level last_evaluation number_project average_montly_hours
## 14994               0.76            0.83              6                  293
## 14995               0.40            0.57              2                  151
## 14996               0.37            0.48              2                  160
## 14997               0.37            0.53              2                  143
## 14998               0.11            0.96              6                  280
## 14999               0.37            0.52              2                  158
##       time_spend_company Work_accident left promotion_last_5years department
## 14994                  6             0    1                     0    support
## 14995                  3             0    1                     0    support
## 14996                  3             0    1                     0    support
## 14997                  3             0    1                     0    support
## 14998                  4             0    1                     0    support
## 14999                  3             0    1                     0    support
##       salary
## 14994    low
## 14995    low
## 14996    low
## 14997    low
## 14998    low
## 14999    low

dim(HR) # returns the dimensions of data frame (i.e. number of rows and number of columns)

## [1] 14999    10

nrow(HR) # number of rows

## [1] 14999

ncol(HR) # number of columns

## [1] 10

str(HR) # structure of data frame - name, type and preview of data in each column

## 'data.frame':    14999 obs. of  10 variables:
##  $ satisfaction_level   : num  0.38 0.8 0.11 0.72 0.37 0.41 0.1 0.92 0.89 0.42 ...
##  $ last_evaluation      : num  0.53 0.86 0.88 0.87 0.52 0.5 0.77 0.85 1 0.53 ...
##  $ number_project       : int  2 5 7 5 2 2 6 5 5 2 ...
##  $ average_montly_hours : int  157 262 272 223 159 153 247 259 224 142 ...
##  $ time_spend_company   : int  3 6 4 5 3 3 4 5 5 3 ...
##  $ Work_accident        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ left                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ promotion_last_5years: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ department           : chr  "sales" "sales" "sales" "sales" ...
##  $ salary               : chr  "low" "medium" "medium" "low" ...

names(HR); colnames(HR) # both show the names attribute for a data frame

##  [1] "satisfaction_level"    "last_evaluation"       "number_project"       
##  [4] "average_montly_hours"  "time_spend_company"    "Work_accident"        
##  [7] "left"                  "promotion_last_5years" "department"           
## [10] "salary"

##  [1] "satisfaction_level"    "last_evaluation"       "number_project"       
##  [4] "average_montly_hours"  "time_spend_company"    "Work_accident"        
##  [7] "left"                  "promotion_last_5years" "department"           
## [10] "salary"

sapply(HR,class) # shows the class of each column in the data frame

##    satisfaction_level       last_evaluation        number_project 
##             "numeric"             "numeric"             "integer" 
##  average_montly_hours    time_spend_company         Work_accident 
##             "integer"             "integer"             "integer" 
##                  left promotion_last_5years            department 
##             "integer"             "integer"           "character" 
##                salary 
##           "character"

3.2 Summarizing the data

HR[6:10] <- lapply(HR[6:10],as.factor) # convert to factor
summary(HR) # descriptive statistics: min, max, mean, median, 1st and 3rd quartiles; for factors, the count of each level

##  satisfaction_level last_evaluation number_project average_montly_hours
##  Min.   :0.090      Min.   :0.360   Min.   :2.0    Min.   : 96         
##  1st Qu.:0.440      1st Qu.:0.560   1st Qu.:3.0    1st Qu.:156         
##  Median :0.640      Median :0.720   Median :4.0    Median :200         
##  Mean   :0.613      Mean   :0.716   Mean   :3.8    Mean   :201         
##  3rd Qu.:0.820      3rd Qu.:0.870   3rd Qu.:5.0    3rd Qu.:245         
##  Max.   :1.000      Max.   :1.000   Max.   :7.0    Max.   :310         
##                                                                        
##  time_spend_company Work_accident left      promotion_last_5years
##  Min.   : 2.0       0:12830       0:11428   0:14680              
##  1st Qu.: 3.0       1: 2169       1: 3571   1:  319              
##  Median : 3.0                                                    
##  Mean   : 3.5                                                    
##  3rd Qu.: 4.0                                                    
##  Max.   :10.0                                                    
##                                                                  
##        department      salary    
##  sales      :4140   high  :1237  
##  technical  :2720   low   :7316  
##  support    :2229   medium:6446  
##  IT         :1227                
##  product_mng: 902                
##  marketing  : 858                
##  (Other)    :2923

3.3 Assigning values by condition: binning a numeric variable

## 'High' if satisfaction_level is greater than 0.5, otherwise 'Low'
HR$satisfaction_level_1 = ifelse(HR$satisfaction_level > 0.5, "High", "Low")
HR$satisfaction_level_1 = as.factor(HR$satisfaction_level_1) 
summary(HR$satisfaction_level_1)

##  High   Low 
## 10187  4812

## 'High' if satisfaction_level is greater than 0.8, 'Mid' if greater than 0.5 and at most 0.8, 'Low' otherwise
HR$satisfaction_level_2 = ifelse(HR$satisfaction_level > 0.8, "High", 
                                 ifelse(HR$satisfaction_level > 0.5,"Mid","Low")) 
HR$satisfaction_level_2 = as.factor(HR$satisfaction_level_2) 
summary(HR$satisfaction_level_2)

## High  Low  Mid 
## 4002 4812 6185

3.4 Extracting rows that match a condition: subset

# keep only employees whose salary is high, as a new data set HR_high
HR_high <- subset(HR,salary == "high")

# keep employees whose salary is high and whose department is IT, as HR_high_IT (intersection)
HR_high_IT <- subset(HR,salary == 'high' & department == 'IT')

# keep employees whose salary is high or whose department is IT, as HR_high_IT2 (union)
HR_high_IT2 <- subset(HR,salary == "high" | department == "IT")
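The same filters can be written with logical indexing in square brackets. The sketch below uses hr_toy, a hypothetical four-row stand-in for the HR data invented for illustration, so that it runs without the CSV file.

```r
# Hypothetical miniature of the HR data, for illustration only.
hr_toy <- data.frame(
  department = c("IT", "sales", "IT", "hr"),
  salary     = c("high", "high", "low", "medium")
)

# Intersection: both conditions must hold.
by_subset  <- subset(hr_toy, salary == "high" & department == "IT")
by_bracket <- hr_toy[hr_toy$salary == "high" & hr_toy$department == "IT", ]
all.equal(by_subset, by_bracket)   # compares the two selections

# Union: either condition may hold.
either <- subset(hr_toy, salary == "high" | department == "IT")
nrow(either)                       # 3 rows of the toy data match
```

subset() lets column names be written without the hr_toy$ prefix, while the bracket form is the more explicit choice inside functions.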

3.5 Building aggregated data

library(plyr)
library(kableExtra)

# aggregate with ddply

SS <- ddply(HR, .(department,salary),summarise, # compute summaries for each department and salary combination
         M_SF = round(mean(satisfaction_level),2), # mean satisfaction_level
         COUNT = length(left), # number of employees in each department and salary group
         M_WH = round(mean(average_montly_hours),2)) # mean average_montly_hours
# round( , 2) rounds to two decimal places

names(SS)[3:5] <- c("satfisfaction level","left","hours/month")

summary(SS) %>% 
  kbl(align = "c") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

department	salary	satfisfaction level	left	hours/month
accounting : 3	high :10	Min. :0.570	Min. : 45	Min. :186
hr : 3	low :10	1st Qu.:0.600	1st Qu.: 185	1st Qu.:200
IT : 3	medium:10	Median :0.615	Median : 358	Median :201
management : 3	NA	Mean :0.617	Mean : 500	Mean :200
marketing : 3	NA	3rd Qu.:0.630	3rd Qu.: 514	3rd Qu.:203
product_mng: 3	NA	Max. :0.670	Max. :2099	Max. :209
(Other) :12	NA	NA	NA	NA

## sort by left in descending order

SS1 <- arrange(SS,desc(left))
# SS <- SS[order(as.integer(SS$left),decreasing = FALSE), ] # base R, ascending
# arrange(SS,left) # ascending

## show the contents of SS1
SS1

##     department salary satfisfaction level left hours/month
## 1        sales    low                0.60 2099       200.4
## 2        sales medium                0.63 1772       201.5
## 3    technical    low                0.59 1372       203.1
## 4    technical medium                0.62 1147       202.2
## 5      support    low                0.59 1146       198.9
## 6      support medium                0.65  942       202.5
## 7           IT    low                0.61  609       201.4
## 8           IT medium                0.62  535       204.3
## 9  product_mng    low                0.62  451       201.1
## 10   marketing    low                0.60  402       204.5
## 11 product_mng medium                0.62  383       199.6
## 12   marketing medium                0.64  376       196.9
## 13       RandD medium                0.62  372       202.9
## 14       RandD    low                0.62  364       198.8
## 15          hr medium                0.58  359       193.9
## 16  accounting    low                0.57  358       199.9
## 17  accounting medium                0.58  335       201.5
## 18          hr    low                0.61  335       202.5
## 19       sales   high                0.65  269       201.2
## 20  management   high                0.65  225       200.2
## 21  management medium                0.60  225       202.7
## 22   technical   high                0.63  201       200.0
## 23  management    low                0.61  180       200.7
## 24     support   high                0.66  141       204.0
## 25          IT   high                0.64   83       194.9
## 26   marketing   high                0.61   80       185.6
## 27  accounting   high                0.61   74       205.9
## 28 product_mng   high                0.61   68       194.6
## 29       RandD   high                0.59   51       199.8
## 30          hr   high                0.67   45       209.1

## display with the kable table functions
SS1 %>%
kbl(align = "c") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

department	salary	satfisfaction level	left	hours/month
sales	low	0.60	2099	200.4
sales	medium	0.63	1772	201.5
technical	low	0.59	1372	203.1
technical	medium	0.62	1147	202.2
support	low	0.59	1146	198.9
support	medium	0.65	942	202.5
IT	low	0.61	609	201.4
IT	medium	0.62	535	204.3
product_mng	low	0.62	451	201.1
marketing	low	0.60	402	204.5
product_mng	medium	0.62	383	199.6
marketing	medium	0.64	376	196.9
RandD	medium	0.62	372	202.9
RandD	low	0.62	364	198.8
hr	medium	0.58	359	193.9
accounting	low	0.57	358	199.9
accounting	medium	0.58	335	201.5
hr	low	0.61	335	202.5
sales	high	0.65	269	201.2
management	high	0.65	225	200.2
management	medium	0.60	225	202.7
technical	high	0.63	201	200.0
management	low	0.61	180	200.7
support	high	0.66	141	204.0
IT	high	0.64	83	194.9
marketing	high	0.61	80	185.6
accounting	high	0.61	74	205.9
product_mng	high	0.61	68	194.6
RandD	high	0.59	51	199.8
hr	high	0.67	45	209.1

# apply conditional formatting
SS1$salary <- cell_spec(SS1$salary,
                        color = ifelse(SS1$salary == "low", "red",
                                       ifelse(SS1$salary == "medium", "green","blue")))
SS1$`satfisfaction level` <- cell_spec(SS1$`satfisfaction level`,
                                       color = ifelse(SS1$`satfisfaction level` > 0.63, "blue",
                                                      ifelse(SS1$`satfisfaction level` > 0.6, "green","red")))
SS1$left <- cell_spec(SS1$left, 
                      color = ifelse(SS1$left > 514, "red", 
                                     ifelse(SS1$left > 185.25, "green", "blue")))
SS1$`hours/month` <- cell_spec(SS1$`hours/month`, 
                               color = ifelse(SS1$`hours/month` > 202.62, "red", 
                                              ifelse(SS1$`hours/month` > 199.66, "green","blue")))
SS1[2:5] <- SS1[c("salary","satfisfaction level","left","hours/month")]

SS1 %>%
kbl(escape = F, align = "c") %>%
  kable_paper("striped", full_width = F, position = "center")

department	salary	satfisfaction level	left	hours/month
sales	low	0.6	2099	200.36
sales	medium	0.63	1772	201.52
technical	low	0.59	1372	203.06
technical	medium	0.62	1147	202.25
support	low	0.59	1146	198.9
support	medium	0.65	942	202.54
IT	low	0.61	609	201.38
IT	medium	0.62	535	204.3
product_mng	low	0.62	451	201.05
marketing	low	0.6	402	204.49
product_mng	medium	0.62	383	199.64
marketing	medium	0.64	376	196.87
RandD	medium	0.62	372	202.95
RandD	low	0.62	364	198.75
hr	medium	0.58	359	193.86
accounting	low	0.57	358	199.9
accounting	medium	0.58	335	201.47
hr	low	0.61	335	202.46
sales	high	0.65	269	201.18
management	high	0.65	225	200.25
management	medium	0.6	225	202.65
technical	high	0.63	201	200.04
management	low	0.61	180	200.74
support	high	0.66	141	203.99
IT	high	0.64	83	194.93
marketing	high	0.61	80	185.57
accounting	high	0.61	74	205.91
product_mng	high	0.61	68	194.63
RandD	high	0.59	51	199.75
hr	high	0.67	45	209.07

4 Descriptive Statistics

4.1 Summarizing and inspecting data, converting to factor

summary(mtcars$mpg) # summarize the mpg column of mtcars, a data set built into R

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.4    15.4    19.2    20.1    22.8    33.9

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

mtcars[c(2,8:11)] <- lapply(mtcars[c(2,8:11)],as.factor) # convert the categorical columns to factor

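4.2 Measures of central tendency

The statistics most often used to locate the center of the data are the mean, the median, and the mode. The mean and median are used for quantitative data; the mode is used mainly for qualitative data.
Mean: when summarizing data, the arithmetic mean is used in most cases to show the value around which the collected data are distributed, as in average mileage, average score, average lifespan, or average income.

The mean \bar{X} is computed as follows:

\[ \bar{X}=\frac{\sum_{i=1}^{n} X_i}{n} \]

Median: the value in the middle when the data are sorted by size. The median is therefore less affected by outliers, that is, unusually large or unusually small values.
Mode: the value that occurs most often in the data. It is used mainly for qualitative data.

Mean, trimmed means (5% and 10% removed across both tails), median, and mode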

mean(mtcars$mpg) # mean

## [1] 20.09

mean(mtcars$mpg,trim=0.025,na.rm=T) # mean after removing 2.5% from each tail (with 32 values this removes nothing, so the result equals the plain mean)

## [1] 20.09

mean(mtcars$mpg,trim=0.05,na.rm=T) # mean after removing 5% from each tail

## [1] 19.95

colMeans(mtcars[c(1,3,4,6,7)]) # column means

##     mpg    disp      hp      wt    qsec 
##  20.091 230.722 146.688   3.217  17.849

median(mtcars$mpg) # median

## [1] 19.2

table(mtcars$cyl) # frequency table: cyl takes only the values 4, 6, and 8 in this data set

## 
##  4  6  8 
## 11  7 14

sort(table(mtcars$cyl),decreasing = T) # combined with sort(), the counts are listed in order of frequency

## 
##  8  4  6 
## 14 11  7

library(prettyR)
Mode(mtcars$cyl) # mode

## [1] "8"
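The mode can also be found with base R alone, without the prettyR package. The sketch below is one possible approach with illustrative names (counts, mode_cyl); it returns every value tied for the highest frequency.

```r
counts <- table(mtcars$cyl)                         # frequency of each cyl value
mode_cyl <- names(counts)[counts == max(counts)]    # value(s) with the highest count
mode_cyl                                            # "8" for the built-in mtcars data
```

The result is a character vector because table() stores the observed values as names.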

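4.3 Measures of dispersion

Variance and standard deviation: the most widely used measures of how far the data lie from the mean. The variance is the sum of squared deviations from the mean divided by n - 1, and the standard deviation is the positive square root of the variance. The variance $S^2$ and the standard deviation $S$ are computed as follows.

\[ S^2=\frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1}, \quad S=\sqrt{S^2} \]

Range: a simple way to express the spread of the data; it is the difference between the two extreme values when the data are sorted by size. The range is easy to compute, but it is not a reliable measure of spread when outliers are present.
Coefficient of variation (cv): used mainly to compare two or more groups whose means differ greatly, or to compare the relative homogeneity of groups. For distributions of the same shape, a larger mean generally comes with a larger standard deviation, so comparing only the absolute size of the standard deviations of groups with very different means can be misleading. It is more useful than the standard deviation for comparing the spread of data measured in different units. The coefficient of variation is computed as follows.

\[ cv=\frac{S}{\bar{X}} \]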

var(mtcars$mpg) # variance

## [1] 36.32

sd(mtcars$mpg) # standard deviation

## [1] 6.027
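The formulas for the variance, standard deviation, and coefficient of variation can be verified against the built-in functions. This is a minimal sketch with illustrative variable names (mpg_v, n_obs, s2, s_dev).

```r
mpg_v <- mtcars$mpg
n_obs <- length(mpg_v)

s2    <- sum((mpg_v - mean(mpg_v))^2) / (n_obs - 1)  # sample variance, divisor n - 1
s_dev <- sqrt(s2)                                    # standard deviation

all.equal(s2, var(mpg_v))      # TRUE
all.equal(s_dev, sd(mpg_v))    # TRUE
s_dev / mean(mpg_v)            # coefficient of variation
```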

range(mtcars$mpg) # minimum and maximum

## [1] 10.4 33.9

max(mtcars$mpg)-min(mtcars$mpg) # range = maximum - minimum

## [1] 23.5

IQR(mtcars$mpg) # IQR = 3rd quartile - 1st quartile

## [1] 7.375

apply(mtcars[c(1,3,4,6,7)],2,function(x){sd(x)/mean(x)}) # coefficient of variation

##    mpg   disp     hp     wt   qsec 
## 0.3000 0.5372 0.4674 0.3041 0.1001

# apply MARGIN argument: 1 = rows, 2 = columns, c(1, 2) = rows and columns

lapply(mtcars[c(1,3,4,6,7)],function(x){sd(x)/mean(x)})

## $mpg
## [1] 0.3
## 
## $disp
## [1] 0.5372
## 
## $hp
## [1] 0.4674
## 
## $wt
## [1] 0.3041
## 
## $qsec
## [1] 0.1001

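4.4 Measures of correlation

Correlation analysis measures the strength of the linear relationship between two variables measured on metric scales.
A linear relationship, as in the figure below, means that the two variables move together along a straight line: in the same direction for a positive correlation, and in opposite directions for a negative one.

For example, it is useful for checking whether, at a department store, higher service satisfaction goes along with higher revisit intention or purchase amount.
The result of a correlation analysis is expressed as the correlation coefficient. The formula for the correlation coefficient (ρ) is as follows.

\[ \rho=\frac{S_{xy}}{S_x S_y}=\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})/(n-1)}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2/(n-1)}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2/(n-1)}} \]

\[ S_x: \text{standard deviation of } X, \quad S_y: \text{standard deviation of } Y, \quad S_{xy}: \text{covariance of } X \text{ and } Y \]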

cov(mtcars$mpg,mtcars$wt) # covariance

## [1] -5.117

cor(mtcars$wt, mtcars$mpg) # equivalent to cov(wt, mpg)/(sd(wt)*sd(mpg))

## [1] -0.8677
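The relationship between the covariance, the two standard deviations, and the correlation coefficient can be confirmed numerically. The sketch below uses illustrative names (wt_v, mpg_c, r_manual).

```r
wt_v  <- mtcars$wt
mpg_c <- mtcars$mpg

# correlation = covariance divided by the product of the standard deviations
r_manual <- cov(wt_v, mpg_c) / (sd(wt_v) * sd(mpg_c))

all.equal(r_manual, cor(wt_v, mpg_c))   # TRUE
r_manual                                # negative: heavier cars tend to have lower mpg
```

Dividing by the standard deviations removes the units, which is why the coefficient always lies between -1 and 1.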

plot(mtcars$mpg,mtcars$wt) # scatter plot

pairs(mtcars) # scatter plot matrix of the mtcars variables

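4.5 Measures of Relative Position

These measures show where a particular value lies among the given data. The most common measures of relative position are percentiles and z-values (standardized scores).
Percentiles: extend the idea of quartiles by dividing the sorted data into 100 equal parts. The p-th percentile is the value such that p% of the data are less than or equal to it. The first quartile, the median, and the third quartile correspond to the 25th, 50th, and 75th percentiles. Percentiles are used when the sample size is larger than 30.
z-score: measures how many standard deviations a particular value lies from the mean. The z-score is computed as follows.

\[ z = \frac{X_i - \bar{X}}{S}, \quad \bar{X}: \text{mean}, \quad S: \text{standard deviation} \]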

# percentiles
quantile(mtcars$mpg, probs=c(0.1, 0.5, 0.9)) # the probs argument selects which quantiles to compute

##   10%   50%   90% 
## 14.34 19.20 30.09

apply(mtcars[c(1,4,6,7)], 2, median) # apply a function to several columns at once; syntax: apply(X, MARGIN, FUN, ...)

##     mpg      hp      wt    qsec 
##  19.200 123.000   3.325  17.710

apply(mtcars[c(1,4,6,7)], 2, quantile)

##        mpg    hp    wt  qsec
## 0%   10.40  52.0 1.513 14.50
## 25%  15.43  96.5 2.581 16.89
## 50%  19.20 123.0 3.325 17.71
## 75%  22.80 180.0 3.610 18.90
## 100% 33.90 335.0 5.424 22.90

# z-score
zmpg <- scale(mtcars$mpg) # z-score (standardization)
zmpg

##           [,1]
##  [1,]  0.15088
##  [2,]  0.15088
##  [3,]  0.44954
##  [4,]  0.21725
##  [5,] -0.23073
##  [6,] -0.33029
##  [7,] -0.96079
##  [8,]  0.71502
##  [9,]  0.44954
## [10,] -0.14777
## [11,] -0.38006
## [12,] -0.61235
## [13,] -0.46302
## [14,] -0.81146
## [15,] -1.60788
## [16,] -1.60788
## [17,] -0.89442
## [18,]  2.04239
## [19,]  1.71055
## [20,]  2.29127
## [21,]  0.23385
## [22,] -0.76168
## [23,] -0.81146
## [24,] -1.12671
## [25,] -0.14777
## [26,]  1.19619
## [27,]  0.98049
## [28,]  1.71055
## [29,] -0.71191
## [30,] -0.06481
## [31,] -0.84464
## [32,]  0.21725
## attr(,"scaled:center")
## [1] 20.09
## attr(,"scaled:scale")
## [1] 6.027
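
The output of scale() can be reproduced directly from the z-score formula. The following is a minimal sketch; `z_manual` and `z_scale` are names chosen here for illustration.

```r
x <- mtcars$mpg

# z-score by hand: distance from the mean in units of the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops the matrix shape
# and the "scaled:center" / "scaled:scale" attributes
z_scale <- as.numeric(scale(x))

all.equal(z_manual, z_scale)

# Standardized values have mean 0 and standard deviation 1
c(mean = round(mean(z_manual), 10), sd = sd(z_manual))
```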

4.6 Skewness and Kurtosis

n <- length(mtcars$mpg)
m <- mean(mtcars$mpg)
s <- sd(mtcars$mpg) # named s so that the sd() function is not shadowed
hist(mtcars$mpg,breaks = 60) # histogram
skew <- sum((mtcars$mpg-m)^3/s^3)/n ; skew  # skewness > 0: long right tail, < 0: long left tail, = 0: symmetric

## [1] 0.6107

kurt <- sum((mtcars$mpg-m)^4/s^4)/n-3 ; kurt  # excess kurtosis > 0: sharper peak, < 0: flatter peak, = 0: shaped like the normal distribution

## [1] -0.3728

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:prettyR':
## 
##     describe, skew

skew(mtcars$mpg)

## [1] 0.6107

kurtosi(mtcars$mpg)

## [1] -0.3728

4.7 Other Data Summary Functions

The stat.desc() function in the pastecs package
The describe() function in the psych package

library(pastecs)
stat.desc(mtcars[c(1,3:7)])

##                  mpg       disp        hp      drat       wt     qsec
## nbr.val       32.000    32.0000   32.0000  32.00000  32.0000  32.0000
## nbr.null       0.000     0.0000    0.0000   0.00000   0.0000   0.0000
## nbr.na         0.000     0.0000    0.0000   0.00000   0.0000   0.0000
## min           10.400    71.1000   52.0000   2.76000   1.5130  14.5000
## max           33.900   472.0000  335.0000   4.93000   5.4240  22.9000
## range         23.500   400.9000  283.0000   2.17000   3.9110   8.4000
## sum          642.900  7383.1000 4694.0000 115.09000 102.9520 571.1600
## median        19.200   196.3000  123.0000   3.69500   3.3250  17.7100
## mean          20.091   230.7219  146.6875   3.59656   3.2172  17.8487
## SE.mean        1.065    21.9095   12.1203   0.09452   0.1730   0.3159
## CI.mean.0.95   2.173    44.6847   24.7196   0.19277   0.3528   0.6443
## var           36.324 15360.7998 4700.8669   0.28588   0.9574   3.1932
## std.dev        6.027   123.9387   68.5629   0.53468   0.9785   1.7869
## coef.var       0.300     0.5372    0.4674   0.14866   0.3041   0.1001

library(psych)
describe(mtcars[c(1,3:7)])

##      vars  n   mean     sd median trimmed    mad   min    max  range skew
## mpg     1 32  20.09   6.03  19.20   19.70   5.41 10.40  33.90  23.50 0.61
## disp    2 32 230.72 123.94 196.30  222.52 140.48 71.10 472.00 400.90 0.38
## hp      3 32 146.69  68.56 123.00  141.19  77.10 52.00 335.00 283.00 0.73
## drat    4 32   3.60   0.53   3.70    3.58   0.70  2.76   4.93   2.17 0.27
## wt      5 32   3.22   0.98   3.33    3.15   0.77  1.51   5.42   3.91 0.42
## qsec    6 32  17.85   1.79  17.71   17.83   1.42 14.50  22.90   8.40 0.37
##      kurtosis    se
## mpg     -0.37  1.07
## disp    -1.21 21.91
## hp      -0.14 12.12
## drat    -0.71  0.09
## wt      -0.02  0.17
## qsec     0.34  0.32

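5 Visualization

5.1 Why Graphs Matter: Anscombe's Example

In his 1973 paper "Graphs in Statistical Analysis", the English statistician Francis Anscombe gave an example data set that shows why statistical analysis should always combine summary statistics with graphs.
There are four groups made up of the variable pairs (x1, y1), (x2, y2), (x3, y3), (x4, y4). The means and standard deviations are the same across x1 to x4 and across y1 to y4, and the correlation coefficients and fitted regression lines of the four pairs are practically identical.
From these statistics alone it is tempting to conclude that the four samples come from the same population and share the same character and shape, but plotting them shows that their distributions are very different. The datasets package, which ships with base R, includes this data as the data frame 'anscombe'.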

anscombe

##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89

str(anscombe)

## 'data.frame':    11 obs. of  8 variables:
##  $ x1: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x2: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x3: num  10 8 13 9 11 14 6 4 12 7 ...
##  $ x4: num  8 8 8 8 8 8 8 19 8 8 ...
##  $ y1: num  8.04 6.95 7.58 8.81 8.33 ...
##  $ y2: num  9.14 8.14 8.74 8.77 9.26 8.1 6.13 3.1 9.13 7.26 ...
##  $ y3: num  7.46 6.77 12.74 7.11 7.81 ...
##  $ y4: num  6.58 5.76 7.71 8.84 8.47 7.04 5.25 12.5 5.56 7.91 ...

# mean and standard deviation of each variable
options(digits = 3) # number of significant digits to print
sapply(anscombe, mean) # mean

##  x1  x2  x3  x4  y1  y2  y3  y4 
## 9.0 9.0 9.0 9.0 7.5 7.5 7.5 7.5

sapply(anscombe, sd) # standard deviation

##   x1   x2   x3   x4   y1   y2   y3   y4 
## 3.32 3.32 3.32 3.32 2.03 2.03 2.03 2.03

# correlation between x and y for each pair
cor(anscombe$x1,anscombe$y1); cor(anscombe$x2, anscombe$y2); cor(anscombe$x3, anscombe$y3); cor(anscombe$x4, anscombe$y4)

## [1] 0.816

## [1] 0.816

## [1] 0.816

## [1] 0.817

# Simple linear regressions for the 4 groups
lm(anscombe$y1 ~ anscombe$x1); lm(anscombe$y2 ~ anscombe$x2); lm(anscombe$y3 ~ anscombe$x3); lm(anscombe$y4 ~ anscombe$x4)

## 
## Call:
## lm(formula = anscombe$y1 ~ anscombe$x1)
## 
## Coefficients:
## (Intercept)  anscombe$x1  
##         3.0          0.5

## 
## Call:
## lm(formula = anscombe$y2 ~ anscombe$x2)
## 
## Coefficients:
## (Intercept)  anscombe$x2  
##         3.0          0.5

## 
## Call:
## lm(formula = anscombe$y3 ~ anscombe$x3)
## 
## Coefficients:
## (Intercept)  anscombe$x3  
##         3.0          0.5

## 
## Call:
## lm(formula = anscombe$y4 ~ anscombe$x4)
## 
## Coefficients:
## (Intercept)  anscombe$x4  
##         3.0          0.5

# Scatter Plot & Simple Linear Regression Line
op <- par(no.readonly = TRUE)
par(mfrow = c(2,2)) # 2 x 2 layout

plot(anscombe$x1, anscombe$y1); abline(lm(anscombe$y1~anscombe$x1), col = "red", lty = 3)
plot(anscombe$x2, anscombe$y2); abline(lm(anscombe$y2~anscombe$x2), col = "red", lty = 3)
plot(anscombe$x3, anscombe$y3); abline(lm(anscombe$y3~anscombe$x3), col = "red", lty = 3)
plot(anscombe$x4, anscombe$y4); abline(lm(anscombe$y4~anscombe$x4), col = "red", lty = 3)

par(op)
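
The four separate calls above can also be collected into one table, which makes the agreement of the summary statistics easier to see at a glance. This is a sketch only; `fit_pair` and `stats_by_pair` are hypothetical names introduced here.

```r
# Fit y_i ~ x_i for each of the four Anscombe pairs and collect the
# summary statistics in one table. `fit_pair` is a hypothetical helper.
fit_pair <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  c(mean_x    = mean(x),
    mean_y    = mean(y),
    cor_xy    = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]))
}

# sapply() over 1:4 gives a 5 x 4 matrix; t() puts one pair per row
stats_by_pair <- t(sapply(1:4, fit_pair))
rownames(stats_by_pair) <- paste0("pair", 1:4)
round(stats_by_pair, 2)
```

The rows of the table agree to about two decimals, which is exactly why the scatter plots are needed to tell the four pairs apart.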

5.2 Loading ggplot2 and Inspecting the TitanicSurvival Data

Load the ggplot2 package, then load the TitanicSurvival data and check its structure.

library(ggplot2) # attach the package

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(carData) # package that contains the TitanicSurvival data

str(TitanicSurvival)

## 'data.frame':    1309 obs. of  4 variables:
##  $ survived      : Factor w/ 2 levels "no","yes": 2 2 1 1 1 2 2 1 2 1 ...
##  $ sex           : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ...
##  $ age           : num  29 0.917 2 30 25 ...
##  $ passengerClass: Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...

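5.3 Bar Graphs

A bar graph summarizes a discrete variable: the discrete variable goes on the x-axis, and the y-axis shows a frequency (count) or an amount.

# visualize frequencies
library(gridExtra)

g1 <- ggplot(TitanicSurvival,aes(x=sex)) + geom_bar() # map sex to the x-axis; y is not set, so the rows are counted automatically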
g2 <- ggplot(TitanicSurvival,aes(x=survived)) + geom_bar() 
g3 <- ggplot(TitanicSurvival,aes(x=passengerClass)) + geom_bar() 
g4 <- ggplot(TitanicSurvival,aes(x=age)) + geom_histogram(bins=10) 

grid.arrange(g1,g2,g3,g4,nrow=2,ncol=2)

## Warning: Removed 263 rows containing non-finite values (stat_bin).

5.3.1 Bar Plot: Changing Colors

In ggplot2, the col = and fill = options add color to a graph.
Graphs without area, such as points and lines, are colored with the col option; graphs with area are colored with the fill option.

ggplot(TitanicSurvival,aes(x=sex)) + geom_bar(fill = 'royalblue') # fill with royalblue

library(RColorBrewer)
display.brewer.all(colorblindFriendly = TRUE)

brewer.pal(n = 4, name = "Dark2")

## [1] "#1B9E77" "#D95F02" "#7570B3" "#E7298A"

g21 <- ggplot(TitanicSurvival,aes(x=sex)) + geom_bar(fill = "#1B9E77")  
g22 <- ggplot(TitanicSurvival,aes(x=survived)) + geom_bar(fill = "#D95F02") 
g23 <- ggplot(TitanicSurvival,aes(x=passengerClass)) + geom_bar(fill = "#7570B3") 
g24 <- ggplot(TitanicSurvival,aes(x=age)) + geom_histogram(fill = "#E7298A", bins=10) 

grid.arrange(g21,g22,g23,g24,nrow=2,ncol=2)

## Warning: Removed 263 rows containing non-finite values (stat_bin).

5.3.2 Filling by the Value of survived

g31 <- ggplot(TitanicSurvival,aes(x=sex)) + geom_bar(aes(fill=survived)) + scale_fill_brewer(palette = "Dark2")  
g32 <- ggplot(TitanicSurvival,aes(x=passengerClass)) + geom_bar(aes(fill=survived)) + scale_fill_brewer(palette = "Dark2")
g33 <- ggplot(TitanicSurvival,aes(x=age)) + geom_histogram(aes(fill=survived),bins=10) + scale_fill_brewer(palette = "Dark2")
g34 <- ggplot(TitanicSurvival,aes(x=survived)) + geom_bar(aes(fill=passengerClass)) + scale_fill_brewer(palette = "Dark2")

grid.arrange(g31,g32,g33,g34,nrow=2,ncol=2)

## Warning: Removed 263 rows containing non-finite values (stat_bin).

5.3.3 Bar Plot: Legend

Editing the legend title: adding labs(fill = ) changes the title of the legend that describes the fill colors.

# passenger class counts by survival status
ggplot(TitanicSurvival,aes(x=survived)) +  
  geom_bar(aes(fill=passengerClass)) +
  labs(fill = "Divided by passengerClass")

# survival counts by passenger class
ggplot(TitanicSurvival,aes(x=passengerClass)) +  
  geom_bar(aes(fill=survived)) +
  labs(fill = "Divided by survived")

5.3.4 Editing the Legend, Adjusting Its Position, and Setting the Text Size to 10

g41 <- g31 + labs(fill = "Divided by survived") + theme(text = element_text(size = 10), legend.position = "top")
g42 <- g32 + labs(fill = "Divided by survived") + theme(text = element_text(size = 10), legend.position = "top")
g43 <- g33 + labs(fill = "Divided by survived") + theme(text = element_text(size = 10), legend.position = "top")
g44 <- g34 + labs(fill = "Divided by passengerClass") + theme(text = element_text(size = 10), legend.position = "top")

grid.arrange(g41,g42,g43,g44,nrow=2,ncol=2)

## Warning: Removed 263 rows containing non-finite values (stat_bin).

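Note that writing width=0.8 in geom_bar() changes the graph only slightly (the default bar width is 0.9), but the width in position_dodge() inside geom_text() must be set so that each label is placed over its own bar.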

ggplot(TitanicSurvival,aes(x=passengerClass, y=..count.., fill=survived)) +  
  geom_bar(position = "dodge", width=0.8) + 
  geom_text(stat = "count", aes(label=..count..),position = position_dodge(width=0.8), vjust=-0.5) + 
  scale_fill_brewer(palette = "Dark2") +
  labs(fill = "Divided by survived") + 
  theme(text = element_text(size = 10), legend.position = "top")

To place the labels in the middle of each segment of a stacked bar graph, set the position argument of geom_text() to position_stack() and set vjust to 0.5 inside it.

ggplot(data=TitanicSurvival, mapping = aes(x=passengerClass, y=..count.., fill=as.factor(survived))) + 
  geom_bar(position = "stack", width=0.8) + 
  geom_text(stat = "count", aes(label=..count..),position = position_stack(vjust=0.5)) + 
  scale_fill_brewer(palette = "Dark2") +
  labs(fill = "Divided by survived") + 
  theme(text = element_text(size = 10), legend.position = "top")

5.4 Histograms

A histogram divides a continuous variable into intervals of a fixed width along the x-axis and shows the frequency (count) on the y-axis.

# change the bin width and add color
ggplot(TitanicSurvival,aes(x=age)) +
  geom_histogram(binwidth = 5,col='red',fill='royalblue')

## Warning: Removed 263 rows containing non-finite values (stat_bin).

# col changes the outline color and fill changes the fill color.

5.5 Density Plots

A density plot summarizes a continuous variable: the variable goes on the x-axis and the y-axis shows the estimated density, a smoothed counterpart of the histogram.

ggplot(TitanicSurvival,aes(x=age)) + geom_density() # default

## Warning: Removed 263 rows containing non-finite values (stat_density).

ggplot(TitanicSurvival,aes(x=age))+
  geom_density(col='red',fill='royalblue') # col is the outline, fill is the fill

## Warning: Removed 263 rows containing non-finite values (stat_density).

5.6 Box Plots

A box-and-whisker plot shows how the distribution of a continuous variable differs across the levels of a discrete variable, and it plays an important role in exploratory data analysis (EDA).

str(mtcars)

## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : Factor w/ 3 levels "4","6","8": 2 2 1 2 3 2 3 1 1 2 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 1 2 2 2 ...
##  $ am  : Factor w/ 2 levels "0","1": 2 2 2 1 1 1 1 1 1 1 ...
##  $ gear: Factor w/ 3 levels "3","4","5": 2 2 2 1 1 1 1 2 2 2 ...
##  $ carb: Factor w/ 6 levels "1","2","3","4",..: 4 4 1 1 2 1 4 2 2 4 ...

mtcars$cyl<-as.factor(mtcars$cyl)
ggplot(mtcars,aes(x=cyl, y=mpg, group=cyl)) +
  geom_boxplot(aes(fill = cyl))  + xlab("Engine type") + ylab("Fuel economy") + ggtitle("Boxplot") + labs(fill = "Cylinders")

5.6.1 Box Plot: Marking Outliers

# adjust transparency, mark outliers
ggplot(mtcars,aes(x=cyl, y=mpg, group=cyl)) +
  geom_boxplot(aes(fill = cyl),alpha = I(0.4),outlier.colour = 'red') + xlab("Engine type") + ylab("Fuel economy") + ggtitle("Boxplot") + labs(fill = "Cylinders")

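5.7 Scatter Plots

A scatter plot is a two-dimensional graph that shows the relationship between two continuous variables; it is an effective way to understand how variables relate before modeling.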

# scatter plot
h <- ggplot(mtcars,aes(x=qsec,y=mpg))+
  geom_point()
h

5.7.1 Scatter Plot: Changing Point Color

# Change color of points
library(gridExtra)
h1 <- h + geom_point(color="red")    # all points in red
h2 <- h+ geom_point(aes(color=wt))    # wt in the mtcars data is a continuous variable
h3 <- h + geom_point(aes(color=factor(am)))    # am is 0 or 1, so it is converted to a factor; not a continuous variable
grid.arrange(h1,h2,h3,nrow=3,ncol=1)

5.7.2 Scatter Plot: Changing the Default Colors

# Change default colors in color scale
h + geom_point(aes(color=factor(am))) + scale_color_manual(values = c("orange", "purple"))

5.7.3 Scatter Plot: Changing the Default Point Shapes

# like the colors, the default shapes can also be changed
h + geom_point(aes(shape = factor(am))) + scale_shape_manual(values=c(0,2))

5.7.4 Scatter Plot: Changing Point Shape and Size

# change points

h11 <- h + geom_point(size=5)
h12 <- h + geom_point(aes(size=wt))
h13 <- h + geom_point(aes(shape=factor(am)))
grid.arrange(h11, h12, h13, ncol=1)

5.7.5 Scatter Plot: Adding Lines

# Add lines to scatterplot
h21 <- h + geom_point(color="blue") + geom_line()    # connect the points with a line
h22 <- h + geom_point(color="red") + geom_smooth(method="lm",se=TRUE)    # add a regression line (with its standard error band)
h23 <- h + geom_point() + geom_vline(xintercept=18,color="red")    # add a vertical line

grid.arrange(h21,h22,h23,ncol=1)

## `geom_smooth()` using formula 'y ~ x'

5.7.6 Scatter Plot: Changing Line Width and Type

# a line plot can also be drawn, and the width and color of the line can be changed
h24 <- h + geom_point() + geom_line(size=0.9,aes(color=factor(vs)))
h25 <- h + geom_point() + geom_line(size=1.5,aes(color=factor(vs)),linetype="dotted")+ scale_color_manual(values = c("orange", "purple"))
 
grid.arrange(h24,h25,ncol=1)

5.7.7 Scatter Plot: Changing Axis Labels

# Change axis labels
h26 <- h1 + labs(x="Quarter-mile time",y="Miles per Gallon")    # add x- and y-axis labels
h27 <- h1 + theme(axis.title.x=element_text(face="bold",size=20))+labs(x="Quarter-mile time",y="Miles per Gallon")    # style the x-axis label
h28 <- h1 + scale_x_continuous("Quarter-mile time",limits=c(10,25),breaks=seq(10,25,5))    # change the x-axis label and the x-axis range

grid.arrange(h26,h27,h28,ncol=1)

5.7.8 Available Point and Line Types

5.8 Splitting the Plot Area

Combining graphs: drawing several graphs in one figure
In base R, several graphs are combined into one figure with the par() or layout() function.

5.8.1 The par() Function

mfrow=c(nrows,ncols) splits the device into nrows*ncols plots, and the plots are filled in row by row.
mfcol=c(nrows,ncols) works similarly, but the plots are filled in column by column.
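
The difference in fill order can be pictured with two small matrices whose entries are the panel numbers in drawing order. This is only an illustrative sketch; `order_mfrow` and `order_mfcol` are names chosen here.

```r
# Panel numbers in drawing order for a 2 x 2 grid.
# mfrow fills row by row, which matches a matrix filled with byrow = TRUE.
order_mfrow <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)

# mfcol fills column by column, which is R's default matrix fill order.
order_mfcol <- matrix(1:4, nrow = 2, ncol = 2)

order_mfrow
order_mfcol
```

Passing either matrix to layout() would give the same drawing order as the corresponding par() setting.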

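5.8.2 Splitting the Plot Area (2×2)

Arranging four graphs in a 2*2 grid
For example, the following code creates four plots and arranges them in two rows and two columns. It uses the built-in mtcars data.

opar <- par(no.readonly = TRUE)  # save the current settings in opar
par(mfrow = c(2, 2))  # split the device into 2*2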
plot(mtcars$wt, mtcars$mpg, main = "Scatterplot of wt vs. mpg")
plot(mtcars$wt, mtcars$disp, main = "Scatterplot of wt vs disp")
hist(mtcars$wt, main = "Histogram of wt")
boxplot(mtcars$wt, main = "Boxplot of wt")

par(opar) # restore the original settings

5.8.3 Splitting the Plot Area (3×1)

The second example arranges three graphs in three rows and one column.

opar <- par(no.readonly = TRUE)
par(mfrow = c(3, 1))
hist(mtcars$wt)
hist(mtcars$mpg, ann=F)
hist(mtcars$disp, main= "")

# hist() is a high-level function that adds a default title; use main= "" to suppress the title, or ann=FALSE to suppress both the title and the axis labels.
par(opar)

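5.8.4 Splitting the Plot Area with layout() (1,1,2,3)

The layout() function is called as layout(mat), where mat is a matrix object that specifies the positions of the plots.
In the following code, one plot is placed in the first row and two plots are placed in the second row.

layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE))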
hist(mtcars$wt)
hist(mtcars$mpg)
hist(mtcars$disp)

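5.8.5 Fine Control over Plot Sizes

To control the size of each plot more precisely, use the widths= and heights= arguments of the layout() function.
widths = a vector of values for the widths of the columns
heights = a vector of values for the heights of the rows
Relative widths are given as plain numbers; absolute widths (in cm) are given with the lcm() function.
In the following code, one plot is placed in the first row and two plots in the second row. The first row is half as tall as the second row (one third of the total height), and the bottom-right plot is one third as wide as the bottom-left plot (one quarter of the total width).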

layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE), 
       widths = c(3, 1), heights = c(1, 2))
hist(mtcars$wt)
hist(mtcars$mpg)
hist(mtcars$disp)
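
How the relative values in widths and heights translate into shares of the plotting area can be worked out with simple arithmetic. The sketch below assumes the same values as the layout() call above; `width_share` and `height_share` are names introduced here for illustration.

```r
widths  <- c(3, 1)  # relative column widths passed to layout()
heights <- c(1, 2)  # relative row heights passed to layout()

# Relative sizes are shares of the total, so normalise by the sum
width_share  <- widths / sum(widths)
height_share <- heights / sum(heights)

width_share   # share of the total width taken by each column
height_share  # share of the total height taken by each row

# The layout matrix shows which figure occupies which cell:
# figure 1 spans the whole first row, figures 2 and 3 share the second row
matrix(c(1, 1, 2, 3), nrow = 2, ncol = 2, byrow = TRUE)
```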

6 Data Preprocessing

6.1 Loading the IMDB Movie Data

Load the IMDB data, check its structure and types, look at the first row, and review the descriptive statistics.

IMDB <- read.csv("IMDB-Movie-Data.csv")
# inspect the data structure
str(IMDB)

## 'data.frame':    1000 obs. of  12 variables:
##  $ Rank              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title             : chr  "Guardians of the Galaxy" "Prometheus" "Split" "Sing" ...
##  $ Genre             : chr  "Action,Adventure,Sci-Fi" "Adventure,Mystery,Sci-Fi" "Horror,Thriller" "Animation,Comedy,Family" ...
##  $ Description       : chr  "A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control "| __truncated__ "Following clues to the origin of mankind, a team finds a structure on a distant moon, but they soon realize the"| __truncated__ "Three girls are kidnapped by a man with a diagnosed 23 distinct personalities. They must try to escape before t"| __truncated__ "In a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing compe"| __truncated__ ...
##  $ Director          : chr  "James Gunn" "Ridley Scott" "M. Night Shyamalan" "Christophe Lourdelet" ...
##  $ Actors            : chr  "Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana" "Noomi Rapace, Logan Marshall-Green, Michael Fassbender, Charlize Theron" "James McAvoy, Anya Taylor-Joy, Haley Lu Richardson, Jessica Sula" "Matthew McConaughey,Reese Witherspoon, Seth MacFarlane, Scarlett Johansson" ...
##  $ Year              : int  2014 2012 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ Runtime..Minutes. : int  121 124 117 108 123 103 128 89 141 116 ...
##  $ Rating            : num  8.1 7 7.3 7.2 6.2 6.1 8.3 6.4 7.1 7 ...
##  $ Votes             : int  757074 485820 157606 60545 393727 56036 258682 2490 7188 192177 ...
##  $ Revenue..Millions.: num  333 126 138 270 325 ...
##  $ Metascore         : int  76 65 62 59 40 42 93 71 78 41 ...

head(IMDB, 1)

##   Rank                   Title                   Genre
## 1    1 Guardians of the Galaxy Action,Adventure,Sci-Fi
##                                                                                                                       Description
## 1 A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control of the universe.
##     Director                                               Actors Year
## 1 James Gunn Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana 2014
##   Runtime..Minutes. Rating  Votes Revenue..Millions. Metascore
## 1               121    8.1 757074                333        76

summary(IMDB)

##       Rank         Title              Genre           Description       
##  Min.   :   1   Length:1000        Length:1000        Length:1000       
##  1st Qu.: 251   Class :character   Class :character   Class :character  
##  Median : 500   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 500                                                           
##  3rd Qu.: 750                                                           
##  Max.   :1000                                                           
##                                                                         
##    Director            Actors               Year      Runtime..Minutes.
##  Length:1000        Length:1000        Min.   :2006   Min.   : 66      
##  Class :character   Class :character   1st Qu.:2010   1st Qu.:100      
##  Mode  :character   Mode  :character   Median :2014   Median :111      
##                                        Mean   :2013   Mean   :113      
##                                        3rd Qu.:2016   3rd Qu.:123      
##                                        Max.   :2016   Max.   :191      
##                                                                        
##      Rating         Votes         Revenue..Millions.   Metascore    
##  Min.   :1.90   Min.   :     61   Min.   :  0        Min.   : 11.0  
##  1st Qu.:6.20   1st Qu.:  36309   1st Qu.: 13        1st Qu.: 47.0  
##  Median :6.80   Median : 110799   Median : 48        Median : 59.5  
##  Mean   :6.72   Mean   : 169808   Mean   : 83        Mean   : 59.0  
##  3rd Qu.:7.40   3rd Qu.: 239910   3rd Qu.:114        3rd Qu.: 72.0  
##  Max.   :9.00   Max.   :1791916   Max.   :937        Max.   :100.0  
##                                   NA's   :128        NA's   :64

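Variable descriptions

Rank : rank
Title : movie title
Genre : movie genre
Description : movie description
Director : director
Actors : actors
Year : release year
Runtime..Minutes. : running time (minutes)
Rating : rating score
Votes : number of votes
Revenue..Millions. : revenue (millions)
Metascore : Metascore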

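6.2 Handling Missing Values

A missing value is a value that is absent from the data. R marks it as 'NA'; the term null is also used informally, although NULL is a separate object in R.
Missing values get in the way of data analysis.

Missing values cause the following problems.

Removing every missing value can throw away a large amount of data.
Replacing missing values badly can introduce bias into the data.
Handling missing values is where the analyst's own judgment enters most heavily, and this can lead to wrong results.

Handling missing values carefully takes a lot of time. The analysis can only be accurate when the missing values are handled on the basis of the data.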

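6.3 Replacing Missing Values

When changing raw data, it is good practice to keep the original data and store the changes under a new name.

Sometimes missing values need to be replaced in a reasonable way to limit data loss. The basic approaches are as follows.

Continuous variable : replace with the mean
Discrete variable : replace with the mode

Before replacing missing values, always check the following.

The proportion of missing values
The distribution of the data
Whether the variable is related to other variables

If the data are not normally distributed, the median is safer than the mean, because the median is far less affected by outliers.
Replacing missing values by using relationships with other variables takes much more time, but it allows a more refined analysis.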
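
The difference between the mean and the median under an extreme value, and replacement by the mode for a discrete variable, can be seen on a toy example. The values below are made up for illustration and are not taken from the IMDB data.

```r
# Toy continuous values with one extreme value and two missing entries
x <- c(10, 12, 11, 13, 500, NA, 12, NA)

mean(x, na.rm = TRUE)    # pulled upward by the extreme value 500
median(x, na.rm = TRUE)  # stays near the bulk of the data

# Keep the original vector and write the imputed values to new ones
x_mean   <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)
x_median <- ifelse(is.na(x), median(x, na.rm = TRUE), x)

# For a discrete variable, the most frequent value (mode) can be used
g <- c("a", "b", "a", NA, "c", "a", NA)
mode_value <- names(which.max(table(g)))  # table() ignores NA by default
g_mode <- ifelse(is.na(g), mode_value, g)
g_mode
```

R has no built-in function for the statistical mode (mode() returns the storage type), which is why the sketch goes through table().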

6.4 Checking and Replacing Missing Values in the IMDB Data

Count the missing values in the Metascore variable and in every variable.

options(scipen = 100, digits = 4)
# is.na(IMDB$Metascore) # logical test (TRUE, FALSE) for missing values in Metascore
sum(is.na(IMDB$Metascore)) # number of missing values in Metascore

## [1] 64

colSums(is.na(IMDB)) # number of missing values in each variable of IMDB

##               Rank              Title              Genre        Description 
##                  0                  0                  0                  0 
##           Director             Actors               Year  Runtime..Minutes. 
##                  0                  0                  0                  0 
##             Rating              Votes Revenue..Millions.          Metascore 
##                  0                  0                128                 64

mean(IMDB$Metascore,na.rm=TRUE) # mean of Metascore with missing values removed

## [1] 58.99

mean(IMDB$Revenue..Millions.,na.rm = TRUE)

## [1] 82.96

Only the Revenue..Millions. and Metascore variables have missing values, 128 and 64 respectively.
To replace the missing Metascore values with the mean, the mean was computed with the missing values removed, which gives 58.99.

Now the missing values of Metascore and Revenue..Millions. are set to their respective means, 58.99 and 82.96.

IMDB$Metascore2 <- IMDB$Metascore  # copy to a new variable so that the original is preserved
IMDB$Metascore2[is.na(IMDB$Metascore2)] <- 58.99

IMDB$Revenue..Millions.2 <- IMDB$Revenue..Millions.  # copy to a new variable so that the original is preserved
IMDB$Revenue..Millions.2[is.na(IMDB$Revenue..Millions.2)] <- 82.96
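
Typing the rounded means by hand works, but the replacement value can also be computed at the point of use so that it always matches the data. The sketch below is one possible way to do this, not the method used above: `impute_mean` is a hypothetical helper, and `movies` is a made-up stand-in for the IMDB data frame.

```r
# `impute_mean` is a hypothetical helper: it copies column `col` to a new
# column named `<col>2` and fills the NAs there with the observed mean.
impute_mean <- function(data, col) {
  new_col <- paste0(col, "2")
  data[[new_col]] <- data[[col]]
  fill <- mean(data[[col]], na.rm = TRUE)
  data[[new_col]][is.na(data[[new_col]])] <- fill
  data
}

# Toy stand-in for the IMDB data frame (values are made up)
movies <- data.frame(
  Metascore          = c(76, NA, 62, 59, NA),
  Revenue..Millions. = c(333.1, 126.5, NA, 270.3, 325.0)
)

movies <- impute_mean(movies, "Metascore")
movies <- impute_mean(movies, "Revenue..Millions.")

colSums(is.na(movies))  # originals keep their NAs, the "2" columns have none
```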

Missing values can also be removed. See the following approaches.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:gridExtra':
## 
##     combine

## The following objects are masked from 'package:pastecs':
## 
##     first, last

## The following object is masked from 'package:kableExtra':
## 
##     group_rows

## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

IMDB_NArm1 <- IMDB %>% filter(!is.na(Revenue..Millions.)) # drop rows where Revenue..Millions. is missing
IMDB_NArm2 <- IMDB %>% filter(!is.na(Revenue..Millions.), !is.na(Metascore))
# drop rows where either Revenue..Millions. or Metascore is missing
IMDB_NArm <- na.omit(IMDB) # drop every row that has a missing value

str(IMDB_NArm1)

## 'data.frame':    872 obs. of  14 variables:
##  $ Rank               : int  1 2 3 4 5 6 7 9 10 11 ...
##  $ Title              : chr  "Guardians of the Galaxy" "Prometheus" "Split" "Sing" ...
##  $ Genre              : chr  "Action,Adventure,Sci-Fi" "Adventure,Mystery,Sci-Fi" "Horror,Thriller" "Animation,Comedy,Family" ...
##  $ Description        : chr  "A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control "| __truncated__ "Following clues to the origin of mankind, a team finds a structure on a distant moon, but they soon realize the"| __truncated__ "Three girls are kidnapped by a man with a diagnosed 23 distinct personalities. They must try to escape before t"| __truncated__ "In a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing compe"| __truncated__ ...
##  $ Director           : chr  "James Gunn" "Ridley Scott" "M. Night Shyamalan" "Christophe Lourdelet" ...
##  $ Actors             : chr  "Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana" "Noomi Rapace, Logan Marshall-Green, Michael Fassbender, Charlize Theron" "James McAvoy, Anya Taylor-Joy, Haley Lu Richardson, Jessica Sula" "Matthew McConaughey,Reese Witherspoon, Seth MacFarlane, Scarlett Johansson" ...
##  $ Year               : int  2014 2012 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ Runtime..Minutes.  : int  121 124 117 108 123 103 128 141 116 133 ...
##  $ Rating             : num  8.1 7 7.3 7.2 6.2 6.1 8.3 7.1 7 7.5 ...
##  $ Votes              : int  757074 485820 157606 60545 393727 56036 258682 7188 192177 232072 ...
##  $ Revenue..Millions. : num  333 126 138 270 325 ...
##  $ Metascore          : int  76 65 62 59 40 42 93 78 41 66 ...
##  $ Metascore2         : num  76 65 62 59 40 42 93 78 41 66 ...
##  $ Revenue..Millions.2: num  333 126 138 270 325 ...

str(IMDB_NArm2)

## 'data.frame':    838 obs. of  14 variables:
##  $ Rank               : int  1 2 3 4 5 6 7 9 10 11 ...
##  $ Title              : chr  "Guardians of the Galaxy" "Prometheus" "Split" "Sing" ...
##  $ Genre              : chr  "Action,Adventure,Sci-Fi" "Adventure,Mystery,Sci-Fi" "Horror,Thriller" "Animation,Comedy,Family" ...
##  $ Description        : chr  "A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control "| __truncated__ "Following clues to the origin of mankind, a team finds a structure on a distant moon, but they soon realize the"| __truncated__ "Three girls are kidnapped by a man with a diagnosed 23 distinct personalities. They must try to escape before t"| __truncated__ "In a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing compe"| __truncated__ ...
##  $ Director           : chr  "James Gunn" "Ridley Scott" "M. Night Shyamalan" "Christophe Lourdelet" ...
##  $ Actors             : chr  "Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana" "Noomi Rapace, Logan Marshall-Green, Michael Fassbender, Charlize Theron" "James McAvoy, Anya Taylor-Joy, Haley Lu Richardson, Jessica Sula" "Matthew McConaughey,Reese Witherspoon, Seth MacFarlane, Scarlett Johansson" ...
##  $ Year               : int  2014 2012 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ Runtime..Minutes.  : int  121 124 117 108 123 103 128 141 116 133 ...
##  $ Rating             : num  8.1 7 7.3 7.2 6.2 6.1 8.3 7.1 7 7.5 ...
##  $ Votes              : int  757074 485820 157606 60545 393727 56036 258682 7188 192177 232072 ...
##  $ Revenue..Millions. : num  333 126 138 270 325 ...
##  $ Metascore          : int  76 65 62 59 40 42 93 78 41 66 ...
##  $ Metascore2         : num  76 65 62 59 40 42 93 78 41 66 ...
##  $ Revenue..Millions.2: num  333 126 138 270 325 ...

str(IMDB_NArm)

## 'data.frame':    838 obs. of  14 variables:
##  $ Rank               : int  1 2 3 4 5 6 7 9 10 11 ...
##  $ Title              : chr  "Guardians of the Galaxy" "Prometheus" "Split" "Sing" ...
##  $ Genre              : chr  "Action,Adventure,Sci-Fi" "Adventure,Mystery,Sci-Fi" "Horror,Thriller" "Animation,Comedy,Family" ...
##  $ Description        : chr  "A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control "| __truncated__ "Following clues to the origin of mankind, a team finds a structure on a distant moon, but they soon realize the"| __truncated__ "Three girls are kidnapped by a man with a diagnosed 23 distinct personalities. They must try to escape before t"| __truncated__ "In a city of humanoid animals, a hustling theater impresario's attempt to save his theater with a singing compe"| __truncated__ ...
##  $ Director           : chr  "James Gunn" "Ridley Scott" "M. Night Shyamalan" "Christophe Lourdelet" ...
##  $ Actors             : chr  "Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana" "Noomi Rapace, Logan Marshall-Green, Michael Fassbender, Charlize Theron" "James McAvoy, Anya Taylor-Joy, Haley Lu Richardson, Jessica Sula" "Matthew McConaughey,Reese Witherspoon, Seth MacFarlane, Scarlett Johansson" ...
##  $ Year               : int  2014 2012 2016 2016 2016 2016 2016 2016 2016 2016 ...
##  $ Runtime..Minutes.  : int  121 124 117 108 123 103 128 141 116 133 ...
##  $ Rating             : num  8.1 7 7.3 7.2 6.2 6.1 8.3 7.1 7 7.5 ...
##  $ Votes              : int  757074 485820 157606 60545 393727 56036 258682 7188 192177 232072 ...
##  $ Revenue..Millions. : num  333 126 138 270 325 ...
##  $ Metascore          : int  76 65 62 59 40 42 93 78 41 66 ...
##  $ Metascore2         : num  76 65 62 59 40 42 93 78 41 66 ...
##  $ Revenue..Millions.2: num  333 126 138 270 325 ...
##  - attr(*, "na.action")= 'omit' Named int [1:162] 8 23 26 27 28 40 43 48 50 62 ...
##   ..- attr(*, "names")= chr [1:162] "8" "23" "26" "27" ...

IMDB_NArm2 and IMDB_NArm turn out to be the same data.
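
One caveat when testing several columns for missing values: wrapping a combined expression such as `a & b` in is.na() can miss rows, because `&` treats 0 as FALSE and FALSE & NA evaluates to FALSE rather than NA. The toy data frame below is made up to show the effect; testing each column separately or using complete.cases() avoids it.

```r
# With &, a zero on one side hides an NA on the other side
0 & NA   # FALSE, not NA
5 & NA   # NA

# Toy data: row 2 has revenue 0 and a missing score
toy <- data.frame(revenue = c(10, 0, NA, 7),
                  score   = c(50, NA, 60, NA))

# Testing NA on the combined expression misses row 2
keep_combined <- !is.na(toy$revenue & toy$score)

# Testing each column separately, or using complete.cases(), does not
keep_each     <- !is.na(toy$revenue) & !is.na(toy$score)
keep_complete <- complete.cases(toy[c("revenue", "score")])

toy[keep_combined, ]  # still contains the row with the missing score
toy[keep_complete, ]  # only the fully observed row
```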

6.5 Removing Duplicate Data

Create example data that contains duplicates.

a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c(1,1,2,4,1,1,2,2)
df <-data.frame(a,b); df

##   a b
## 1 A 1
## 2 A 1
## 3 A 2
## 4 B 4
## 5 B 1
## 6 B 1
## 7 C 2
## 8 C 2

Duplicates are checked and filtered as follows.

duplicated(df) # check whether each row is a duplicate

## [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE

df[duplicated(df), ] # duplicated rows only

##   a b
## 2 A 1
## 6 B 1
## 8 C 2

df[!duplicated(df), ]  # non-duplicated rows only

##   a b
## 1 A 1
## 3 A 2
## 4 B 4
## 5 B 1
## 7 C 2

unique(df) # non-duplicated rows only

##   a b
## 1 A 1
## 3 A 2
## 4 B 4
## 5 B 1
## 7 C 2

df[!duplicated(df$a), ]

##   a b
## 1 A 1
## 4 B 4
## 7 C 2

# removes duplicates by the value of a; only the first row for each value is kept, so sort the data first so that the row to keep comes first
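
The sort-then-deduplicate step can be sketched as follows, assuming the goal is to keep the row with the largest b for each value of a. That goal is an assumption made for this example; `df_sorted` and `df_max` are names introduced here.

```r
a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
b <- c(1, 1, 2, 4, 1, 1, 2, 2)
df <- data.frame(a, b)

# Sort by a, then by b in decreasing order, so the wanted row comes first
df_sorted <- df[order(df$a, -df$b), ]

# duplicated() marks every occurrence after the first as TRUE,
# so negating it keeps the first (largest-b) row of each group
df_max <- df_sorted[!duplicated(df_sorted$a), ]
df_max
```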