Decision Tree

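Decision Trees

Decision tree model for a binary response variable

A decision tree model is an analysis technique that arranges the decision rules for a target variable in a tree structure and classifies observations by following those rules.
It takes the form of conditional statements: the model is built by repeatedly splitting branches according to whether each condition is met.
The clearer the classification conditions, the more useful the model; it is widely used in expert systems.
The results are easy to interpret and understand.
Root node: the node where the tree begins; it contains all of the data under analysis.
Child node: one of the two or more nodes that branch off from a single node; the data are divided among them according to the node's splitting condition.
Parent node: the node directly above a child node.
Terminal node (leaf): the node at the end of each branch; a tree model yields as many classification rules as it has terminal nodes.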
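
To make the "conditional statement" view concrete, here is a minimal sketch of a tiny tree written as nested `if`/`else` in R. The function name `classify_part`, its two inputs, and both thresholds are hypothetical and chosen only for illustration; they are not splits learned from the autoparts data.

```r
# A hand-written tree with one root split and one further split.
# All thresholds below are made up for illustration.
classify_part <- function(mpa, load_time) {
  if (mpa < 70) {                 # root node: first condition
    "faulty"                      # terminal node 1
  } else if (load_time < 19) {    # child node: second condition
    "normal"                      # terminal node 2
  } else {
    "faulty"                      # terminal node 3
  }
}

classify_part(mpa = 65, load_time = 20)
classify_part(mpa = 75, load_time = 18)
classify_part(mpa = 75, load_time = 21)
```

The three terminal nodes correspond to three rules, one for each path from the root to a leaf.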

Loading packages

library(tree)   # decision tree fitting and pruning
library(Epi)    # ROC() for the ROC curve and AUC

Loading the file

autoparts <- read.csv("C:/Users/user/Desktop/JBTP/autoparts.csv", header=TRUE)
dim(autoparts)

## [1] 34139    17

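# Keep one product only and drop the first seven columns
autoparts1 <- autoparts[autoparts$prod_no=="90784-76001", -c(1:7)]
# Remove rows with implausibly large thickness values
autoparts2 <- autoparts1[autoparts1$c_thickness < 1000, ]
# Label a part as faulty (1) when its thickness is below 20 or above 32
autoparts2$y_faulty <- ifelse((autoparts2$c_thickness<20)|(autoparts2$c_thickness>32),1,0)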

Splitting the data (train, test)

t_index <- sample(1:nrow(autoparts2), size=nrow(autoparts2)*0.7)
train <- autoparts2[t_index, ]
test <- autoparts2[-t_index, ]
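
The split above calls `sample()` without a seed, so every run produces a different partition and slightly different results. A minimal sketch of a reproducible split follows; the toy data frame `toy` stands in for `autoparts2`, and the seed value is arbitrary. Both are assumptions made for illustration.

```r
# Toy stand-in for autoparts2 (values are made up)
set.seed(1)                                   # fix the random number stream
toy <- data.frame(id = 1:10, y_faulty = rep(c(0, 1), times = c(8, 2)))

n_train <- floor(nrow(toy) * 0.7)             # make the training size an explicit integer
idx <- sample(seq_len(nrow(toy)), size = n_train)
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

nrow(toy_train)                               # 7 rows for training
nrow(toy_test)                                # 3 rows for testing
length(intersect(toy_train$id, toy_test$id))  # 0: the two sets do not overlap
```

With the seed fixed, rerunning the script gives the same rows in each set, which makes accuracy figures comparable across runs.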

Building the model

m <- tree(factor(y_faulty) ~ fix_time+a_speed+b_speed+separation+s_separation+
             rate_terms+mpa+load_time+highpressure_time, data=train)
plot(m)
text(m)

***

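Pruning

# Analysis for choosing the pruning level (decide by inspecting the plot)
prune.m <- prune.tree(m, method = "misclass")
plot(prune.m)          # a tree that is too deep overfits

prune.m <- prune.tree(m, best=4)  # here a tree with 4 terminal nodes fits well
plot(prune.m)
text(prune.m)          # label the pruned tree, not the original m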

***

ROC

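# ROC and AUC cannot be computed for a continuous response
# They apply to categorical responses (binary directly; multiclass through extensions such as one-vs-rest)

# Cross-validation for Choosing Tree Complexity
# dev (number of misclassifications) is lowest at tree sizes 12 and 11
# (1280 in the output below; the value changes between runs because the CV folds are random)
# By this criterion a size of 11 would be chosen; note that the prediction step below prunes to best=4
cv.tree(m, FUN = prune.misclass)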

## $size
## [1] 12 11  9  5  4  3  1
## 
## $dev
## [1] 1280 1280 1427 1427 1443 1536 1996
## 
## $k
## [1]   -Inf   0.00  27.50  27.75  57.00 113.00 235.50
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
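
Reading the best size off the printed output can also be done programmatically. The sketch below copies the `$size` and `$dev` vectors from the output above into two variables (the names `cv_size` and `cv_dev` are my own) and picks the smallest tree among those tied for the lowest cross-validated error. With a real fit, these vectors would come from the object returned by `cv.tree()`.

```r
# Values copied from the cv.tree() output shown above
cv_size <- c(12, 11, 9, 5, 4, 3, 1)
cv_dev  <- c(1280, 1280, 1427, 1427, 1443, 1536, 1996)

best_sizes <- cv_size[cv_dev == min(cv_dev)]  # sizes tied for the lowest error
best_sizes                                    # 12 11
chosen <- min(best_sizes)                     # prefer the smaller tree among ties
chosen                                        # 11
```

Preferring the smaller of the tied sizes is a common parsimony convention, not something the `tree` package enforces.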

# Predict on the test data
prune.m <- prune.misclass(m, best=4)  # same as prune.tree(m, best=4, method="misclass")
yhat_test <- predict(prune.m, test, type="class")
table <- table(real=test$y_faulty, predict=yhat_test);table

##     predict
## real    0    1
##    0 5504  181
##    1  398  448

# Accuracy
(table[1,1]+table[2,2])/sum(table)

## [1] 0.9113459
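
Accuracy alone can hide how well the rarer faulty class is detected. As a small sketch, the confusion matrix printed above can be re-entered by hand (the name `cm` is my own) to compute sensitivity, specificity, and precision alongside accuracy.

```r
# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm <- matrix(c(5504, 398, 181, 448), nrow = 2,
             dimnames = list(real = c("0", "1"), predict = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)         # share of all parts classified correctly
sensitivity <- cm["1", "1"] / sum(cm["1", ])   # share of faulty parts that were caught
specificity <- cm["0", "0"] / sum(cm["0", ])   # share of normal parts kept as normal
precision   <- cm["1", "1"] / sum(cm[, "1"])   # share of "faulty" predictions that are right

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 3)
```

Although accuracy is about 0.91, sensitivity is only about 0.53, so roughly half of the faulty parts in the test set are missed by this tree.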

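# ROC (hard class labels give a single operating point; class probabilities,
# e.g. predict(prune.m, test)[, "1"], would trace a fuller curve)
ROC(test=yhat_test, stat=test$y_faulty, plot="ROC", AUC=TRUE, main="Decision Tree")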
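
To show what an ROC curve and its AUC measure, here is a minimal base-R sketch that builds the curve from scores by sweeping a threshold and integrates it with the trapezoid rule. The scores, labels, and the helper names `roc_points` and `auc_trapezoid` are hypothetical; this is an illustration of the idea, not how `Epi::ROC()` is implemented.

```r
# Made-up scores (e.g. predicted probability of class 1) and true labels
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
truth <- c(1,   1,   0,   1,   0,    0,   1,   0)

# One (FPR, TPR) point per distinct threshold, starting from (0, 0)
roc_points <- function(score, truth) {
  thr <- sort(unique(score), decreasing = TRUE)
  tpr <- sapply(thr, function(t) sum(score >= t & truth == 1) / sum(truth == 1))
  fpr <- sapply(thr, function(t) sum(score >= t & truth == 0) / sum(truth == 0))
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))
}

# Area under the piecewise-linear curve
auc_trapezoid <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

roc <- roc_points(score, truth)
auc <- auc_trapezoid(roc)
auc   # 0.75
```

The value 0.75 equals the share of (faulty, normal) pairs in which the faulty case receives the higher score: 12 of the 16 pairs here.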

***

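Decision tree model for a multiclass response variable

Preparing the data

# Prepare the multiclass response: 1 = below 20, 2 = 20 up to (not including) 32, 3 = 32 or above
# (note: exactly 32 falls in class 3 here, while y_faulty above treats 32 as normal)
autoparts2$g_class <- as.factor(ifelse(autoparts2$c_thickness < 20, 1, 
                                       ifelse(autoparts2$c_thickness < 32, 2, 3)))
# Split into train and test data
t_index <- sample(1:nrow(autoparts2), size=nrow(autoparts2)*0.7)
train <- autoparts2[t_index, ]
test <- autoparts2[-t_index, ]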
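
The nested `ifelse()` used to build the three classes can also be written with base R's `cut()`, which scales better as the number of classes grows. The sketch below checks on made-up thickness values that both forms agree; the vector `thickness` and the variable names are assumptions for illustration.

```r
thickness <- c(15, 19.9, 20, 25, 31.9, 32, 40)   # made-up values

# Nested ifelse, same logic as the g_class definition
nested <- ifelse(thickness < 20, 1, ifelse(thickness < 32, 2, 3))

# right = FALSE makes intervals closed on the left: [-Inf,20), [20,32), [32,Inf)
binned <- cut(thickness, breaks = c(-Inf, 20, 32, Inf),
              labels = c("1", "2", "3"), right = FALSE)

data.frame(thickness, nested, binned)
all(nested == as.integer(as.character(binned)))  # TRUE
```

`cut()` returns a factor directly, so no separate `as.factor()` call is needed.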

Building the model

m <- tree(g_class ~ fix_time+a_speed+b_speed+separation+s_separation+
            rate_terms+mpa+load_time+highpressure_time, data=train)
plot(m)
text(m)

***

Predicting on the test data

yhat_test <- predict(m, test, type="class")
table <- table(real=test$g_class, predict=yhat_test);table

##     predict
## real    1    2    3
##    1  376  261    2
##    2  186 5359  124
##    3    0    9  214

Accuracy

(table[1,1]+table[2,2]+table[3,3])/sum(table)

## [1] 0.9108865
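
With three classes, `diag()` generalizes the accuracy formula, and per-class recall and precision show which classes are hard to separate. The sketch below re-enters the confusion matrix printed above by hand; the name `cm3` is my own.

```r
# Confusion matrix copied from the output above (filled column by column)
cm3 <- matrix(c(376,  186,   0,
                261, 5359,   9,
                  2,  124, 214), nrow = 3,
              dimnames = list(real = c("1", "2", "3"), predict = c("1", "2", "3")))

accuracy  <- sum(diag(cm3)) / sum(cm3)   # works for any number of classes
recall    <- diag(cm3) / rowSums(cm3)    # per actual class
precision <- diag(cm3) / colSums(cm3)    # per predicted class

accuracy
round(rbind(recall, precision), 3)
```

Class 2 is recovered well, while recall for class 1 is only about 0.59, a detail the overall accuracy of about 0.91 does not reveal.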

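Decision tree model for a continuous response variable

Is there any real reason to use a tree here instead of regression analysis? It is mostly useful for exploring the data.

Building the model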

m <- tree(c_thickness ~ fix_time+a_speed+b_speed+separation+s_separation+
            rate_terms+mpa+load_time+highpressure_time, data=train)
plot(m)
text(m)
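
For a continuous response, a regression tree predicts the mean response within each terminal node, and splits are chosen to reduce the within-node sum of squared errors. The following is a minimal single-split sketch of that idea in base R on made-up data. The vectors `x` and `y` and the helper `sse` are hypothetical, and this is not the `tree` package's actual implementation.

```r
x <- 1:8
y <- c(5, 5, 6, 5, 10, 11, 10, 11)   # made-up response with a jump after x = 4

# Sum of squared errors around the node mean
sse <- function(v) sum((v - mean(v))^2)

# Candidate thresholds: midpoints between consecutive sorted x values
xs   <- sort(unique(x))
cand <- (head(xs, -1) + tail(xs, -1)) / 2

# Total error after splitting at each candidate threshold
total_sse <- sapply(cand, function(t) sse(y[x < t]) + sse(y[x >= t]))

best <- cand[which.min(total_sse)]
best                                                      # 4.5
c(left = mean(y[x < best]), right = mean(y[x >= best]))   # node predictions
```

A full tree repeats this search recursively inside each resulting node and over every predictor.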

***

Predicting on the test data

yhat_test <- predict(m, test)
plot(x=test$c_thickness, y=yhat_test, main="Decision Tree")

mse <- mean((yhat_test - test$c_thickness)^2)
mse

## [1] 4.642822
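
An MSE is easier to judge on the scale of the response and against a naive baseline. The sketch below uses made-up vectors (`actual` and `predicted` are hypothetical) to compute the RMSE and an R-squared-style comparison against always predicting the mean; here the baseline uses the mean of the evaluation data for simplicity.

```r
actual    <- c(24, 26, 28, 30)   # made-up thickness values
predicted <- c(25, 25, 29, 29)   # made-up predictions

mse  <- mean((predicted - actual)^2)
rmse <- sqrt(mse)                                 # same unit as the response
mse_baseline <- mean((mean(actual) - actual)^2)   # error of always predicting the mean
r2   <- 1 - mse / mse_baseline                    # share of baseline error removed

c(mse = mse, rmse = rmse, baseline = mse_baseline, r2 = r2)
```

Applied to the result above, an MSE of 4.642822 corresponds to an RMSE of about 2.15 in thickness units.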

updragon

August 6, 2018
