Supervised Learning Algorithms 2 - Decision Tree, KNN, Neural Network

2. Supervised Learning Algorithms (2)

2.1 Decision Tree

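Structure: root node + child nodes + parent nodes + terminal (leaf) nodes
Training a decision tree is the process of dividing the training data set according to a suitable splitting criterion or split test.
It is a recursive partitioning method: splitting repeats recursively until purity can no longer be improved, the number of observations in a terminal node reaches a preset minimum, or the depth of a node reaches a preset limit.
This approach is called top-down induction of decision trees.
At each split, the variable that raises purity the most is chosen first.
Common measures of purity are the Gini index and information gain (based on entropy).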

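Gini index

The probability that two observations drawn with replacement belong to the same class, i.e. the sum of squared class proportions.
With two classes, it approaches 1 as the node becomes purer and 0.5 as the classes become evenly mixed.
The split is made on the variable with the highest purity (lowest impurity).
Purity = 1 - impurity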
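
As a minimal sketch of the definition above, the purity can be computed as the sum of squared class proportions, and the Gini impurity as one minus that value. The helper name `gini_purity` and the toy label vectors are illustrative and do not come from the original notes.

```r
# Probability that two draws with replacement fall in the same class
gini_purity <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions
  sum(p^2)
}

pure  <- c(1, 1, 1, 1)   # a single class
mixed <- c(0, 0, 1, 1)   # two classes, evenly split

gini_purity(pure)        # 1
gini_purity(mixed)       # 0.5
1 - gini_purity(mixed)   # Gini impurity: 0.5
```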

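Entropy

If both classes have probability 0.5, the entropy (base 2) is 1.
If the node is completely pure, containing only one class, the entropy is 0.
Information gain is the entropy before the split minus the entropy after the split (weighted by node size).
The smaller the post-split entropy (the purer the resulting nodes), the larger the information gain.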
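
The entropy and information gain described above can be sketched in a few lines of base R. The function names `entropy` and `info_gain` and the toy vectors are assumptions made for illustration; tree packages compute their split criteria internally and may differ in detail.

```r
# Shannon entropy (base 2) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                      # guard against 0 * log2(0)
  -sum(p * log2(p))
}

# Information gain = entropy before the split - weighted entropy after it
info_gain <- function(labels, group) {
  parts <- split(labels, group)
  after <- sum(sapply(parts, function(g) length(g) / length(labels) * entropy(g)))
  entropy(labels) - after
}

y <- c(0, 0, 1, 1)
entropy(y)                            # 1 (both classes at 0.5)
entropy(c(1, 1, 1, 1))                # 0 (completely pure)
info_gain(y, c("a", "a", "b", "b"))   # 1 (the split separates the classes perfectly)
info_gain(y, c("a", "b", "a", "b"))   # 0 (the split tells us nothing)
```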

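2.1.1 Data preparation

# Load the data, keep one product number, and select columns 2 to 11
autoparts=read.csv("data/autoparts.csv",header = TRUE)
autoparts1=autoparts[autoparts$prod_no=="90784-76001",c(2:11)]
# Drop rows with extreme c_thickness values (1000 or more)
autoparts2=autoparts1[autoparts1$c_thickness<1000,]

# Flag a part as faulty (1) when c_thickness is below 20 or above 32
autoparts2$y_faulty=ifelse((autoparts2$c_thickness<20)|(autoparts2$c_thickness>32),1,0)

t_index=sample(1:nrow(autoparts2),size=nrow(autoparts2)*0.7) # sample row indices (70%)
train=autoparts2[t_index,]
test=autoparts2[-t_index,]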
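
Because `sample()` draws random indices, each run produces a different split, so the numbers reported in these notes will not reproduce exactly. Below is a minimal sketch of a reproducible 70/30 split. The seed value (123) is arbitrary, and the toy data frame `toy` is a stand-in for `autoparts2`; neither appears in the original notes.

```r
set.seed(123)                          # arbitrary seed, chosen only for illustration
toy <- data.frame(x = rnorm(10), y = rbinom(10, 1, 0.5))

n_train <- floor(nrow(toy) * 0.7)      # make the 70% size an explicit integer
idx     <- sample(seq_len(nrow(toy)), size = n_train)
train_toy <- toy[idx, ]
test_toy  <- toy[-idx, ]

nrow(train_toy)   # 7
nrow(test_toy)    # 3
```

Calling `set.seed()` once before the split is enough; rerunning the script then yields the same train and test rows.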

2.1.2 Model building

library(tree)
m=tree(factor(y_faulty)~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
  +load_time+highpressure_time,data=train)

plot(m)
text(m)

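2.1.3 Pruning

More branches raise accuracy on the training data, but the model becomes more complex.

prune.m=prune.tree(m,method = "misclass")
# Prune using the misclassification count as the cost measure.
# Leaving the default (deviance) makes little difference.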

plot(prune.m)

9 terminal nodes

prune.m=prune.tree(m,best=9)
plot(prune.m)
text(prune.m)

3 terminal nodes

prune.m2=prune.tree(m,best=3)
plot(prune.m2)
text(prune.m2)

2.1.4 Predicting on the test data

yhat_test=predict(prune.m,test,type="class")
yhat_test2=predict(prune.m2,test,type="class")

9 terminal nodes

library(caret)
confusionMatrix(yhat_test,as.factor(test$y_faulty))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5387  258
##          1  315  571
##                                          
##                Accuracy : 0.9123         
##                  95% CI : (0.9051, 0.919)
##     No Information Rate : 0.8731         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.6155         
##  Mcnemar's Test P-Value : 0.01931        
##                                          
##             Sensitivity : 0.9448         
##             Specificity : 0.6888         
##          Pos Pred Value : 0.9543         
##          Neg Pred Value : 0.6445         
##              Prevalence : 0.8731         
##          Detection Rate : 0.8248         
##    Detection Prevalence : 0.8643         
##       Balanced Accuracy : 0.8168         
##                                          
##        'Positive' Class : 0              
##

3 terminal nodes

confusionMatrix(yhat_test2,as.factor(test$y_faulty))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5574  675
##          1  128  154
##                                           
##                Accuracy : 0.877           
##                  95% CI : (0.8688, 0.8849)
##     No Information Rate : 0.8731          
##     P-Value [Acc > NIR] : 0.1717          
##                                           
##                   Kappa : 0.2274          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9776          
##             Specificity : 0.1858          
##          Pos Pred Value : 0.8920          
##          Neg Pred Value : 0.5461          
##              Prevalence : 0.8731          
##          Detection Rate : 0.8535          
##    Detection Prevalence : 0.9568          
##       Balanced Accuracy : 0.5817          
##                                           
##        'Positive' Class : 0               
##
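
The statistics in these outputs can be recomputed by hand from the four cell counts. The sketch below copies the counts from the 9-terminal-node matrix above; the variable names are illustrative. Because the 'Positive' class is 0 (non-faulty), the share of faulty parts that the tree catches is reported as Specificity, which is 0.6888 for the larger tree and 0.1858 for the smaller one.

```r
# Counts copied from the 9-terminal-node confusion matrix
# (rows = prediction, columns = reference; class 0 is the 'positive' class)
cm <- matrix(c(5387, 315, 258, 571), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # non-faulty parts predicted as non-faulty
specificity <- cm["1", "1"] / sum(cm[, "1"])   # faulty parts predicted as faulty
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy, sensitivity, specificity, balanced), 4)
# 0.9123 0.9448 0.6888 0.8168
```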

2.1.5 ROC, AUC

9 terminal nodes

library(Epi)
ROC(test=yhat_test,stat=test$y_faulty,plot="ROC",AUC=T,main="Tree")

3 terminal nodes

ROC(test=yhat_test2,stat=test$y_faulty,plot="ROC",AUC=T,main="Tree")

2.1.6 Prediction (1): a single observation

new.data=data.frame(fix_time=87,a_speed=0.609,b_speed=1.715,separation=242.7,s_separation=657.5,rate_terms=95,mpa=78,load_time=18.1,highpressure_time=82)
predict(prune.m,newdata = new.data,type="class")

## [1] 0
## Levels: 0 1

predict(prune.m2,newdata = new.data,type="class")

## [1] 0
## Levels: 0 1

2.1.7 Prediction (2): multiple observations

Vectors

new.data=data.frame(fix_time=c(87,85.6),a_speed=c(0.609,0.472),b_speed=c(1.715,1.685),separation=c(242.7,243.4),s_separation=c(657.5,657.9),rate_terms=c(95,95),mpa=c(78,28.8),load_time=c(18.1,18.2),highpressure_time=c(82,60))
predict(prune.m,newdata = new.data,type="class")

## [1] 0 1
## Levels: 0 1

predict(prune.m2,newdata = new.data,type="class")

## [1] 0 1
## Levels: 0 1

Data frame

new.data=data.frame(fix_time=test$fix_time,a_speed=test$a_speed,b_speed=test$b_speed,separation=test$separation,s_separation=test$s_separation,rate_terms=test$rate_terms,mpa=test$mpa,load_time=test$load_time,highpressure_time=test$highpressure_time)
head(predict(prune.m,newdata = new.data,type="class"))

## [1] 0 0 0 0 0 1
## Levels: 0 1

2.1.8 Decision tree for a multiclass response

Data preparation

autoparts2$g_class=as.factor(ifelse(autoparts2$c_thickness<20,1,ifelse(autoparts2$c_thickness<32,2,3)))
t_index=sample(1:nrow(autoparts2),size=nrow(autoparts2)*0.7) # sample row indices (70%)
train=autoparts2[t_index,]
test=autoparts2[-t_index,]

Model building

m=tree(g_class~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
  +load_time+highpressure_time,data=train)
plot(m)
text(m)

Predicting on the test data

yhat_test=predict(m,test,type="class")
confusionMatrix(yhat_test,as.factor(test$g_class))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3
##          1  427  138    0
##          2  211 5393    8
##          3    3  125  226
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9257         
##                  95% CI : (0.9191, 0.932)
##     No Information Rate : 0.866          
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6974         
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity           0.66615   0.9535  0.96581
## Specificity           0.97657   0.7497  0.97967
## Pos Pred Value        0.75575   0.9610  0.63842
## Neg Pred Value        0.96413   0.7138  0.99870
## Prevalence            0.09815   0.8660  0.03583
## Detection Rate        0.06538   0.8258  0.03460
## Detection Prevalence  0.08651   0.8593  0.05420
## Balanced Accuracy     0.82136   0.8516  0.97274

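2.1.9 Decision tree for a continuous response

Because the response is continuous, type="class" is not used.
A confusion table built with table() does not apply either; use an error measure such as MSE instead.

Model building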

m=tree(c_thickness~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
  +load_time+highpressure_time,data=train)
plot(m)
text(m)

Predicting on the test data

yhat_test=predict(m,test)
head(yhat_test)

##        5        6        9       12       19       22 
## 22.52917 22.52917 22.52917 22.52917 22.52917 33.13450

head(test$c_thickness)

## [1] 24.5 22.9 22.2 23.1 25.4 34.1

Prediction

new.data=data.frame(fix_time=c(87,85.6),a_speed=c(0.609,0.472),b_speed=c(1.715,1.685),separation=c(242.7,243.4),s_separation=c(657.5,657.9),rate_terms=c(95,95),mpa=c(78,28.8),load_time=c(18.1,18.2),highpressure_time=c(82,60))
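# Predict thickness with the regression tree m fitted above
predict(m,newdata = new.data)

# For comparison, the classification tree prune.m from 2.1.3 returns class probabilities
predict(prune.m,newdata = new.data)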

##        [,1]       [,2]
## 0 0.9337106 0.06628941
## 1 0.4170854 0.58291457

2.1.10 Exercise

The rpart package

library(rpart)
rpart=rpart(as.factor(y_faulty)~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
  +load_time+highpressure_time,data=train)
plot(rpart)
text(rpart)

printcp(rpart)

## 
## Classification tree:
## rpart(formula = as.factor(y_faulty) ~ fix_time + a_speed + b_speed + 
##     separation + s_separation + rate_terms + mpa + load_time + 
##     highpressure_time, data = train)
## 
## Variables actually used in tree construction:
## [1] b_speed      load_time    mpa          rate_terms   s_separation
## [6] separation  
## 
## Root node error: 1973/15236 = 0.1295
## 
## n= 15236 
## 
##          CP nsplit rel error  xerror     xstd
## 1  0.122656      0   1.00000 1.00000 0.021005
## 2  0.045109      2   0.75469 0.76128 0.018650
## 3  0.030917      3   0.70958 0.71566 0.018141
## 4  0.024835      4   0.67866 0.69285 0.017879
## 5  0.017739      7   0.60416 0.63153 0.017144
## 6  0.013431      8   0.58642 0.58642 0.016573
## 7  0.012924     13   0.51647 0.52965 0.015813
## 8  0.012164     20   0.42220 0.51597 0.015622
## 9  0.011151     21   0.41004 0.47846 0.015082
## 10 0.010644     23   0.38773 0.44247 0.014540
## 11 0.010000     24   0.37709 0.44146 0.014524

plotcp(rpart)

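# Refit with a larger complexity parameter (cp=0.02 instead of the default 0.01), which yields a smaller tree
rpart.prune=rpart(as.factor(y_faulty)~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
  +load_time+highpressure_time,data=train,control=rpart.control(cp=0.02))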
plot(rpart.prune)
text(rpart.prune,pretty = 2)
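
Refitting with a larger `cp` is one way to get a smaller tree. Another option in rpart is `prune()`, which cuts back an already fitted tree to a given `cp`. Below is a minimal sketch that picks the `cp` with the lowest cross-validated error (the `xerror` column shown by `printcp()`). It uses the `kyphosis` data set that ships with rpart only to stay self-contained; the names `fit`, `best_cp`, and `pruned` are illustrative.

```r
library(rpart)

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# Pick the cp with the lowest cross-validated error from the cp table
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Count the leaves before and after; pruning never adds leaves
n_leaves_fit    <- sum(fit$frame$var == "<leaf>")
n_leaves_pruned <- sum(pruned$frame$var == "<leaf>")
n_leaves_pruned <= n_leaves_fit
```

Since `xerror` comes from random cross-validation folds, the selected `cp` can vary between runs unless a seed is set.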

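2.2 KNN (K-Nearest Neighbors classifier)

A machine learning technique that classifies a new point using the K points closest to it.
If K is too small, the prediction is overly influenced by a handful of nearby points.
If K is too large, distant and unrelated points affect the classification, and the influence of the nearby points that matter is diluted.
Choose the K with the lowest misclassification rate by cross-validation.
K is usually taken to be odd, which avoids tied votes in two-class problems.
It can be used with both categorical and continuous response variables.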
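
To make the voting rule concrete, here is a minimal base-R sketch of KNN classification for a single query point. The function `knn_one` and the toy data are hypothetical and are not part of the `class` package used later; the sketch uses Euclidean distance, so in practice predictors on very different scales would dominate the distance unless they are rescaled first.

```r
# Classify one query point by majority vote among its k nearest training points
knn_one <- function(train_x, train_y, query, k = 3) {
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  nearest <- order(d)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]   # ties go to the first label in table order
}

train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7), ncol = 2, byrow = TRUE)
train_y <- c("a", "a", "a", "b", "b", "b")

knn_one(train_x, train_y, c(1.5, 1.5), k = 3)   # "a"
knn_one(train_x, train_y, c(6.5, 6.5), k = 3)   # "b"
```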

2.2.1 Data preparation

Loading the data

autoparts=read.csv("data/autoparts.csv",header = TRUE)
autoparts1=autoparts[autoparts$prod_no=="90784-76001",c(2:11)]
autoparts2=autoparts1[autoparts1$c_thickness<1000,]
autoparts2$y_faulty=ifelse((autoparts2$c_thickness<20)|(autoparts2$c_thickness>32),1,0)

Splitting into train / test sets

t_index=sample(1:nrow(autoparts2),size=nrow(autoparts2)*0.7) # sample row indices (70%)
train=autoparts2[t_index,]
test=autoparts2[-t_index,]

2.2.2 Preparing the arguments

Training data matrix and response variable

xmat.train=as.matrix(train[1:9])
y_faulty.train=train$y_faulty
head(xmat.train)

##       fix_time a_speed b_speed separation s_separation rate_terms  mpa
## 2009      85.2   0.551   1.630      260.3        641.4         80 78.3
## 9183      86.0   0.612   1.655      244.0        651.8         81 77.6
## 15264     81.3   0.640   1.589      184.7        715.9         87 75.0
## 8147      89.9   0.606   1.697      252.3        649.8         80 78.4
## 4334      85.1   0.601   1.658      249.9        651.5         79 78.3
## 2272      85.3   0.553   1.609      261.8        635.8         79 79.4
##       load_time highpressure_time
## 2009       18.1                59
## 9183       18.2                59
## 15264      19.2                66
## 8147       18.1                76
## 4334       18.1                57
## 2272       18.2                61

Test data matrix

xmat.test=as.matrix(test[1:9])
head(xmat.test)

##    fix_time a_speed b_speed separation s_separation rate_terms  mpa
## 3      86.0   0.609   1.715      242.7        657.5         95 78.0
## 5      86.1   0.603   1.704      242.5        657.3         95 77.9
## 6      86.3   0.606   1.707      244.5        656.9         95 77.9
## 7      86.5   0.606   1.701      243.1        656.9         95 78.2
## 10     86.0   0.608   1.696      248.0        657.3         95 77.5
## 12     86.5   0.606   1.692      243.8        657.4         95 77.8
##    load_time highpressure_time
## 3       18.1                82
## 5       18.2                56
## 6       18.0                78
## 7       18.1                55
## 10      18.1                60
## 12      18.1                51

2.2.3 Generating predictions

library(class)
yhat_test=knn(xmat.train,xmat.test,as.factor(y_faulty.train),k=3)
head(yhat_test,100)

##   [1] 0 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1
##  [36] 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
##  [71] 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1
## Levels: 0 1

tab=table(real=test$y_faulty,predict=yhat_test) # named tab to avoid masking the table() function
tab

##     predict
## real    0    1
##    0 5576  103
##    1  212  640

confusionMatrix(yhat_test,as.factor(test$y_faulty))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5576  212
##          1  103  640
##                                           
##                Accuracy : 0.9518          
##                  95% CI : (0.9463, 0.9568)
##     No Information Rate : 0.8695          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7752          
##  Mcnemar's Test P-Value : 1.164e-09       
##                                           
##             Sensitivity : 0.9819          
##             Specificity : 0.7512          
##          Pos Pred Value : 0.9634          
##          Neg Pred Value : 0.8614          
##              Prevalence : 0.8695          
##          Detection Rate : 0.8538          
##    Detection Prevalence : 0.8862          
##       Balanced Accuracy : 0.8665          
##                                           
##        'Positive' Class : 0               
##

2.2.4 Finding the optimal k

library(e1071)
tune.out=tune.knn(x=xmat.train,y=as.factor(y_faulty.train),k=1:10)
tune.out

## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  k
##  5
## 
## - best performance: 0.0450254

plot(tune.out)

Rerunning knn with the optimal k=5

library(class)
yhat_test=knn(xmat.train,xmat.test,as.factor(y_faulty.train),k=5)
head(yhat_test,100)

##   [1] 0 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
##  [36] 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
##  [71] 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1
## Levels: 0 1

confusionMatrix(yhat_test,as.factor(test$y_faulty))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5584  201
##          1   95  651
##                                           
##                Accuracy : 0.9547          
##                  95% CI : (0.9493, 0.9596)
##     No Information Rate : 0.8695          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7891          
##  Mcnemar's Test P-Value : 1.041e-09       
##                                           
##             Sensitivity : 0.9833          
##             Specificity : 0.7641          
##          Pos Pred Value : 0.9653          
##          Neg Pred Value : 0.8727          
##              Prevalence : 0.8695          
##          Detection Rate : 0.8550          
##    Detection Prevalence : 0.8858          
##       Balanced Accuracy : 0.8737          
##                                           
##        'Positive' Class : 0               
##

2.2.5 ROC, AUC

library(Epi)
ROC(test=yhat_test,stat=test$y_faulty,plot="ROC",AUC=T,main="KNN")

2.2.6 Predicting new data

new.data=data.frame(fix_time=c(87,85.6),a_speed=c(0.609,0.472),b_speed=c(1.715,1.685),separation=c(242.7,243.4),s_separation=c(657.5,657.9),rate_terms=c(95,95),mpa=c(78,28.8),load_time=c(18.1,18.2),highpressure_time=c(82,60))
knn(xmat.train,new.data,y_faulty.train,k=5)

## [1] 0 1
## Levels: 0 1

2.2.7 When the response is multiclass

Same as the binary case.

2.2.8 When the response is continuous: FNN::knn.reg()

autoparts=read.csv("data/autoparts.csv",header = TRUE)
autoparts1=autoparts[autoparts$prod_no=="90784-76001",c(2:11)]
autoparts2=autoparts1[autoparts1$c_thickness<1000,]
t_index=sample(1:nrow(autoparts2),size=nrow(autoparts2)*0.7) # sample row indices (70%)
train=autoparts2[t_index,]
test=autoparts2[-t_index,]

xmat.train=as.matrix(train[1:9])
c_thickness.train=train$c_thickness
xmat.test=as.matrix(test[1:9])

library(FNN)

## 
## Attaching package: 'FNN'

## The following objects are masked from 'package:class':
## 
##     knn, knn.cv

yhat_test=knn.reg(xmat.train,xmat.test,c_thickness.train,k=3)
mse=mean((yhat_test$pred-test$c_thickness)^2)
mse

## [1] 1.625529

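2.3 Neural Network

A model built to mimic the human brain.
Consists of an input layer, hidden layer(s), and an output layer.
A typical function for determining a node's response to its input (the activation function) is the sigmoid.
Because it is hard to see how information is combined in the hidden layers, the way a result was reached is difficult to explain, and the model is difficult to adjust.
Very sensitive to variable selection.
Learning in an artificial neural network is the process of adjusting the weights assigned to the links between nodes.
The weights are adjusted repeatedly so that the error keeps decreasing.
Because the weights are adjusted backward from the output nodes, the procedure is called the backpropagation algorithm.
More input variables mean more input nodes, and more nodes mean more weights to estimate. As the number of weights grows, overfitting becomes more likely: predictions may be accurate on the train data yet poor on the test data.
It is therefore necessary to select a minimal set of key variables that are strongly related to the response.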
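
As a rough illustration of how inputs, weights, and the sigmoid fit together, the sketch below runs one forward pass through a tiny network with two inputs, two hidden nodes, and one output. All weights are made-up numbers and the variable names are assumptions; training (backpropagation) would adjust `W1`, `b1`, `w2`, and `b2` to reduce the error.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass through one hidden layer; every weight here is a made-up number
x  <- c(0.5, -1.0)                                   # two inputs
W1 <- matrix(c( 0.1, 0.4,
               -0.3, 0.2), nrow = 2, byrow = TRUE)   # rows = hidden nodes, columns = inputs
b1 <- c(0, 0.1)                                      # hidden-layer biases
w2 <- c(0.7, -0.5)                                   # hidden-to-output weights
b2 <- 0.05                                           # output bias

h   <- sigmoid(as.vector(W1 %*% x) + b1)             # hidden-layer activations
out <- sigmoid(sum(w2 * h) + b2)                     # network output, always in (0, 1)

sigmoid(0)   # 0.5
```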

2.3.1 Data preparation

autoparts=read.csv("data/autoparts.csv",header = TRUE)
autoparts1=autoparts[autoparts$prod_no=="90784-76001",c(2:11)]
autoparts2=autoparts1[autoparts1$c_thickness<1000,]
autoparts2$g_class=as.factor(ifelse(autoparts2$c_thickness<20,1,ifelse(autoparts2$c_thickness<32,2,3)))
t_index=sample(1:nrow(autoparts2),size=nrow(autoparts2)*0.7) # sample row indices (70%)
train=autoparts2[t_index,]
test=autoparts2[-t_index,]

2.3.2 Model building

library(nnet)
m=nnet(g_class~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
  +load_time+highpressure_time,data=train,size=10)

## # weights:  133
## initial  value 19970.351493 
## iter  10 value 7036.254440
## final  value 7035.997379 
## converged
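
The `# weights:  133` line in the output above can be checked by hand, which also shows how quickly the number of weights grows with the number of inputs. This is a quick arithmetic sketch, assuming a single hidden layer with a bias term on each hidden and output unit and no skip-layer connections; the variable names are illustrative.

```r
n_input  <- 9    # predictors in the formula
n_hidden <- 10   # size=10
n_output <- 3    # levels of g_class

# Each hidden unit and each output unit also has a bias term
n_weights <- (n_input + 1) * n_hidden + (n_hidden + 1) * n_output
n_weights   # 133
```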

2.3.3 Performance evaluation

yhat_test=predict(m,test,type="class")
tab=table(real=test$g_class,predict=yhat_test) # named tab to avoid masking the table() function
tab

##     predict
## real    2    3
##    1  638    0
##    2 5678    1
##    3  213    1
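
In the table above the network assigns almost every test case to class 2, the majority class. One commonly suggested remedy is to rescale the predictors to a common range before fitting, since they differ widely in magnitude (for example a_speed near 0.6 and s_separation near 650). Whether this helps on this data set is not tested in these notes. Below is a minimal sketch of min-max scaling that learns the ranges on the training set only, using toy numbers in place of the real data.

```r
# Toy predictors on very different scales (values are illustrative only)
train_x <- cbind(a_speed = c(0.55, 0.61, 0.64), s_separation = c(641.4, 651.8, 715.9))
test_x  <- cbind(a_speed = 0.60, s_separation = 657.5)

# Learn the minimum and range on the training set, then apply them to both sets
mins   <- apply(train_x, 2, min)
ranges <- apply(train_x, 2, max) - mins

train_s <- scale(train_x, center = mins, scale = ranges)
test_s  <- scale(test_x,  center = mins, scale = ranges)

range(train_s[, "a_speed"])        # 0 1
range(train_s[, "s_separation"])   # 0 1
```

Reusing the training-set minimum and range for the test set keeps information from the test data out of the fitting step.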

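2.3.4 Visualization

library(reshape2)
library(devtools)
# Source a script from a GitHub gist that provides the plotting code for nnet objects (needs internet access)
source_url("https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_updata.r")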

## SHA-1 hash of file is 74c80bd5ddbc17ab3ae5ece9c0ed9beb612e87ef

plot(m)

## Loading required package: scales

## Loading required package: reshape

## 
## Attaching package: 'reshape'

## The following objects are masked from 'package:reshape2':
## 
##     colsplit, melt, recast

## The following object is masked from 'package:class':
## 
##     condense

Konkuk University, Department of Statistics, 백광렬 - 2018 Big Data Young Talent program (빅데이터 청년인재)

July 17, 2018 (Day 12)