1. Supervised Learning Algorithms

1.1 ROC & AUC

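A good model keeps its true positive rate (TP) high even at a high threshold: although the cutoff is strict, many rows that are actually TRUE are still predicted as TRUE by the model.
The false positive rate (FP), the misclassification rate on actual negatives, should be small at high thresholds and grow as the threshold is lowered; a model that behaves this way is working normally.

AUC is the area under the ROC curve; the larger it is, the better the model.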
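The threshold sweep described above can be sketched in base R. This is only an illustration: the `scores` and `labels` vectors are invented, and the AUC is computed with the trapezoidal rule rather than with any particular package.

```r
# Hypothetical scores and labels (1 = positive), invented for illustration only.
scores <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10)
labels <- c(1,    1,    0,    1,    1,    0,    0,    0)

# Sweep the threshold from high to low; Inf gives the (0, 0) starting point.
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) sum(scores >= t & labels == 1) / sum(labels == 1))
fpr <- sapply(thresholds, function(t) sum(scores >= t & labels == 0) / sum(labels == 0))

# AUC by the trapezoidal rule over the (FPR, TPR) points.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

print(data.frame(threshold = thresholds, tpr = tpr, fpr = fpr))
print(auc)
```

Both rates can only rise as the threshold falls, which is why the curve runs from (0, 0) to (1, 1).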

1.2 Support Vector Machine

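A machine learning technique used to classify data that fall into two or more groups.
It is regarded as one of the most accurate classification algorithms, performs well across a wide range of classification problems, and is known to be relatively insensitive to outliers.
Margin: the distance between the data and the decision boundary
Support vector: the data points closest to the decision boundary, which determine the margin
Hyperplane: the decision boundary drawn using these two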
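To make the three terms concrete, here is a small base R sketch. It does not fit an SVM: the hyperplane (`w`, `b`) and the four points are fixed by hand as assumptions, and the code only measures distances to that boundary.

```r
# Hypothetical 2-D hyperplane w . x + b = 0 and four points, chosen for illustration.
w <- c(1, 1)
b <- -3
X <- rbind(c(0, 0), c(1, 1), c(3, 2), c(4, 4))

# Signed distance of each point to the hyperplane: (w . x + b) / ||w||.
signed_dist <- as.vector(X %*% w + b) / sqrt(sum(w^2))

side    <- ifelse(signed_dist >= 0, 1, -1)  # which side of the boundary each point is on
nearest <- which.min(abs(signed_dist))      # index of the point closest to the boundary
margin  <- min(abs(signed_dist))            # its distance to the boundary

print(signed_dist)
print(nearest)
print(margin)
```

An actual SVM goes one step further and chooses `w` and `b` so that this smallest distance is as large as possible.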

1.3 Supervised Learning

1.3.1 Data Preparation

autoparts <- read.csv("autoparts.csv", header = TRUE)
autoparts1 <- autoparts[autoparts$prod_no == "90784-76001", c(2:11)]
autoparts2 <- autoparts1[autoparts1$c_thickness < 1000, ]

autoparts2$y_faulty <- ifelse((autoparts2$c_thickness < 20) | (autoparts2$c_thickness > 32), 1, 0)

1.3.2 Splitting the Dataset

t_index <- sample(1:nrow(autoparts2), size = nrow(autoparts2) * 0.7)
train <- autoparts2[t_index, ]    # training data (70%)
test <- autoparts2[-t_index, ]    # test data (30%)
nrow(train); nrow(test)

## [1] 15236

## [1] 6531

head(train)

##       fix_time a_speed b_speed separation s_separation rate_terms  mpa
## 8743      85.7   0.601   1.673      248.5        651.4         81 80.3
## 9883      85.0   0.603   1.715      248.8        651.8         93 79.4
## 17626     80.9   0.663   1.680      186.1        712.6         84 72.8
## 7812      85.7   0.601   1.677      249.0        649.9         81 77.8
## 15949     82.0   0.642   1.660      188.5        712.9         87 75.5
## 33339     82.7   0.596   1.603      224.4        675.9         84 72.9
##       load_time highpressure_time c_thickness y_faulty
## 8743       18.1                59        24.5        0
## 9883       18.1                63        23.8        0
## 17626      19.2                65        24.9        0
## 7812       18.1                59        25.5        0
## 15949      19.3                65        22.2        0
## 33339      20.2                62        23.4        0
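Because `sample()` is random, the row counts above are reproducible but the selected rows are not. A minimal sketch of a reproducible split follows, assuming a small stand-in data frame `df` (hypothetical) and an arbitrary seed of 123; neither appears in the original analysis.

```r
# Hypothetical stand-in for autoparts2; only the row count matters here.
df <- data.frame(id = 1:100, value = seq(0.5, 50, by = 0.5))

set.seed(123)  # any fixed integer works; 123 is an arbitrary choice
idx <- sample(seq_len(nrow(df)), size = floor(nrow(df) * 0.7))
train_demo <- df[idx, ]
test_demo  <- df[-idx, ]

# Resetting the seed reproduces exactly the same split.
set.seed(123)
idx_again <- sample(seq_len(nrow(df)), size = floor(nrow(df) * 0.7))

print(c(nrow(train_demo), nrow(test_demo)))
print(identical(idx, idx_again))
```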

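1.3.3 Finding Optimal Parameter Values

-The higher the cost, the more likely the model is to fit outliers -> overfitting
-The lower the cost, the less likely the model is to fit outliers -> underfitting
-A suitable cost lies somewhere between these two extremes.
-The larger gamma is, the smaller the spread of the kernel around each observation (each observation's influence becomes stronger but more local) and the greater the curvature of the boundary -> overfitting
-The smaller gamma is, the larger the spread of the kernel around each observation (each observation's influence becomes weaker but reaches further) and the lower the curvature of the boundary -> underfitting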
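The effect of gamma can be seen directly from the radial kernel, which e1071 documents as exp(-gamma * |u - v|^2). The sketch below evaluates it for a few made-up distances `d` over the gamma grid 0.5, 1, 2; the distances are assumptions for illustration.

```r
# Radial (RBF) kernel as documented for e1071: exp(-gamma * |u - v|^2).
rbf <- function(d, gamma) exp(-gamma * d^2)

d      <- c(0, 0.5, 1, 2)   # hypothetical distances between two observations
gammas <- 2^(-1:1)          # gamma grid: 0.5, 1, 2

# One column per gamma, one row per distance.
k <- sapply(gammas, function(g) rbf(d, g))
dimnames(k) <- list(paste0("d=", d), paste0("gamma=", gammas))
print(round(k, 4))
```

At any fixed distance the similarity shrinks as gamma grows, so each observation influences a smaller neighborhood and the boundary can bend more sharply.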

1.3.4 Computing the Optimal Values with a Function

The result is gamma = 2, cost = 16.

install.packages("e1071")
library(e1071)
tune.svm(factor(y_faulty) ~ fix_time + a_speed + b_speed + separation + s_separation + 
           rate_terms + mpa + load_time + highpressure_time,
         data = autoparts2, gamma = 2^(-1:1), cost = 2^(2:4))
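For orientation, the candidate grid passed to `tune.svm` can be enumerated in base R. This sketch only lists the combinations; it does not run the cross-validated search that `tune.svm` performs for each of them.

```r
# Every (gamma, cost) pair in the search grid: 3 x 3 = 9 candidates.
grid <- expand.grid(gamma = 2^(-1:1), cost = 2^(2:4))
print(grid)
print(nrow(grid))

# Locate the reported optimum (gamma = 2, cost = 16) within the grid.
best <- grid[grid$gamma == 2 & grid$cost == 16, ]
print(best)
```

The reported optimum sits at the upper edge of both ranges, so it may be worth checking a wider grid before treating these values as final.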

1.3.5 Model 1: radial kernel

Build the model on the training data.

library(e1071)
m <- svm(factor(y_faulty) ~ fix_time + a_speed + b_speed + separation + 
           s_separation + rate_terms + mpa + load_time + highpressure_time, 
         data = train, gamma = 2, cost = 16)

1.3.6 Predictions and Accuracy

-Defective parts are treated as the positive class (because predicting a defective part as normal carries the greater risk)

yhat_test <- predict(m, test)
table <- table(real = test$y_faulty, predict = yhat_test)
table

##     predict
## real    0    1
##    0 5585  104
##    1  168  674

(table[1,1] + table[2,2]) / sum(table)

## [1] 0.9583525
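Since defective parts are the positive class, recall and precision are worth reading alongside accuracy. The sketch below recomputes them from the confusion matrix printed above; the variable names (`cm`, `tp`, and so on) are introduced here for illustration.

```r
# Confusion matrix copied from the output above (rows = real, columns = predict).
cm <- matrix(c(5585, 168, 104, 674), nrow = 2,
             dimnames = list(real = c("0", "1"), predict = c("0", "1")))

tp <- cm["1", "1"]; fn <- cm["1", "0"]
fp <- cm["0", "1"]; tn <- cm["0", "0"]

accuracy  <- (tp + tn) / sum(cm)
recall    <- tp / (tp + fn)   # share of defective parts that were caught
precision <- tp / (tp + fp)   # share of predicted defects that were truly defective
print(c(accuracy = accuracy, recall = recall, precision = precision))
```

The 168 false negatives are the costly case named above: defective parts predicted as normal.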

1.3.7 Model 2: linear kernel

m <- svm(factor(y_faulty) ~ fix_time + a_speed + b_speed + separation + s_separation + 
           rate_terms + mpa + load_time + highpressure_time, 
         data = train, gamma = 2, cost = 16, kernel = "linear")

yhat_test <- predict(m, test)
table <- table(real = test$y_faulty, predict = yhat_test);table

##     predict
## real    0    1
##    0 5592   97
##    1  453  389

(table[1,1] + table[2,2]) / sum(table)  # performs worse than the first model

## [1] 0.9157863

1.3.8 Model 3: default kernel (baseline model)

No gamma or cost values are specified.

m <- svm(factor(y_faulty) ~ fix_time + a_speed + b_speed + separation + s_separation + rate_terms + mpa + load_time + highpressure_time, data = train)

yhat_test <- predict(m, test)
table <- table(real = test$y_faulty, predict = yhat_test);table

##     predict
## real    0    1
##    0 5607   82
##    1  469  373

The result falls short of the model built with the tuned parameter values.

(table[1,1] + table[2,2]) / sum(table)

## [1] 0.9156331
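The three models can be placed side by side using the counts from their confusion matrices. This is a sketch in base R; the `results` data frame and its column names are introduced here for illustration only.

```r
# Counts copied from the three confusion matrices above: TN, FP, FN, TP.
results <- data.frame(
  model = c("radial (tuned)", "linear", "default"),
  tn = c(5585, 5592, 5607),
  fp = c(104, 97, 82),
  fn = c(168, 453, 469),
  tp = c(674, 389, 373)
)

results$accuracy <- with(results, (tp + tn) / (tp + tn + fp + fn))
results$recall   <- with(results, tp / (tp + fn))
print(results[, c("model", "accuracy", "recall")], digits = 4)
```

Accuracy differs by only a few points across the models, while the share of defective parts detected differs much more, because defective parts are a small fraction of the test set.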

1.3.9 ROC & AUC

#install.packages("Epi")
library(Epi)

m <- svm(factor(y_faulty) ~ fix_time + a_speed + b_speed + separation + s_separation +
           rate_terms + mpa + load_time + highpressure_time, data = train, gamma = 2, cost = 16)

yhat_test <- predict(m, test)

ROC(test = yhat_test, stat = test$y_faulty, plot = "ROC", AUC = T, main = "SVM")

1.3.10 Predicting New Data

Predicting a single observation

new.data <- data.frame(fix_time = 87, a_speed = 0.609, b_speed = 1.715, 
                       separation = 242.7, s_separation = 657.5, rate_terms = 95, 
                       mpa = 78, load_time = 18.1, highpressure_time = 82)

predict(m, newdata = new.data)  # the single new observation is classified as normal

## 1 
## 0 
## Levels: 0 1

Predicting multiple observations 1 (vectors)

new.data <- data.frame(fix_time = c(87, 85.7) , a_speed = c(0.609, 0.472), 
                       b_speed = c(1.715, 1.685), separation = c(242.7, 243.4), 
                       s_separation = c(657.5, 657.9), rate_terms = c(95, 95), 
                       mpa = c(78, 28.2), load_time = c(18.1,18.2), highpressure_time = c(82,60))

predict(m, newdata = new.data)    # new observations 1 and 2 are both classified as normal

## 1 2 
## 0 0 
## Levels: 0 1

Predicting multiple observations 2 (data frame)

-The same approach is used for a dealer churn prediction model
-Here we use the data that was set aside earlier as test data

new.data <- data.frame(fix_time = test$fix_time, a_speed = test$a_speed,
                       b_speed = test$b_speed, separation = test$separation, 
                       s_separation = test$s_separation, rate_terms = test$rate_terms,
                       mpa = test$mpa, load_time = test$load_time, 
                       highpressure_time = test$highpressure_time)

head(predict(m, newdata = new.data))

## 1 2 3 4 5 6 
## 0 0 0 0 1 1 
## Levels: 0 1
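Building `new.data` column by column is equivalent to selecting the predictor columns by name. A minimal sketch follows, assuming a tiny stand-in data frame `test_demo` with only two of the real predictors; both names are hypothetical.

```r
# Hypothetical stand-in for `test`; the columns are a subset of the real ones.
test_demo <- data.frame(fix_time = c(85.7, 85.0, 80.9),
                        a_speed = c(0.601, 0.603, 0.663),
                        c_thickness = c(24.5, 23.8, 24.9))
predictors <- c("fix_time", "a_speed")

# Column-by-column construction, as in the text above.
by_hand <- data.frame(fix_time = test_demo$fix_time, a_speed = test_demo$a_speed)

# Equivalent selection by name; drop = FALSE keeps a data frame even for one column.
by_name <- test_demo[, predictors, drop = FALSE]

same <- isTRUE(all.equal(by_hand, by_name, check.attributes = FALSE))
print(same)
```

With nine predictors the by-name form is shorter and avoids mistyping a column.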

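1.4 Support Vector Regression

Support vector machines can also be applied when the dependent variable is continuous (SVM for a discrete dependent variable, SVR for a continuous one).
When the dependent variable is continuous, the method produces results similar to regression analysis.
This is called the support vector regression model.
Where the classifier finds the line that maximizes the gap between observations belonging to different classes, SVR connects the observations with such a line, which makes it possible to predict continuous values.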
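One way SVR differs from ordinary regression is its epsilon-insensitive loss, which ignores errors smaller than epsilon. The sketch below is a base R illustration, not e1071's internal code; the `y` and `yhat` values are invented, and 0.1 is used because it is the default `epsilon` of `svm()` in e1071.

```r
# Epsilon-insensitive loss: errors smaller than epsilon cost nothing.
eps_loss <- function(y, yhat, epsilon = 0.1) pmax(0, abs(y - yhat) - epsilon)

y    <- c(24.5, 23.8, 24.9, 25.5)    # hypothetical actual values
yhat <- c(24.45, 24.3, 24.9, 24.0)   # hypothetical predictions

loss <- eps_loss(y, yhat)
print(loss)
print((y - yhat)^2)   # squared error, for comparison
```

The first prediction is off by 0.05, which lies inside the epsilon tube and therefore contributes no loss, whereas squared error penalizes every deviation.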

1.4.1 Data Preparation

autoparts <- read.csv("autoparts.csv", header = TRUE)
autoparts1 <- autoparts[autoparts$prod_no == "90784-76001", c(2:11)]
autoparts2 <- autoparts1[autoparts1$c_thickness < 1000, ]

autoparts2$y_faulty <- ifelse((autoparts2$c_thickness < 20) | (autoparts2$c_thickness > 32), 1, 0)

t_index <- sample(1:nrow(autoparts2), size = nrow(autoparts2) * 0.7)
train <- autoparts2[t_index, ]
test <- autoparts2[-t_index, ]
NROW(train);NROW(test)

## [1] 15236

## [1] 6531

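1.4.2 Building the Regression Model

The dependent variable is now continuous (c_thickness).

# Grid search over gamma and cost with tune.svm, as in 1.3.4
tune.svm(c_thickness ~ fix_time + a_speed + b_speed + separation + s_separation + 
           rate_terms + mpa + load_time + highpressure_time, 
         data = train, gamma = 2^(-1:1), cost = 2^(2:4))

Apply the resulting values, gamma = 2 and cost = 16.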

library(e1071)
m <- svm(c_thickness ~ fix_time + a_speed + b_speed + separation + s_separation + 
           rate_terms + mpa + load_time + highpressure_time, 
         data = train, gamma = 2, cost = 16)

1.4.3 Plot

yhat_test <- predict(m, test)
plot(x = test$c_thickness, y = yhat_test, main = "SVR")

mse <- mean((yhat_test - test$c_thickness)^2); mse

## [1] 1.71277
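MSE is expressed in squared units, so RMSE and MAE are often reported next to it. A small base R sketch with invented `actual` and `predicted` vectors:

```r
actual    <- c(24.5, 23.8, 24.9, 25.5, 22.2)   # hypothetical actual values
predicted <- c(24.0, 24.1, 25.3, 24.9, 23.0)   # hypothetical predictions

err  <- predicted - actual
mse  <- mean(err^2)
rmse <- sqrt(mse)        # same unit as the response
mae  <- mean(abs(err))   # less sensitive to a few large errors
print(c(mse = mse, rmse = rmse, mae = mae))
```

Applied to the result above, an MSE of 1.71277 corresponds to an RMSE of about 1.31 in units of c_thickness.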

1.4.4 Comparison with Multiple Linear Regression

m2 <- lm(c_thickness ~ fix_time + a_speed + b_speed + separation + s_separation +
           rate_terms + mpa + load_time + highpressure_time, data = train)

yhat_test <- predict(m2, test)
plot(x = test$c_thickness, y = yhat_test, main = "lm", xlab = "Actual", ylab = "Predicted")

mse <- mean((yhat_test - test$c_thickness)^2)

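SVR shows large variance on both the left and right sides of the predictions, whereas the multiple linear regression model shows large variance only on the right side.

1.4.5 Comparison with Simple Linear Regression

regression <- read.csv("regression.csv", header = TRUE)
plot(regression$x, regression$y, pch = 16, xlab = "Actual", ylab = "Predicted")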

m1 <- lm(y ~ x, data = regression)
p1 <- predict(m1, newdata = regression)
plot(regression$x, p1, col = "blue", pch = "L")

m2 <- svm(y ~ x, regression)
p2 <- predict(m2, newdata = as.matrix(regression))
plot(regression$x, p2, col = "red", pch = "S")

Exercises

Predict one arbitrary data point

new.data <- data.frame(x = 11)
predict(m2, newdata = new.data)

##        1 
## 32.51965

predict(m2, newdata = data.frame(x = 11))

##        1 
## 32.51965

Predict two arbitrary data points

new.data <- data.frame(x = c(1, 12))
predict(m2, newdata = new.data)

##         1         2 
##  7.667638 38.941073

SVM

장성환

2018 7