Supervised Learning Algorithms

1. Supervised Learning Algorithms

1.1 ROC & AUC

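AUC is the area under the ROC curve, which plots the true positive rate against the false positive rate as the classification threshold varies.
AUC ranges from 0 to 1: a value near 0.5 means the model ranks cases no better than chance, and the closer it is to 1, the better the model.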
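
As a rough illustration of what the AUC measures, the sketch below computes it from ranks (the Mann-Whitney form) in base R. The function name `auc_rank` and the toy scores and labels are made up for this example and are not part of the lecture data.

```r
# Hypothetical helper: AUC as the probability that a randomly chosen positive
# case receives a higher score than a randomly chosen negative case.
auc_rank <- function(score, label) {
  r <- rank(score)                      # average ranks handle ties
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

score <- c(0.1, 0.4, 0.35, 0.8)  # toy predicted scores
label <- c(0, 0, 1, 1)           # toy true classes
auc_rank(score, label)           # 0.75
```

Three of the four positive-negative pairs are ordered correctly here, which is why the value is 0.75.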

1.2 Support Vector Machine

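A machine learning technique used to classify observations into two or more groups.
It is regarded as one of the more accurate classification algorithms.
It is also known to be relatively insensitive to outliers.
The basic idea is to find the optimal boundary that separates the data.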

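1.2.1 Hyperplane

Margin = the distance between the decision boundary and the data points closest to it
Support vectors = the data points closest to the boundary, which determine the margin
Hyperplane = the decision boundary, chosen so that this margin is as large as possible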
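
In standard notation the three terms can be written as follows. The symbols w (normal vector), b (offset) and the class coding y_i in {-1, +1} are assumptions introduced here; the notes themselves do not define them.

```latex
\[
  \text{hyperplane: } w^\top x + b = 0, \qquad
  \text{distance of } x_i \text{ to it: } \frac{|w^\top x_i + b|}{\lVert w \rVert}, \qquad
  \text{margin width: } \frac{2}{\lVert w \rVert}
  \ \text{ when } \ y_i\,(w^\top x_i + b) \ge 1 \text{ for all } i .
\]
```

The support vectors are the points for which the constraint holds with equality.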

1.3 Supervised Learning Algorithms

1.3.1 Data preparation

autoparts=read.csv("data/autoparts.csv",header = TRUE)
autoparts1=autoparts[autoparts$prod_no=="90784-76001",c(2:11)]
autoparts2=autoparts1[autoparts1$c_thickness<1000,]

autoparts2$y_faulty=ifelse((autoparts2$c_thickness<20)|(autoparts2$c_thickness>32),1,0)
# 0 = normal, 1 = faulty (c_thickness below 20 or above 32)
head(autoparts2,10)

##    fix_time a_speed b_speed separation s_separation rate_terms  mpa
## 1      85.5   0.611   1.715      242.0        657.6         95 78.2
## 2      86.2   0.606   1.708      244.7        657.1         95 77.9
## 3      86.0   0.609   1.715      242.7        657.5         95 78.0
## 4      86.1   0.610   1.718      241.9        657.3         95 78.2
## 5      86.1   0.603   1.704      242.5        657.3         95 77.9
## 6      86.3   0.606   1.707      244.5        656.9         95 77.9
## 7      86.5   0.606   1.701      243.1        656.9         95 78.2
## 8      86.4   0.607   1.707      243.1        657.3         95 77.5
## 9      86.3   0.604   1.711      245.2        656.9         95 77.8
## 10     86.0   0.608   1.696      248.0        657.3         95 77.5
##    load_time highpressure_time c_thickness y_faulty
## 1       18.1                58        24.7        0
## 2       18.2                58        22.5        0
## 3       18.1                82        24.1        0
## 4       18.1                74        25.1        0
## 5       18.2                56        24.5        0
## 6       18.0                78        22.9        0
## 7       18.1                55        24.3        0
## 8       18.1                57        23.9        0
## 9       18.0                50        22.2        0
## 10      18.1                60        19.0        1

1.3.2 Splitting the dataset

t_index=sample(1:nrow(autoparts2),size=nrow(autoparts2)*0.7) # sample 70% of the row indices
train=autoparts2[t_index,]
test=autoparts2[-t_index,]
nrow(train)/(nrow(train)+nrow(test))

## [1] 0.6999587

nrow(test)/(nrow(train)+nrow(test))

## [1] 0.3000413
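
Because `sample()` is random, the split and every number derived from it change from run to run. A minimal sketch of a reproducible 70/30 split is shown below; the seed value and the toy data frame `d` are arbitrary choices for illustration only.

```r
set.seed(42)                               # any fixed seed makes the split repeatable
d <- data.frame(id = 1:100, x = rnorm(100))

idx <- sample(seq_len(nrow(d)), size = floor(0.7 * nrow(d)))
train_d <- d[idx, ]
test_d  <- d[-idx, ]

c(train = nrow(train_d), test = nrow(test_d))  # 70 and 30 rows
```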

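1.3.3 Finding the optimal parameters: e1071::tune.svm(factor~v1+v2+v3…,gamma,cost)

gamma : the width parameter of the radial kernel. It controls how far the influence of a single training point reaches; a larger gamma gives a more sharply curved boundary.
cost : the penalty for margin violations, i.e. how much cost we are willing to pay to fit the model to the training data. A higher cost fits the training data more tightly and raises the risk of overfitting.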
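
To make the role of gamma concrete, the sketch below evaluates the radial kernel exp(-gamma * |u - v|^2) for one pair of points at several gamma values. The helper name `rbf_kernel` and the two points are illustrative only.

```r
# Radial (RBF) kernel: similarity between two points decays with squared distance,
# and gamma sets how fast it decays.
rbf_kernel <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))

u <- c(0, 0)
v <- c(1, 1)                      # squared distance between u and v is 2
gammas <- c(0.5, 1, 2)
k <- sapply(gammas, function(g) rbf_kernel(u, v, g))
k                                 # exp(-1), exp(-2), exp(-4)
```

The similarity shrinks as gamma grows, so with a large gamma each training point only affects its immediate neighborhood.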

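library(e1071) # provides svm() and tune.svm()
# grid search over gamma and cost (10-fold cross-validation by default)
# note: tuning on train only would keep the test rows out of the parameter selection
tune.svm(factor(y_faulty)
  ~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
  +load_time+highpressure_time,data=autoparts2,gamma=2^(-1:1),cost=2^(2:4))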
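
The notes do not show the tuning output, so here is a small sketch of how the selected values can be read from the returned object. It uses the built-in `iris` data rather than `autoparts`, purely so that it runs on its own; the seed is arbitrary.

```r
library(e1071)

set.seed(1)  # cross-validation folds are random
tuned <- tune.svm(Species ~ ., data = iris,
                  gamma = 2^(-1:1), cost = 2^(2:4))

tuned$best.parameters    # data frame holding the selected gamma and cost
tuned$best.performance   # cross-validated error of that combination
best_model <- tuned$best.model  # svm refitted with the selected values
```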

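1.3.4 Model 1: svm(kernel="radial")

Build the model with the optimal gamma and cost found above (radial is the default kernel of svm(), so it is not passed explicitly).

Model fitting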

m=svm(factor(y_faulty)
  ~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
  +load_time+highpressure_time,data=train,gamma=2,cost=16)
summary(m)

## 
## Call:
## svm(formula = factor(y_faulty) ~ fix_time + a_speed + b_speed + 
##     separation + s_separation + rate_terms + mpa + load_time + 
##     highpressure_time, data = train, gamma = 2, cost = 16)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  16 
##       gamma:  2 
## 
## Number of Support Vectors:  2213
## 
##  ( 1202 1011 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Predictions and accuracy on the test data

yhat_test=predict(m,test)
table=table(real=test$y_faulty,predict=yhat_test)
table

##     predict
## real    0    1
##    0 5559   90
##    1  172  710

(table[1,1]+table[2,2])/sum(table) # accuracy (overall correct classification rate)

## [1] 0.9598836

table[1,1]/(table[1,1]+table[1,2]) # correct classification rate for class 0 (normal, the majority class), i.e. specificity

## [1] 0.984068
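
Since faulty parts (class 1) are the minority, the rate for that class is worth computing alongside the two figures above. The sketch below rebuilds the confusion matrix printed above by hand; the object names are illustrative.

```r
# Confusion matrix of the radial-kernel model (rows = real, columns = predicted)
cm <- matrix(c(5559, 172, 90, 710), nrow = 2,
             dimnames = list(real = c("0", "1"), predict = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)        # all correct / all cases
specificity <- cm["0", "0"] / sum(cm["0", ])  # normal parts classified as normal
sensitivity <- cm["1", "1"] / sum(cm["1", ])  # faulty parts classified as faulty
round(c(accuracy = accuracy, specificity = specificity,
        sensitivity = sensitivity), 4)
```

The first two values reproduce the printed 0.9598836 and 0.984068; the sensitivity, 710/882, is noticeably lower, which the overall accuracy hides.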

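1.3.5 Model 2: svm(kernel="linear")

Build the model with the cost found above. The linear kernel has no gamma, so the gamma argument passed below has no effect on the fit.

Model fitting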

m2=svm(factor(y_faulty)
  ~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
  +load_time+highpressure_time,data=train,gamma=2,cost=16,kernel="linear")
summary(m2)

## 
## Call:
## svm(formula = factor(y_faulty) ~ fix_time + a_speed + b_speed + 
##     separation + s_separation + rate_terms + mpa + load_time + 
##     highpressure_time, data = train, gamma = 2, cost = 16, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  16 
##       gamma:  2 
## 
## Number of Support Vectors:  3195
## 
##  ( 1600 1595 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Predictions and accuracy on the test data

yhat_test2=predict(m2,test)
table2=table(real=test$y_faulty,predict=yhat_test2)
table2

##     predict
## real    0    1
##    0 5572   77
##    1  485  397

(table2[1,1]+table2[2,2])/sum(table2) # accuracy (overall correct classification rate)

## [1] 0.9139489

table2[1,1]/(table2[1,1]+table2[1,2]) # correct classification rate for class 0 (normal, the majority class), i.e. specificity

## [1] 0.9863693

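1.3.6 Model 3: default svm()

Build the model with the default settings (radial kernel, cost = 1, gamma = 1/(number of predictors) = 1/9) instead of the tuned values.

Model fitting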

m3=svm(factor(y_faulty)
  ~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
  +load_time+highpressure_time,data=train)
summary(m3)

## 
## Call:
## svm(formula = factor(y_faulty) ~ fix_time + a_speed + b_speed + 
##     separation + s_separation + rate_terms + mpa + load_time + 
##     highpressure_time, data = train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.1111111 
## 
## Number of Support Vectors:  3375
## 
##  ( 1701 1674 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Predictions and accuracy on the test data

yhat_test3=predict(m3,test)
table3=table(real=test$y_faulty,predict=yhat_test3)
table3

##     predict
## real    0    1
##    0 5592   57
##    1  491  391

(table3[1,1]+table3[2,2])/sum(table3) # accuracy (overall correct classification rate)

## [1] 0.9160925

table3[1,1]/(table3[1,1]+table3[1,2]) # correct classification rate for class 0 (normal, the majority class), i.e. specificity

## [1] 0.9899097

1.3.7 ROC,AUC

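library(Epi) # ROC() is provided by the Epi package
# yhat_test holds predicted class labels (a factor), so it is converted to 0/1 numbers first
ROC(test=as.numeric(as.character(yhat_test)),stat=test$y_faulty,plot="ROC",AUC=T,main="SVM")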
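
An ROC curve drawn from predicted labels has only one operating point. One possible refinement, sketched here on a two-class subset of `iris` rather than on `autoparts`, is to ask `predict()` for the SVM decision values and use those as the score. Whether large values point to the first or the second class depends on the level order, so the sign should be checked before use.

```r
library(e1071)

d <- droplevels(iris[iris$Species != "setosa", ])  # two classes remain
fit <- svm(Species ~ ., data = d, kernel = "radial")

pred <- predict(fit, d, decision.values = TRUE)
dv <- as.numeric(attr(pred, "decision.values"))    # one continuous score per row
summary(dv)
```

A vector like `dv` could then be passed as the `test` argument of `ROC()` in place of the class labels.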

1.3.8 Predicting new data (1): a single observation

new.data=data.frame(fix_time=87,a_speed=0.609,b_speed=1.715,separation=242.7,s_separation=657.5,rate_terms=95,mpa=78,load_time=18.1,highpressure_time=82)
predict(m,newdata = new.data)

## 1 
## 0 
## Levels: 0 1

1.3.9 Predicting new data (2): multiple observations

Vectors

new.data=data.frame(fix_time=c(87,85.6),a_speed=c(0.609,0.472),b_speed=c(1.715,1.685),separation=c(242.7,243.4),s_separation=c(657.5,657.9),rate_terms=c(95,95),mpa=c(78,28.8),load_time=c(18.1,18.2),highpressure_time=c(82,60))
predict(m,newdata = new.data)

## 1 2 
## 0 1 
## Levels: 0 1

Data frame

new.data2=data.frame(fix_time=test$fix_time,a_speed=test$a_speed,b_speed=test$b_speed,separation=test$separation,s_separation=test$s_separation,rate_terms=test$rate_terms,mpa=test$mpa,load_time=test$load_time,highpressure_time=test$highpressure_time)
table(predict(m,newdata = new.data2))

## 
##    0    1 
## 5731  800

1.4 Support Vector Regression

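The support vector approach can also be applied when the dependent variable is continuous.
In that case it gives results similar to regression analysis.
It builds on the same idea as classification, where the line that maximizes the gap between observations of different classes is found, and uses the resulting line as the fitted curve.
The prediction is a continuous value rather than a binary one.
The higher the cost, the more the fit accommodates outliers, so the risk of overfitting rises.
The lower the cost, the less the fit accommodates outliers, so the risk of underfitting rises.
The larger gamma is, the smaller the kernel variance around each observation: each observation influences only its close neighbors, the fitted curve bends more sharply, and the risk of overfitting rises.
The smaller gamma is, the larger the kernel variance: each observation's influence spreads over distant neighbors, the fitted curve becomes flatter, and the risk of underfitting rises.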
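
The summaries below report `SVM-Type: eps-regression` with `epsilon: 0.1`. As a rough sketch of what that setting means, eps-regression uses an epsilon-insensitive loss that ignores residuals smaller than epsilon. The helper name and the toy numbers are illustrative.

```r
# Epsilon-insensitive loss: errors inside the epsilon tube cost nothing,
# larger errors are penalized linearly by the amount they exceed epsilon.
eps_insensitive <- function(y, yhat, epsilon = 0.1) {
  pmax(0, abs(y - yhat) - epsilon)
}

y    <- c(1, 1, 1)
yhat <- c(1.05, 1.3, 0.5)
loss <- eps_insensitive(y, yhat)
loss   # 0 for the first point, about 0.2 and 0.4 for the others
```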

1.4.1 Data preparation

autoparts=read.csv("data/autoparts.csv",header = TRUE)
autoparts1=autoparts[autoparts$prod_no=="90784-76001",c(2:11)]
autoparts2=autoparts1[autoparts1$c_thickness<1000,]

autoparts2$y_faulty=ifelse((autoparts2$c_thickness<20)|(autoparts2$c_thickness>32),1,0)
# 0 = normal, 1 = faulty (c_thickness below 20 or above 32)
head(autoparts2,10)

##    fix_time a_speed b_speed separation s_separation rate_terms  mpa
## 1      85.5   0.611   1.715      242.0        657.6         95 78.2
## 2      86.2   0.606   1.708      244.7        657.1         95 77.9
## 3      86.0   0.609   1.715      242.7        657.5         95 78.0
## 4      86.1   0.610   1.718      241.9        657.3         95 78.2
## 5      86.1   0.603   1.704      242.5        657.3         95 77.9
## 6      86.3   0.606   1.707      244.5        656.9         95 77.9
## 7      86.5   0.606   1.701      243.1        656.9         95 78.2
## 8      86.4   0.607   1.707      243.1        657.3         95 77.5
## 9      86.3   0.604   1.711      245.2        656.9         95 77.8
## 10     86.0   0.608   1.696      248.0        657.3         95 77.5
##    load_time highpressure_time c_thickness y_faulty
## 1       18.1                58        24.7        0
## 2       18.2                58        22.5        0
## 3       18.1                82        24.1        0
## 4       18.1                74        25.1        0
## 5       18.2                56        24.5        0
## 6       18.0                78        22.9        0
## 7       18.1                55        24.3        0
## 8       18.1                57        23.9        0
## 9       18.0                50        22.2        0
## 10      18.1                60        19.0        1

t_index=sample(1:nrow(autoparts2),size=nrow(autoparts2)*0.7) # sample 70% of the row indices
train=autoparts2[t_index,]
test=autoparts2[-t_index,]
nrow(train)/(nrow(train)+nrow(test))

## [1] 0.6999587

nrow(test)/(nrow(train)+nrow(test))

## [1] 0.3000413

1.4.2 Support vector regression model 1

Model fitting

m=svm(c_thickness~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
  +load_time+highpressure_time,data=train,gamma=2,cost=16)
summary(m)

## 
## Call:
## svm(formula = c_thickness ~ fix_time + a_speed + b_speed + separation + 
##     s_separation + rate_terms + mpa + load_time + highpressure_time, 
##     data = train, gamma = 2, cost = 16)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  16 
##       gamma:  2 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  2897

Check the plot

yhat_test=predict(m,test)
plot(x=test$c_thickness,y=yhat_test,main="SVR")

mse=mean((yhat_test-test$c_thickness)^2) # square the residuals, then average
mse
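
A mean squared error can never be negative, so the position of the square matters. The toy sketch below (hypothetical helper `mse_fun` and made-up numbers) contrasts the correct form with a misplaced parenthesis.

```r
mse_fun <- function(actual, predicted) mean((actual - predicted)^2)

actual    <- c(1, 2, 3)
predicted <- c(1, 2, 5)

correct <- mse_fun(actual, predicted)   # (0 + 0 + 4) / 3
wrong   <- mean(predicted - actual^2)   # squares only the actual values: not an MSE
c(correct = correct, wrong = wrong)     # 1.333... and -2
```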

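Comparison with multiple linear regression (1): model fitting

Note that the linear model below is fitted on the test data, whereas the SVR was trained on train, so the comparison on test favors lm.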

m2=lm(c_thickness~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
  +load_time+highpressure_time,data=test)
summary(m2)

## 
## Call:
## lm(formula = c_thickness ~ fix_time + a_speed + b_speed + separation + 
##     s_separation + rate_terms + mpa + load_time + highpressure_time, 
##     data = test)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.0967  -0.5954  -0.0249   0.5677  26.4781 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        7.150e+02  6.359e+00  112.444  < 2e-16 ***
## fix_time           2.688e-02  9.589e-03    2.803  0.00507 ** 
## a_speed           -1.726e+01  8.127e-01  -21.238  < 2e-16 ***
## b_speed            1.872e+00  2.844e-01    6.583 4.97e-11 ***
## separation        -7.537e-01  6.858e-03 -109.913  < 2e-16 ***
## s_separation      -7.441e-01  6.935e-03 -107.293  < 2e-16 ***
## rate_terms         5.966e-03  6.715e-03    0.888  0.37433    
## mpa               -1.603e-01  2.770e-03  -57.877  < 2e-16 ***
## load_time         -1.276e-01  1.687e-02   -7.566 4.39e-14 ***
## highpressure_time  3.274e-05  2.877e-05    1.138  0.25516    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.86 on 6521 degrees of freedom
## Multiple R-squared:  0.7743, Adjusted R-squared:  0.774 
## F-statistic:  2485 on 9 and 6521 DF,  p-value: < 2.2e-16

Comparison with multiple linear regression (2): MSE and plots

SVR tends to predict poorly at both ends of the plot.

yhat_test2=predict(m2,test)
par(mfrow=c(1,2))
plot(x=test$c_thickness,y=yhat_test,main="SVR")
plot(x=test$c_thickness,y=yhat_test2,main="lm",xlab="actual",ylab="predicted")

mse=mean((yhat_test2-test$c_thickness)^2) # the MSE of lm is more than twice that of SVR

Comparison with simple linear regression

regression=read.csv("data/regression.csv",header = TRUE)
plot(regression$x,regression$y,pch=16,xlab="x",ylab="y") # raw data: x against y

# LM

m1=lm(y~x,data=regression)
p1=predict(m1,newdata=regression)
points(regression$x,p1,col="blue",pch="L")

# SVR

m2=svm(y~x,data=regression)
p2=predict(m2,newdata=regression) # a data frame is enough for a formula-based model
points(regression$x,p2,col="red",pch="S")

1.4.3 Exercise

# one observation

new.data=data.frame(x=10)
predict(m2,newdata=new.data)

##        1 
## 25.92276

# two observations

new.data=data.frame(x=c(10,20))
predict(m2,newdata=new.data)

##        1        2 
## 25.92276 59.96128

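1.5 Logistic regression

Used when the dependent variable is binary or categorical.
Odds = p/(1-p), where p is the conditional probability that y is a success given x
Odds ratio = the ratio of two odds, for example the odds after a one-unit increase in x divided by the odds before it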
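
A small numeric sketch of these definitions, using made-up probabilities (0.8 and 0.5 are arbitrary):

```r
p <- 0.8
odds     <- p / (1 - p)   # 4: success is four times as likely as failure
log_odds <- log(odds)     # the logit, which logistic regression models linearly in x

# plogis() is the inverse of the logit and recovers the probability
back_to_p <- plogis(log_odds)

# odds ratio comparing p = 0.8 with p = 0.5 (whose odds are 1)
odds_ratio <- odds / (0.5 / (1 - 0.5))
c(odds = odds, log_odds = log_odds, back_to_p = back_to_p, odds_ratio = odds_ratio)
```

In a fitted model, `exp()` of a coefficient gives the odds ratio for a one-unit increase in that predictor.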

1.5.1 Data preparation

autoparts=read.csv("data/autoparts.csv",header = TRUE)
autoparts1=autoparts[autoparts$prod_no=="90784-76001",c(2:11)]
autoparts2=autoparts1[autoparts1$c_thickness<1000,]

autoparts2$y_faulty=ifelse(autoparts2$c_thickness<20|autoparts2$c_thickness>32,1,0)
autoparts2$y_faulty=as.factor(autoparts2$y_faulty) # convert to factor

1.5.2 Frequency table

table=table(autoparts2$y_faulty)

table[2]/sum(table)*100 # defect rate (%)

##        1 
## 13.05646

1.5.3 Model fitting

m=glm(y_faulty~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
  +load_time+highpressure_time,data=autoparts2,family = binomial(logit))
summary(m)

## 
## Call:
## glm(formula = y_faulty ~ fix_time + a_speed + b_speed + separation + 
##     s_separation + rate_terms + mpa + load_time + highpressure_time, 
##     family = binomial(logit), data = autoparts2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.3641  -0.3738  -0.2150  -0.1184   5.1771  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -4.511e+02  9.981e+00 -45.197  < 2e-16 ***
## fix_time          -3.034e-02  9.623e-03  -3.153 0.001617 ** 
## a_speed            1.965e+01  9.472e-01  20.743  < 2e-16 ***
## b_speed           -1.854e+00  3.970e-01  -4.670 3.01e-06 ***
## separation         5.322e-01  1.121e-02  47.476  < 2e-16 ***
## s_separation       4.957e-01  1.094e-02  45.320  < 2e-16 ***
## rate_terms        -2.332e-02  6.817e-03  -3.422 0.000623 ***
## mpa               -1.416e-01  3.367e-03 -42.043  < 2e-16 ***
## load_time          1.835e-03  1.508e-02   0.122 0.903124    
## highpressure_time  1.787e-04  1.863e-05   9.595  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 16867.6  on 21766  degrees of freedom
## Residual deviance:  9993.8  on 21757  degrees of freedom
## AIC: 10014
## 
## Number of Fisher Scoring iterations: 6

1.5.4 Splitting the dataset (train:test=7:3)

t_index=sample(1:nrow(autoparts2),size=nrow(autoparts2)*0.7)
train=autoparts2[t_index,]
test=autoparts2[-t_index,]

1.5.5 Fitting the model on train

m=glm(y_faulty~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
  +load_time+highpressure_time,data=train,family = binomial(logit))
summary(m)

## 
## Call:
## glm(formula = y_faulty ~ fix_time + a_speed + b_speed + separation + 
##     s_separation + rate_terms + mpa + load_time + highpressure_time, 
##     family = binomial(logit), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.4592  -0.3680  -0.2067  -0.1102   4.5351  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -4.699e+02  1.228e+01 -38.275  < 2e-16 ***
## fix_time          -3.090e-02  1.174e-02  -2.631  0.00851 ** 
## a_speed            2.090e+01  1.144e+00  18.273  < 2e-16 ***
## b_speed           -2.151e+00  4.914e-01  -4.377  1.2e-05 ***
## separation         5.543e-01  1.380e-02  40.171  < 2e-16 ***
## s_separation       5.158e-01  1.342e-02  38.432  < 2e-16 ***
## rate_terms        -2.502e-02  8.234e-03  -3.038  0.00238 ** 
## mpa               -1.470e-01  4.113e-03 -35.746  < 2e-16 ***
## load_time          3.012e-02  1.877e-02   1.605  0.10854    
## highpressure_time  1.868e-04  2.106e-05   8.867  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 11824.5  on 15235  degrees of freedom
## Residual deviance:  6878.8  on 15226  degrees of freedom
## AIC: 6898.8
## 
## Number of Fisher Scoring iterations: 6

head(m$fitted.values)

##       3424       9307      25135      17684       1601       4643 
## 0.13208652 0.11175087 0.01575692 0.01822328 0.08647116 0.01579807

1.5.6 Setting the threshold

yhat=ifelse(m$fitted.values>=0.5,1,0)
head(yhat)

##  3424  9307 25135 17684  1601  4643 
##     0     0     0     0     0     0

1.5.7 Confusion table

table=table(real=train$y_faulty,predict=yhat)
table

##     predict
## real     0     1
##    0 13003   239
##    1  1076   918

1.5.8 Predicting on the test data

yhat_test=predict(m,test,type="response")
head(yhat_test,n=10)

##          1          5         11         13         16         22 
## 0.02955854 0.02976147 0.39680302 0.23728760 0.84215663 0.30368969 
##         27         28         29         40 
## 0.07477993 0.03763883 0.26040496 0.06784719

1.5.9 ROC & AUC

ROC(test=yhat_test,stat = test$y_faulty,plot="ROC",AUC=T,main="logistic regression")

1.5.10 Predicting new data (1): a single observation

new.data=data.frame(fix_time=87,a_speed=0.609,b_speed=1.715,separation=242.7,s_separation=657.5,rate_terms=95,mpa=78,load_time=18.1,highpressure_time=82)
possibilityof1=predict(m,newdata=new.data,type="response")
ifelse(possibilityof1>=0.5,1,0)

## 1 
## 0

1.5.11 Predicting new data (2): multiple observations

Vectors

new.data=data.frame(fix_time=c(87,85.6),a_speed=c(0.609,0.472),b_speed=c(1.715,1.685)
                    ,separation=c(242.7,243.4),s_separation=c(657.5,657.9)
                    ,rate_terms=c(95,95),mpa=c(78,28.8),load_time=c(18.1,18.2)
                    ,highpressure_time=c(82,60))
possibilityof1=predict(m,newdata=new.data,type="response")
ifelse(possibilityof1>=0.5,1,0)

## 1 2 
## 0 1

Data frame

new.data2=data.frame(fix_time=test$fix_time,a_speed=test$a_speed,b_speed=test$b_speed,separation=test$separation,s_separation=test$s_separation,rate_terms=test$rate_terms,mpa=test$mpa,load_time=test$load_time,highpressure_time=test$highpressure_time)
possibilityof1=predict(m,newdata=new.data2,type="response")
head(ifelse(possibilityof1>=0.5,1,0),5)

## 1 2 3 4 5 
## 0 0 0 0 1

1.6 Multinomial logistic regression

1.6.1 Creating the multinomial dependent variable

autoparts2$g_class=as.factor(ifelse(autoparts2$c_thickness<20,1,ifelse(autoparts2$c_thickness<32,2,3)))
table(autoparts2$g_class)

## 
##     1     2     3 
##  2141 18914   712

1.6.2 Splitting the dataset (train:test=7:3)

t_index=sample(1:nrow(autoparts2),size=nrow(autoparts2)*0.7) # sample 70% of the row indices
train=autoparts2[t_index,]
test=autoparts2[-t_index,]

1.6.3 Fitting the model on train

library(nnet) # provides multinom()
m=multinom(g_class~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
  +load_time+highpressure_time,data=train)

## # weights:  33 (20 variable)
## initial  value 16738.456830 
## iter  10 value 5406.520958
## iter  20 value 4705.962846
## iter  30 value 3734.374204
## iter  40 value 3098.445042
## iter  50 value 3051.399959
## iter  60 value 2971.597886
## iter  70 value 2545.453547
## iter  80 value 1961.458206
## iter  90 value 1960.001423
## final  value 1959.994245 
## converged

summary(m)

## Call:
## multinom(formula = g_class ~ fix_time + a_speed + b_speed + separation + 
##     s_separation + rate_terms + mpa + load_time + highpressure_time, 
##     data = train)
## 
## Coefficients:
##   (Intercept)  fix_time   a_speed  b_speed separation s_separation
## 2    1792.723 0.1887038 -15.13607 4.145293  -1.934316    -1.946824
## 3    1884.479 0.1901040 -25.65636 3.712879  -2.016982    -2.037163
##   rate_terms        mpa  load_time highpressure_time
## 2 0.08317144 -0.6875959 -0.1520803      0.0002742946
## 3 0.06217785 -0.8129465 -0.1678857      0.0003011845
## 
## Std. Errors:
##    (Intercept)   fix_time     a_speed      b_speed  separation
## 2 1.270423e-05 0.04913755 0.001059229 2.507332e-04 0.007366989
## 3 1.859757e-05 0.05019033 0.000202918 2.348695e-05 0.007697641
##   s_separation rate_terms        mpa  load_time highpressure_time
## 2  0.004573189 0.01451830 0.01333213 0.07239175      0.0001547160
## 3  0.004920469 0.02227692 0.01426074 0.07394874      0.0001572982
## 
## Residual Deviance: 3919.988 
## AIC: 3959.988

head(m$fitted.values)

##                  1         2            3
## 23516 6.557632e-05 0.9991734 0.0007609939
## 2540  1.364256e-03 0.9951095 0.0035261950
## 25820 6.068365e-04 0.9986313 0.0007618756
## 2641  2.993092e-05 0.9962637 0.0037063585
## 1970  3.932116e-07 0.5680646 0.4319349761
## 33215 5.137565e-05 0.9980095 0.0019390832

1.6.4 Predictions and accuracy on the test data

yhat_test=predict(m,test)
table=table(real=test$g_class,predict=yhat_test)
table

##     predict
## real    1    2    3
##    1  491  142    1
##    2   64 5535   99
##    3    0   35  164

(table[1,1]+table[2,2]+table[3,3])/sum(table) # accuracy: diagonal cells / all cells

## [1] 0.9477875
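
With three classes the overall accuracy can hide large differences between classes. The sketch below re-enters the confusion matrix printed above by hand and adds the per-class rates; the object names are illustrative.

```r
# 3 x 3 confusion matrix from above (rows = real class, columns = predicted class)
cm3 <- matrix(c(491, 64, 0, 142, 5535, 35, 1, 99, 164), nrow = 3,
              dimnames = list(real = c("1", "2", "3"), predict = c("1", "2", "3")))

accuracy  <- sum(diag(cm3)) / sum(cm3)  # same quantity as the line above
per_class <- diag(cm3) / rowSums(cm3)   # share of each real class predicted correctly
round(accuracy, 7)
round(per_class, 3)
```

The overall value matches the printed 0.9477875, while the two faulty classes (1 and 3) are recovered less often than the normal class 2.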

1.6.5 Exercise 1

Fit a logistic regression to the occupancy data / compute predictions and accuracy on the test data / compute ROC and AUC

Loading the data

train=read.csv("data/occupancy_train.csv")
head(train) # check the variables

##                  date Temperature Humidity Light    CO2 HumidityRatio
## 1 2015-02-04 17:51:00       23.18  27.2720 426.0 721.25   0.004792988
## 2 2015-02-04 17:51:59       23.15  27.2675 429.5 714.00   0.004783441
## 3 2015-02-04 17:53:00       23.15  27.2450 426.0 713.50   0.004779464
## 4 2015-02-04 17:54:00       23.15  27.2000 426.0 708.25   0.004771509
## 5 2015-02-04 17:55:00       23.10  27.2000 426.0 704.50   0.004756993
## 6 2015-02-04 17:55:59       23.10  27.2000 419.0 701.00   0.004756993
##   Occupancy
## 1         1
## 2         1
## 3         1
## 4         1
## 5         1
## 6         1

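Fitting the logistic model

The call below omits family=binomial, so glm() fits a Gaussian linear model (see the "gaussian family" line in the summary). This is why some predicted values further down exceed 1; a logistic fit requires family=binomial.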

m=glm(Occupancy~Light+CO2,data=train)
summary(m)

## 
## Call:
## glm(formula = Occupancy ~ Light + CO2, data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.50187   0.00353   0.02293   0.02650   0.76880  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.376e-01  4.196e-03  -32.79   <2e-16 ***
## Light        1.632e-03  1.226e-05  133.07   <2e-16 ***
## CO2          2.554e-04  7.598e-06   33.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.02596176)
## 
##     Null deviance: 1361.88  on 8142  degrees of freedom
## Residual deviance:  211.33  on 8140  degrees of freedom
## AIC: -6617.3
## 
## Number of Fisher Scoring iterations: 2
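
For comparison, a logistic fit passes `family = binomial` to `glm()`. The sketch below uses simulated data (the variables `x` and `y` and the coefficients 0.5 and 2 are invented here) so that it runs without the occupancy files; it is not a rerun of the exercise.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 2 * x))   # simulated 0/1 outcome

fit <- glm(y ~ x, family = binomial)     # logistic regression
p <- predict(fit, type = "response")     # predicted probabilities
range(p)                                 # stays inside (0, 1)
```

With the binomial family the predictions are probabilities, so a 0.5 threshold has its usual meaning.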

Predictions and accuracy on the test data

Setting the threshold

yhat=ifelse(m$fitted.values>=0.5,1,0)
head(yhat)

## 1 2 3 4 5 6 
## 1 1 1 1 1 1

Confusion table

table=table(real=train$Occupancy,predict=yhat)
table

##     predict
## real    0    1
##    0 6227  187
##    1    3 1726

Predicting on the test data

test=read.csv("data/occupancy_test.csv")
yhat_test=predict(m,test,type="response")
head(yhat_test,n=10)

##         1         2         3         4         5         6         7 
## 1.0086226 1.0003872 0.9933986 0.8659269 0.8586089 0.9920646 0.9413488 
##         8         9        10 
## 0.8964930 0.8442296 0.9011894

ROC & AUC

ROC(test=yhat_test,stat = test$Occupancy,plot="ROC",AUC=T,main="logistic regression")

1.6.6 Exercise 2

Fit an SVM to the occupancy data (gamma=default, cost=10) / compute predictions and accuracy on the test data / compute ROC and AUC

Loading the data

train=read.csv("data/occupancy_train.csv")
test=read.csv("data/occupancy_test.csv")
head(train) # check the variables

##                  date Temperature Humidity Light    CO2 HumidityRatio
## 1 2015-02-04 17:51:00       23.18  27.2720 426.0 721.25   0.004792988
## 2 2015-02-04 17:51:59       23.15  27.2675 429.5 714.00   0.004783441
## 3 2015-02-04 17:53:00       23.15  27.2450 426.0 713.50   0.004779464
## 4 2015-02-04 17:54:00       23.15  27.2000 426.0 708.25   0.004771509
## 5 2015-02-04 17:55:00       23.10  27.2000 426.0 704.50   0.004756993
## 6 2015-02-04 17:55:59       23.10  27.2000 419.0 701.00   0.004756993
##   Occupancy
## 1         1
## 2         1
## 3         1
## 4         1
## 5         1
## 6         1

Fitting the SVM

m=svm(as.factor(Occupancy)~Light+CO2,data=train,cost=10,kernel="linear")
summary(m)

## 
## Call:
## svm(formula = as.factor(Occupancy) ~ Light + CO2, data = train, 
##     cost = 10, kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
##       gamma:  0.5 
## 
## Number of Support Vectors:  443
## 
##  ( 222 221 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Predictions and accuracy on the test data

yhat_test=predict(m,test)
table=table(real=test$Occupancy,predict=yhat_test)
table

##     predict
## real    0    1
##    0 1638   55
##    1    3  969

(table[1,1]+table[2,2])/sum(table) # accuracy (overall correct classification rate)

## [1] 0.9782364

table[1,1]/(table[1,1]+table[1,2]) # correct classification rate for class 0 (unoccupied, the majority class), i.e. specificity

## [1] 0.9675133

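ROC, AUC

# yhat_test holds predicted class labels (a factor), so it is converted to 0/1 numbers first
ROC(test=as.numeric(as.character(yhat_test)),stat=test$Occupancy,plot="ROC",AUC=T,main="SVM")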

LASSO

train=read.csv("data/occupancy_train.csv")

xmat=as.matrix(train[c(2,3,4,5)])
head(xmat)

##      Temperature Humidity Light    CO2
## [1,]       23.18  27.2720 426.0 721.25
## [2,]       23.15  27.2675 429.5 714.00
## [3,]       23.15  27.2450 426.0 713.50
## [4,]       23.15  27.2000 426.0 708.25
## [5,]       23.10  27.2000 426.0 704.50
## [6,]       23.10  27.2000 419.0 701.00

yvec=train$Occupancy

library(glmnet)

## Loading required package: Matrix

## Loading required package: foreach

## Loaded glmnet 2.0-16

fit.lasso=glmnet(x=xmat,y=yvec,alpha = 1,nlambda = 100) # generate 100 lambda values
fit.lasso.cv=cv.glmnet(x=xmat,y=yvec,nfolds = 10,alpha=1,lambda = fit.lasso$lambda)
plot(fit.lasso.cv)

fit.lasso.param=fit.lasso.cv$lambda.min # store the optimal lambda under a separate name
fit.lasso.tune=glmnet(x=xmat,y=yvec,alpha=1,lambda = fit.lasso.param) # final LASSO model using the optimal lambda
coef(fit.lasso.tune)

## 5 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept)  1.0788878970
## Temperature -0.0592480903
## Humidity    -0.0020021089
## Light        0.0017578326
## CO2          0.0003239741

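Supervised Learning Algorithms

Kwang-ryeol Baek (백광렬), Department of Statistics, Konkuk University - 2018 Big Data Youth Talent program

July 16, 2018 (Day 11)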
