1. Supervised Learning Algorithms
1.1 ROC & AUC
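- AUC is the area under the ROC curve.
- The closer the AUC is to 1, the better the model separates the two classes; an AUC of 0.5 corresponds to random guessing.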
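The definition above can be made concrete with a small sketch. It computes the AUC through the rank-sum (Mann-Whitney) identity in base R; the scores and labels are made up for illustration, and `auc_rank` is a hypothetical helper name, not a function from any package used in these notes.

```r
# AUC via the rank-sum identity: the probability that a randomly chosen
# positive receives a higher score than a randomly chosen negative.
auc_rank <- function(score, label) {
  # label: 1 = positive, 0 = negative
  r <- rank(score)                 # average ranks are used for ties
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)   # made-up model scores
label <- c(1,   1,   0,   1,   0,   0)     # made-up true classes
auc_toy <- auc_rank(score, label)
print(auc_toy)                              # 8 of the 9 positive/negative pairs are ordered correctly
```

Counting pairs by hand gives the same answer: of the 3 x 3 positive/negative pairs, only (0.6, 0.7) is ordered wrongly.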
1.2 Support Vector Machine
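- A machine learning technique for classifying observations into two or more groups.
- Often regarded as one of the more accurate classification algorithms.
- It is also known to be relatively insensitive to outliers.
- The basic idea is to find the optimal boundary that separates the data.
1.2.1 Hyperplane
- Margin = the distance between the boundary and the data points closest to it
- Support vectors = the data points closest to the boundary, which determine the margin
- Hyperplane = the separating boundary, chosen so that the margin is as large as possible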
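As a rough illustration of these three terms, the sketch below measures the distance from a few points to a fixed hyperplane w . x + b = 0 in base R. The vector `w`, the offset `b`, and the points in `X` are all made-up values; an actual SVM would search for the `w` and `b` that maximize the margin, which this sketch does not do.

```r
# Signed distance from each point to the hyperplane w . x + b = 0.
w <- c(1, 1)                                      # hypothetical normal vector
b <- -3                                           # hypothetical offset
X <- rbind(c(1, 1), c(3, 2), c(0, 1), c(4, 4))    # made-up observations (one per row)

dist_to_plane <- as.vector(X %*% w + b) / sqrt(sum(w^2))

# For this fixed hyperplane, the margin is the distance of the nearest point,
# and that nearest point plays the role of a support vector.
margin  <- min(abs(dist_to_plane))
closest <- which(abs(dist_to_plane) == margin)
print(dist_to_plane)
print(margin)
print(closest)
```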
1.3 Supervised Learning Algorithms
1.3.1 Data preparation
autoparts=read.csv("data/autoparts.csv",header = TRUE)
autoparts1=autoparts[autoparts$prod_no=="90784-76001",c(2:11)]
autoparts2=autoparts1[autoparts1$c_thickness<1000,]
autoparts2$y_faulty=ifelse((autoparts2$c_thickness<20)|(autoparts2$c_thickness>32),1,0)
# 0 = normal / 1 = faulty
head(autoparts2,10)
## fix_time a_speed b_speed separation s_separation rate_terms mpa
## 1 85.5 0.611 1.715 242.0 657.6 95 78.2
## 2 86.2 0.606 1.708 244.7 657.1 95 77.9
## 3 86.0 0.609 1.715 242.7 657.5 95 78.0
## 4 86.1 0.610 1.718 241.9 657.3 95 78.2
## 5 86.1 0.603 1.704 242.5 657.3 95 77.9
## 6 86.3 0.606 1.707 244.5 656.9 95 77.9
## 7 86.5 0.606 1.701 243.1 656.9 95 78.2
## 8 86.4 0.607 1.707 243.1 657.3 95 77.5
## 9 86.3 0.604 1.711 245.2 656.9 95 77.8
## 10 86.0 0.608 1.696 248.0 657.3 95 77.5
## load_time highpressure_time c_thickness y_faulty
## 1 18.1 58 24.7 0
## 2 18.2 58 22.5 0
## 3 18.1 82 24.1 0
## 4 18.1 74 25.1 0
## 5 18.2 56 24.5 0
## 6 18.0 78 22.9 0
## 7 18.1 55 24.3 0
## 8 18.1 57 23.9 0
## 9 18.0 50 22.2 0
## 10 18.1 60 19.0 1
1.3.2 Splitting the dataset
t_index=sample(1:nrow(autoparts2),size=nrow(autoparts2)*0.7) # sample row indices (70%)
train=autoparts2[t_index,]
test=autoparts2[-t_index,]
nrow(train)/(nrow(train)+nrow(test))
## [1] 0.6999587
nrow(test)/(nrow(train)+nrow(test))
## [1] 0.3000413
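Because sample() draws different rows on every run, the split above is not reproducible as written. A minimal sketch of a reproducible 70/30 split follows; the seed value and the stand-in data frame `df` are assumptions for illustration and are not part of the original analysis.

```r
set.seed(1)                                         # hypothetical seed; any fixed value works
df <- data.frame(id = 1:10, y = rep(c(0, 1), 5))    # stand-in for autoparts2

idx <- sample(seq_len(nrow(df)), size = floor(nrow(df) * 0.7))
train_demo <- df[idx, ]
test_demo  <- df[-idx, ]

nrow(train_demo)                                    # 7 rows for training
nrow(test_demo)                                     # 3 rows for testing
```

Fixing the seed before the split also keeps later tables and accuracy figures stable between runs.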
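1.3.3 Finding optimal parameter values: e1071::tune.svm(factor~v1+v2+v3…,gamma,cost)
- gamma : the radial kernel parameter; it sets how far the influence of a single observation reaches, so a larger gamma yields a more curved decision boundary
- cost : the penalty for violating the margin; a larger cost fits the training data more closely and raises the risk of overfitting. In other words, how much cost are we willing to accept to fit the model to the training data?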
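To see what gamma does in the radial kernel, the sketch below evaluates exp(-gamma * ||x - y||^2) for one pair of points over the same gamma grid that is tuned in this section. The two points and the helper name `rbf_kernel` are made up for illustration.

```r
# Radial (RBF) kernel: similarity decays with squared distance, scaled by gamma.
rbf_kernel <- function(x, y, gamma) exp(-gamma * sum((x - y)^2))

x1 <- c(0, 0)
x2 <- c(1, 1)                       # squared distance to x1 is 2
gammas <- 2^(-1:1)                  # 0.5, 1, 2
k_vals <- sapply(gammas, function(g) rbf_kernel(x1, x2, g))
print(data.frame(gamma = gammas, similarity = k_vals))
```

The similarity falls as gamma grows, so with a large gamma each observation only influences its immediate neighborhood and the boundary can bend around individual points.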
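library(e1071) # provides tune.svm() and svm()
# note: this tunes on all of autoparts2, so the rows later used as test data also influence the chosen parameters
tune.svm(factor(y_faulty)
~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
+load_time+highpressure_time,data=autoparts2,gamma=2^(-1:1),cost=2^(2:4))
1.3.4 Model 1: svm(kernel="radial")
- Build the model using the optimal gamma and cost found above
- Model creation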
m=svm(factor(y_faulty)
~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
+load_time+highpressure_time,data=train,gamma=2,cost=16)
summary(m)
##
## Call:
## svm(formula = factor(y_faulty) ~ fix_time + a_speed + b_speed +
## separation + s_separation + rate_terms + mpa + load_time +
## highpressure_time, data = train, gamma = 2, cost = 16)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 16
## gamma: 2
##
## Number of Support Vectors: 2213
##
## ( 1202 1011 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
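- Predictions and accuracy on the test data
yhat_test=predict(m,test)
table=table(real=test$y_faulty,predict=yhat_test)
table
## predict
## real 0 1
## 0 5559 90
## 1 172 710
(table[1,1]+table[2,2])/sum(table) # overall correct classification rate (accuracy)
## [1] 0.9598836
table[1,1]/(table[1,1]+table[1,2]) # correct classification rate for class 0 (normal); the minority (faulty) class rate is table[2,2]/(table[2,1]+table[2,2])
## [1] 0.984068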
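The rates above can be read off the confusion matrix in one go. The sketch below re-enters the four counts printed for model 1 and computes accuracy, the rate for class 0 (specificity), and the rate for class 1 (sensitivity); the object name `cm` is an assumption chosen so that it does not shadow the table() function.

```r
# Counts copied from the model 1 confusion matrix (rows = real, columns = predicted).
cm <- matrix(c(5559, 172, 90, 710), nrow = 2,
             dimnames = list(real = c("0", "1"), predict = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)            # all correct / all cases
specificity <- cm["0", "0"] / sum(cm["0", ])      # normal parts classified as normal
sensitivity <- cm["1", "1"] / sum(cm["1", ])      # faulty parts classified as faulty
print(c(accuracy = accuracy, specificity = specificity, sensitivity = sensitivity))
```

With faulty parts making up the smaller class, sensitivity is the figure that shows how well the minority class is recovered.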
1.3.5 Model 2: svm(kernel="linear")
- Build the model with the same cost as above; the linear kernel does not use gamma, so gamma=2 has no effect here
- Model creation
m2=svm(factor(y_faulty)
~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
+load_time+highpressure_time,data=train,gamma=2,cost=16,kernel="linear")
summary(m2)
##
## Call:
## svm(formula = factor(y_faulty) ~ fix_time + a_speed + b_speed +
## separation + s_separation + rate_terms + mpa + load_time +
## highpressure_time, data = train, gamma = 2, cost = 16, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 16
## gamma: 2
##
## Number of Support Vectors: 3195
##
## ( 1600 1595 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
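- Predictions and accuracy on the test data
yhat_test2=predict(m2,test)
table2=table(real=test$y_faulty,predict=yhat_test2)
table2
## predict
## real 0 1
## 0 5572 77
## 1 485 397
(table2[1,1]+table2[2,2])/sum(table2) # overall correct classification rate (accuracy)
## [1] 0.9139489
table2[1,1]/(table2[1,1]+table2[1,2]) # correct classification rate for class 0 (normal); the minority (faulty) class rate is table2[2,2]/(table2[2,1]+table2[2,2])
## [1] 0.9863693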
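1.3.6 Model 3: default model svm()
- Build the model with the default settings (radial kernel, cost=1, gamma=1/number of predictors) instead of the tuned values
- Model creation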
m3=svm(factor(y_faulty)
~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
+load_time+highpressure_time,data=train)
summary(m3)
##
## Call:
## svm(formula = factor(y_faulty) ~ fix_time + a_speed + b_speed +
## separation + s_separation + rate_terms + mpa + load_time +
## highpressure_time, data = train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.1111111
##
## Number of Support Vectors: 3375
##
## ( 1701 1674 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
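- Predictions and accuracy on the test data
yhat_test3=predict(m3,test)
table3=table(real=test$y_faulty,predict=yhat_test3)
table3
## predict
## real 0 1
## 0 5592 57
## 1 491 391
(table3[1,1]+table3[2,2])/sum(table3) # overall correct classification rate (accuracy)
## [1] 0.9160925
table3[1,1]/(table3[1,1]+table3[1,2]) # correct classification rate for class 0 (normal); the minority (faulty) class rate is table3[2,2]/(table3[2,1]+table3[2,2])
## [1] 0.9899097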
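1.3.7 ROC, AUC
library(Epi) # assumed source of ROC(); the original does not show which package is loaded
ROC(test=yhat_test,stat=test$y_faulty,plot="ROC",AUC=T,main="SVM")
1.3.8 Predicting new data (1): a single observation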
new.data=data.frame(fix_time=87,a_speed=0.609,b_speed=1.715,separation=242.7,s_separation=657.5,rate_terms=95,mpa=78,load_time=18.1,highpressure_time=82)
predict(m,newdata = new.data)
## 1
## 0
## Levels: 0 1
1.3.9 Predicting new data (2): multiple observations
- Vectors
new.data=data.frame(fix_time=c(87,85.6),a_speed=c(0.609,0.472),b_speed=c(1.715,1.685),separation=c(242.7,243.4),s_separation=c(657.5,657.9),rate_terms=c(95,95),mpa=c(78,28.8),load_time=c(18.1,18.2),highpressure_time=c(82,60))
predict(m,newdata = new.data)
## 1 2
## 0 1
## Levels: 0 1
- Data frame
new.data2=data.frame(fix_time=test$fix_time,a_speed=test$a_speed,b_speed=test$b_speed,separation=test$separation,s_separation=test$s_separation,rate_terms=test$rate_terms,mpa=test$mpa,load_time=test$load_time,highpressure_time=test$highpressure_time)
table(predict(m,newdata = new.data2))
##
## 0 1
## 5731 800
1.4 Support Vector Regression
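- SVM can also be applied when the dependent variable is continuous.
- In that case it produces results similar to regression analysis.
- Whereas classification looks for the line that maximizes the gap between observations of different classes, regression fits a line such that as many observations as possible lie within a band of width epsilon around it.
- It can therefore predict continuous values rather than binary ones.
- The higher the cost, the more closely the fit follows outliers, so the risk of overfitting increases.
- The lower the cost, the less the fit follows outliers, so the risk of underfitting increases.
- The larger the gamma, the narrower the reach of each observation, so the fit becomes more curved and the risk of overfitting increases.
- The smaller the gamma, the wider the reach of each observation, so the fit becomes smoother and the risk of underfitting increases.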
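Support vector regression of the eps-regression type scores errors with an epsilon-insensitive loss: residuals smaller than epsilon cost nothing, and larger ones are charged only for the excess. A minimal base R sketch follows; the values in `y` and `yhat` are made up, `eps_loss` is a hypothetical helper, and epsilon = 0.1 is chosen to match the default of e1071::svm.

```r
# Epsilon-insensitive loss: zero inside the tube, linear outside it.
eps_loss <- function(y, yhat, epsilon = 0.1) pmax(abs(y - yhat) - epsilon, 0)

y    <- c(24.7, 22.5, 24.1)     # made-up actual values
yhat <- c(24.75, 23.0, 24.1)    # made-up predictions
loss <- eps_loss(y, yhat)
print(loss)                     # only the second residual (0.5) exceeds epsilon
```

The cost parameter then scales how heavily these out-of-tube residuals are penalized, which is why a high cost pulls the fit toward outliers.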
1.4.1 Data preparation
autoparts=read.csv("data/autoparts.csv",header = TRUE)
autoparts1=autoparts[autoparts$prod_no=="90784-76001",c(2:11)]
autoparts2=autoparts1[autoparts1$c_thickness<1000,]
autoparts2$y_faulty=ifelse((autoparts2$c_thickness<20)|(autoparts2$c_thickness>32),1,0)
# 0 = normal / 1 = faulty
head(autoparts2,10)
## fix_time a_speed b_speed separation s_separation rate_terms mpa
## 1 85.5 0.611 1.715 242.0 657.6 95 78.2
## 2 86.2 0.606 1.708 244.7 657.1 95 77.9
## 3 86.0 0.609 1.715 242.7 657.5 95 78.0
## 4 86.1 0.610 1.718 241.9 657.3 95 78.2
## 5 86.1 0.603 1.704 242.5 657.3 95 77.9
## 6 86.3 0.606 1.707 244.5 656.9 95 77.9
## 7 86.5 0.606 1.701 243.1 656.9 95 78.2
## 8 86.4 0.607 1.707 243.1 657.3 95 77.5
## 9 86.3 0.604 1.711 245.2 656.9 95 77.8
## 10 86.0 0.608 1.696 248.0 657.3 95 77.5
## load_time highpressure_time c_thickness y_faulty
## 1 18.1 58 24.7 0
## 2 18.2 58 22.5 0
## 3 18.1 82 24.1 0
## 4 18.1 74 25.1 0
## 5 18.2 56 24.5 0
## 6 18.0 78 22.9 0
## 7 18.1 55 24.3 0
## 8 18.1 57 23.9 0
## 9 18.0 50 22.2 0
## 10 18.1 60 19.0 1
t_index=sample(1:nrow(autoparts2),size=nrow(autoparts2)*0.7) # sample row indices (70%)
train=autoparts2[t_index,]
test=autoparts2[-t_index,]
nrow(train)/(nrow(train)+nrow(test))
## [1] 0.6999587
nrow(test)/(nrow(train)+nrow(test))
## [1] 0.3000413
1.4.2 Support vector regression model 1
- Model creation
m=svm(c_thickness~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
+load_time+highpressure_time,data=train,gamma=2,cost=16)
summary(m)
##
## Call:
## svm(formula = c_thickness ~ fix_time + a_speed + b_speed + separation +
## s_separation + rate_terms + mpa + load_time + highpressure_time,
## data = train, gamma = 2, cost = 16)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 16
## gamma: 2
## epsilon: 0.1
##
##
## Number of Support Vectors: 2897
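- Checking the plot
yhat_test=predict(m,test)
plot(x=test$c_thickness,y=yhat_test,main="SVR")
mse=mean((yhat_test-test$c_thickness)^2) # square the residual, not test$c_thickness alone
mse
# The original run used mean((yhat_test-test$c_thickness^2)) and printed -559.6115, which cannot be an MSE; rerun to obtain the corrected value.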
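A mean squared error is an average of squares, so a negative figure such as -559.6115 signals a misplaced parenthesis rather than a poor model. The sketch below contrasts the two expressions on made-up numbers; the helper name `mse_fun` is an assumption.

```r
# Correct MSE: subtract first, then square the residual.
mse_fun <- function(actual, predicted) mean((predicted - actual)^2)

actual    <- c(24, 25, 26)      # made-up actual values
predicted <- c(23, 25, 28)      # made-up predictions

mse_ok    <- mse_fun(actual, predicted)    # (1 + 0 + 4) / 3
mse_wrong <- mean(predicted - actual^2)    # ^ binds tighter than -, so only actual is squared
print(c(correct = mse_ok, misplaced = mse_wrong))
```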
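- Comparison with a multiple linear regression model (1): model creation. Note that the model below is fit on test rather than train, so its predictions on test are in-sample.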
m2=lm(c_thickness~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
+load_time+highpressure_time,data=test)
summary(m2)
##
## Call:
## lm(formula = c_thickness ~ fix_time + a_speed + b_speed + separation +
## s_separation + rate_terms + mpa + load_time + highpressure_time,
## data = test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.0967 -0.5954 -0.0249 0.5677 26.4781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.150e+02 6.359e+00 112.444 < 2e-16 ***
## fix_time 2.688e-02 9.589e-03 2.803 0.00507 **
## a_speed -1.726e+01 8.127e-01 -21.238 < 2e-16 ***
## b_speed 1.872e+00 2.844e-01 6.583 4.97e-11 ***
## separation -7.537e-01 6.858e-03 -109.913 < 2e-16 ***
## s_separation -7.441e-01 6.935e-03 -107.293 < 2e-16 ***
## rate_terms 5.966e-03 6.715e-03 0.888 0.37433
## mpa -1.603e-01 2.770e-03 -57.877 < 2e-16 ***
## load_time -1.276e-01 1.687e-02 -7.566 4.39e-14 ***
## highpressure_time 3.274e-05 2.877e-05 1.138 0.25516
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.86 on 6521 degrees of freedom
## Multiple R-squared: 0.7743, Adjusted R-squared: 0.774
## F-statistic: 2485 on 9 and 6521 DF, p-value: < 2.2e-16
- Comparison with a multiple linear regression model (2): comparing MSE and plots
- SVR tends to predict less accurately at both ends of the plot.
yhat_test2=predict(m2,test)
par(mfrow=c(1,2))
plot(x=test$c_thickness,y=yhat_test,main="SVR")
plot(x=test$c_thickness,y=yhat_test2,main="lm",xlab="actual",ylab="predicted")
mse=mean((yhat_test2-test$c_thickness)^2) # the original notes report that the lm MSE is more than twice that of SVR
- Comparison with a simple linear regression model
regression=read.csv("data/regression.csv",header = TRUE)
plot(regression$x,regression$y,pch=16,xlab="x",ylab="y")
# LM
m1=lm(y~x,data=regression)
p1=predict(m1,newdata=regression)
points(regression$x,p1,col="blue",pch="L")
# SVR
m2=svm(y~x,regression)
p2=predict(m2,newdata=as.matrix(regression))
points(regression$x,p2,col="red",pch="S")
1.4.3 Exercises
# one value
new.data=data.frame(x=10)
predict(m2,newdata=new.data)
## 1
## 25.92276
# two values
new.data=data.frame(x=c(10,20))
predict(m2,newdata=new.data)
## 1 2
## 25.92276 59.96128
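1.5 Logistic Regression
- Used when the dependent variable is binary or categorical.
- p = the conditional probability that y is a success given x
- Odds = p/(1-p)
- Odds ratio = the ratio of the odds at two values of x; for a one-unit increase in x it equals exp(coefficient)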
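The relationship between probability, odds, and the odds ratio can be checked numerically. In the sketch below the coefficients `beta0` and `beta1` are hypothetical values, not estimates from the autoparts data.

```r
# Logistic model: p = 1 / (1 + exp(-(beta0 + beta1 * x)))
logistic <- function(eta) 1 / (1 + exp(-eta))

beta0 <- -1      # hypothetical intercept
beta1 <- 0.5     # hypothetical slope
x <- c(0, 1)     # two values of x, one unit apart

p    <- logistic(beta0 + beta1 * x)   # P(y = 1 | x)
odds <- p / (1 - p)                   # odds of success at each x
odds_ratio <- odds[2] / odds[1]       # change in odds per unit of x
print(c(odds_ratio = odds_ratio, exp_beta1 = exp(beta1)))
```

The two printed values agree, which is why exponentiated logistic coefficients are read as odds ratios.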
1.5.1 Data preparation
autoparts=read.csv("data/autoparts.csv",header = TRUE)
autoparts1=autoparts[autoparts$prod_no=="90784-76001",c(2:11)]
autoparts2=autoparts1[autoparts1$c_thickness<1000,]
autoparts2$y_faulty=ifelse(autoparts2$c_thickness<20|autoparts2$c_thickness>32,1,0)
autoparts2$y_faulty=as.factor(autoparts2$y_faulty) # convert to factor
1.5.2 Building the table
table=table(autoparts2$y_faulty)
table[2]/sum(table[1]+table[2])*100 # defect rate (%)
## 1
## 13.05646
1.5.3 Model creation
m=glm(y_faulty~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
+load_time+highpressure_time,data=autoparts2,family = binomial(logit))
summary(m)
##
## Call:
## glm(formula = y_faulty ~ fix_time + a_speed + b_speed + separation +
## s_separation + rate_terms + mpa + load_time + highpressure_time,
## family = binomial(logit), data = autoparts2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.3641 -0.3738 -0.2150 -0.1184 5.1771
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.511e+02 9.981e+00 -45.197 < 2e-16 ***
## fix_time -3.034e-02 9.623e-03 -3.153 0.001617 **
## a_speed 1.965e+01 9.472e-01 20.743 < 2e-16 ***
## b_speed -1.854e+00 3.970e-01 -4.670 3.01e-06 ***
## separation 5.322e-01 1.121e-02 47.476 < 2e-16 ***
## s_separation 4.957e-01 1.094e-02 45.320 < 2e-16 ***
## rate_terms -2.332e-02 6.817e-03 -3.422 0.000623 ***
## mpa -1.416e-01 3.367e-03 -42.043 < 2e-16 ***
## load_time 1.835e-03 1.508e-02 0.122 0.903124
## highpressure_time 1.787e-04 1.863e-05 9.595 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16867.6 on 21766 degrees of freedom
## Residual deviance: 9993.8 on 21757 degrees of freedom
## AIC: 10014
##
## Number of Fisher Scoring iterations: 6
1.5.4 Splitting the dataset (train:test=7:3)
t_index=sample(1:nrow(autoparts2),size=nrow(autoparts2)*0.7)
train=autoparts2[t_index,]
test=autoparts2[-t_index,]
1.5.5 Creating the train model
m=glm(y_faulty~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
+load_time+highpressure_time,data=train,family = binomial(logit))
summary(m)
##
## Call:
## glm(formula = y_faulty ~ fix_time + a_speed + b_speed + separation +
## s_separation + rate_terms + mpa + load_time + highpressure_time,
## family = binomial(logit), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.4592 -0.3680 -0.2067 -0.1102 4.5351
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.699e+02 1.228e+01 -38.275 < 2e-16 ***
## fix_time -3.090e-02 1.174e-02 -2.631 0.00851 **
## a_speed 2.090e+01 1.144e+00 18.273 < 2e-16 ***
## b_speed -2.151e+00 4.914e-01 -4.377 1.2e-05 ***
## separation 5.543e-01 1.380e-02 40.171 < 2e-16 ***
## s_separation 5.158e-01 1.342e-02 38.432 < 2e-16 ***
## rate_terms -2.502e-02 8.234e-03 -3.038 0.00238 **
## mpa -1.470e-01 4.113e-03 -35.746 < 2e-16 ***
## load_time 3.012e-02 1.877e-02 1.605 0.10854
## highpressure_time 1.868e-04 2.106e-05 8.867 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 11824.5 on 15235 degrees of freedom
## Residual deviance: 6878.8 on 15226 degrees of freedom
## AIC: 6898.8
##
## Number of Fisher Scoring iterations: 6
head(m$fitted.values)
## 3424 9307 25135 17684 1601 4643
## 0.13208652 0.11175087 0.01575692 0.01822328 0.08647116 0.01579807
1.5.6 Setting the cutoff
yhat=ifelse(m$fitted.values>=0.5,1,0)
head(yhat)
## 3424 9307 25135 17684 1601 4643
## 0 0 0 0 0 0
1.5.7 Building the frequency table
table=table(real=train$y_faulty,predict=yhat)
table
## predict
## real 0 1
## 0 13003 239
## 1 1076 918
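The 0.5 cutoff is a choice, not a fixed rule: lowering it catches more of the positive class at the price of more false alarms. The sketch below shows the trade-off on made-up probabilities; `prob`, `truth`, `sens_at`, and `spec_at` are hypothetical names used only for this illustration.

```r
prob  <- c(0.05, 0.20, 0.35, 0.45, 0.60, 0.80)   # made-up fitted probabilities
truth <- c(0,    0,    1,    0,    1,    1)      # made-up true classes

# Share of true positives that are predicted positive at a given cutoff.
sens_at <- function(cutoff) {
  pred <- ifelse(prob >= cutoff, 1, 0)
  sum(pred == 1 & truth == 1) / sum(truth == 1)
}
# Share of true negatives that are predicted negative at a given cutoff.
spec_at <- function(cutoff) {
  pred <- ifelse(prob >= cutoff, 1, 0)
  sum(pred == 0 & truth == 0) / sum(truth == 0)
}

cutoffs <- c(0.5, 0.3)
res <- data.frame(cutoff = cutoffs,
                  sensitivity = sapply(cutoffs, sens_at),
                  specificity = sapply(cutoffs, spec_at))
print(res)
```

Sweeping the cutoff over its whole range and plotting sensitivity against 1 - specificity is exactly what produces an ROC curve.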
1.5.8 Predicting on the test data with the model
yhat_test=predict(m,test,type="response")
head(yhat_test,n=10)
## 1 5 11 13 16 22
## 0.02955854 0.02976147 0.39680302 0.23728760 0.84215663 0.30368969
## 27 28 29 40
## 0.07477993 0.03763883 0.26040496 0.06784719
1.5.9 ROC & AUC
ROC(test=yhat_test,stat = test$y_faulty,plot="ROC",AUC=T,main="logistic regression")
1.5.10 Predicting new data (1): a single observation
new.data=data.frame(fix_time=87,a_speed=0.609,b_speed=1.715,separation=242.7,s_separation=657.5,rate_terms=95,mpa=78,load_time=18.1,highpressure_time=82)
possibilityof1=predict(m,newdata=new.data,type="response")
ifelse(possibilityof1>=0.5,1,0)
## 1
## 0
1.5.11 Predicting new data (2): multiple observations
- Vectors
new.data=data.frame(fix_time=c(87,85.6),a_speed=c(0.609,0.472),b_speed=c(1.715,1.685)
,separation=c(242.7,243.4),s_separation=c(657.5,657.9)
,rate_terms=c(95,95),mpa=c(78,28.8),load_time=c(18.1,18.2)
,highpressure_time=c(82,60))
possibilityof1=predict(m,newdata=new.data,type="response")
ifelse(possibilityof1>=0.5,1,0)
## 1 2
## 0 1
- Data frame
new.data2=data.frame(fix_time=test$fix_time,a_speed=test$a_speed,b_speed=test$b_speed,separation=test$separation,s_separation=test$s_separation,rate_terms=test$rate_terms,mpa=test$mpa,load_time=test$load_time,highpressure_time=test$highpressure_time)
possibilityof1=predict(m,newdata=new.data2,type="response")
head(ifelse(possibilityof1>=0.5,1,0),5)
## 1 2 3 4 5
## 0 0 0 0 1
1.6 Multinomial Logistic Regression
1.6.1 Creating the multiclass dependent variable
autoparts2$g_class=as.factor(ifelse(autoparts2$c_thickness<20,1,ifelse(autoparts2$c_thickness<32,2,3)))
table(autoparts2$g_class)
##
## 1 2 3
## 2141 18914 712
1.6.2 Splitting the dataset (train:test=7:3)
t_index=sample(1:nrow(autoparts2),size=nrow(autoparts2)*0.7) # sample row indices (70%)
train=autoparts2[t_index,]
test=autoparts2[-t_index,]
1.6.3 Creating the train model
library(nnet) # provides multinom()
m=multinom(g_class~fix_time+a_speed+b_speed+separation+s_separation+rate_terms+mpa
+load_time+highpressure_time,data=train)
## # weights: 33 (20 variable)
## initial value 16738.456830
## iter 10 value 5406.520958
## iter 20 value 4705.962846
## iter 30 value 3734.374204
## iter 40 value 3098.445042
## iter 50 value 3051.399959
## iter 60 value 2971.597886
## iter 70 value 2545.453547
## iter 80 value 1961.458206
## iter 90 value 1960.001423
## final value 1959.994245
## converged
summary(m)
## Call:
## multinom(formula = g_class ~ fix_time + a_speed + b_speed + separation +
## s_separation + rate_terms + mpa + load_time + highpressure_time,
## data = train)
##
## Coefficients:
## (Intercept) fix_time a_speed b_speed separation s_separation
## 2 1792.723 0.1887038 -15.13607 4.145293 -1.934316 -1.946824
## 3 1884.479 0.1901040 -25.65636 3.712879 -2.016982 -2.037163
## rate_terms mpa load_time highpressure_time
## 2 0.08317144 -0.6875959 -0.1520803 0.0002742946
## 3 0.06217785 -0.8129465 -0.1678857 0.0003011845
##
## Std. Errors:
## (Intercept) fix_time a_speed b_speed separation
## 2 1.270423e-05 0.04913755 0.001059229 2.507332e-04 0.007366989
## 3 1.859757e-05 0.05019033 0.000202918 2.348695e-05 0.007697641
## s_separation rate_terms mpa load_time highpressure_time
## 2 0.004573189 0.01451830 0.01333213 0.07239175 0.0001547160
## 3 0.004920469 0.02227692 0.01426074 0.07394874 0.0001572982
##
## Residual Deviance: 3919.988
## AIC: 3959.988
head(m$fitted.values)
## 1 2 3
## 23516 6.557632e-05 0.9991734 0.0007609939
## 2540 1.364256e-03 0.9951095 0.0035261950
## 25820 6.068365e-04 0.9986313 0.0007618756
## 2641 2.993092e-05 0.9962637 0.0037063585
## 1970 3.932116e-07 0.5680646 0.4319349761
## 33215 5.137565e-05 0.9980095 0.0019390832
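Each row of fitted values above sums to 1 because a multinomial logit turns one linear predictor per non-reference class into probabilities with a softmax, keeping the first class as the baseline with predictor 0. A minimal sketch follows; the values in `eta` are hypothetical linear predictors, not numbers taken from the fitted model.

```r
# Softmax with a baseline class whose linear predictor is fixed at 0.
softmax_baseline <- function(eta) {
  e <- exp(c(0, eta))      # class 1 is the reference category
  e / sum(e)
}

eta <- c(2.0, -1.0)        # hypothetical linear predictors for classes 2 and 3
probs <- softmax_baseline(eta)
names(probs) <- c("1", "2", "3")
print(probs)
print(which.max(probs))    # predicted class = the one with the highest probability
```

This matches the layout of the coefficient table, which lists rows only for classes 2 and 3.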
1.6.4 Predictions and accuracy on the test data
yhat_test=predict(m,test)
table=table(real=test$g_class,predict=yhat_test)
table
## predict
## real 1 2 3
## 1 491 142 1
## 2 64 5535 99
## 3 0 35 164
(table[1,1]+table[2,2]+table[3,3])/sum(table)
## [1] 0.9477875
1.6.5 Exercise 1
- Fit a logistic model to the occupancy data / compute predictions and accuracy on the test data / compute ROC and AUC
- Loading the data
train=read.csv("data/occupancy_train.csv")
head(train) # check the variables
## date Temperature Humidity Light CO2 HumidityRatio
## 1 2015-02-04 17:51:00 23.18 27.2720 426.0 721.25 0.004792988
## 2 2015-02-04 17:51:59 23.15 27.2675 429.5 714.00 0.004783441
## 3 2015-02-04 17:53:00 23.15 27.2450 426.0 713.50 0.004779464
## 4 2015-02-04 17:54:00 23.15 27.2000 426.0 708.25 0.004771509
## 5 2015-02-04 17:55:00 23.10 27.2000 426.0 704.50 0.004756993
## 6 2015-02-04 17:55:59 23.10 27.2000 419.0 701.00 0.004756993
## Occupancy
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
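- Fitting the logistic model
# Note: family is omitted, so glm() falls back to gaussian (see the dispersion line in the output) and fits a linear probability model; pass family=binomial for a logistic fit.
m=glm(Occupancy~Light+CO2,data=train)
summary(m)
##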
## Call:
## glm(formula = Occupancy ~ Light + CO2, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.50187 0.00353 0.02293 0.02650 0.76880
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.376e-01 4.196e-03 -32.79 <2e-16 ***
## Light 1.632e-03 1.226e-05 133.07 <2e-16 ***
## CO2 2.554e-04 7.598e-06 33.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.02596176)
##
## Null deviance: 1361.88 on 8142 degrees of freedom
## Residual deviance: 211.33 on 8140 degrees of freedom
## AIC: -6617.3
##
## Number of Fisher Scoring iterations: 2
- Predictions and accuracy on the test data
- Setting the cutoff
yhat=ifelse(m$fitted.values>=0.5,1,0)
head(yhat)
## 1 2 3 4 5 6
## 1 1 1 1 1 1
- Building the frequency table
table=table(real=train$Occupancy,predict=yhat)
table
## predict
## real 0 1
## 0 6227 187
## 1 3 1726
- Predicting on the test data with the model
test=read.csv("data/occupancy_test.csv")
yhat_test=predict(m,test,type="response")
head(yhat_test,n=10)
## 1 2 3 4 5 6 7
## 1.0086226 1.0003872 0.9933986 0.8659269 0.8586089 0.9920646 0.9413488
## 8 9 10
## 0.8964930 0.8442296 0.9011894
- ROC & AUC
ROC(test=yhat_test,stat = test$Occupancy,plot="ROC",AUC=T,main="logistic regression")
1.6.6 Exercise 2
- Fit an SVM to the occupancy data (gamma=default, cost=10) / compute predictions and accuracy on the test data / compute ROC and AUC
- Loading the data
train=read.csv("data/occupancy_train.csv")
test=read.csv("data/occupancy_test.csv")
head(train) # check the variables
## date Temperature Humidity Light CO2 HumidityRatio
## 1 2015-02-04 17:51:00 23.18 27.2720 426.0 721.25 0.004792988
## 2 2015-02-04 17:51:59 23.15 27.2675 429.5 714.00 0.004783441
## 3 2015-02-04 17:53:00 23.15 27.2450 426.0 713.50 0.004779464
## 4 2015-02-04 17:54:00 23.15 27.2000 426.0 708.25 0.004771509
## 5 2015-02-04 17:55:00 23.10 27.2000 426.0 704.50 0.004756993
## 6 2015-02-04 17:55:59 23.10 27.2000 419.0 701.00 0.004756993
## Occupancy
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
- Fitting the SVM model
m=svm(as.factor(Occupancy)~Light+CO2,data=train,cost=10,kernel="linear")
summary(m)
##
## Call:
## svm(formula = as.factor(Occupancy) ~ Light + CO2, data = train,
## cost = 10, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 10
## gamma: 0.5
##
## Number of Support Vectors: 443
##
## ( 222 221 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
- Predictions and accuracy on the test data
yhat_test=predict(m,test)
table=table(real=test$Occupancy,predict=yhat_test)
table
## predict
## real 0 1
## 0 1638 55
## 1 3 969
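(table[1,1]+table[2,2])/sum(table) # overall correct classification rate (accuracy)
## [1] 0.9782364
table[1,1]/(table[1,1]+table[1,2]) # correct classification rate for class 0, which is the majority class here
## [1] 0.9675133
- ROC, AUC
ROC(test=yhat_test,stat=test$Occupancy,plot="ROC",AUC=T,main="SVM")
- LASSO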
train=read.csv("data/occupancy_train.csv")
xmat=as.matrix(train[c(2,3,4,5)])
head(xmat)
## Temperature Humidity Light CO2
## [1,] 23.18 27.2720 426.0 721.25
## [2,] 23.15 27.2675 429.5 714.00
## [3,] 23.15 27.2450 426.0 713.50
## [4,] 23.15 27.2000 426.0 708.25
## [5,] 23.10 27.2000 426.0 704.50
## [6,] 23.10 27.2000 419.0 701.00
yvec=train$Occupancy
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
fit.lasso=glmnet(x=xmat,y=yvec,alpha = 1,nlambda = 100) # generate 100 lambda values
fit.lasso.cv=cv.glmnet(x=xmat,y=yvec,nfolds = 10,alpha=1,lambda = fit.lasso$lambda)
plot(fit.lasso.cv)
fit.lasso.param=fit.lasso.cv$lambda.min # store the optimal lambda under another name
fit.lasso.tune=glmnet(x=xmat,y=yvec,alpha=1,lambda = fit.lasso.param) # final LASSO model using the optimal lambda
coef(fit.lasso.tune)
## 5 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 1.0788878970
## Temperature -0.0592480903
## Humidity -0.0020021089
## Light 0.0017578326
## CO2 0.0003239741
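LASSO shrinks coefficients toward zero and can set some of them exactly to zero. The mechanism is easiest to see in the soft-thresholding operator, which is the one-coefficient solution of the LASSO problem under an orthonormal design. The sketch below is an illustration of that operator only, under that simplifying assumption; it is not a reimplementation of glmnet, and the coefficients in `z` are made up.

```r
# Soft-thresholding: move each value toward zero by lambda, clipping at zero.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

z <- c(-0.8, -0.05, 0.02, 0.6)            # made-up unpenalized coefficients
shrunk <- soft_threshold(z, lambda = 0.1)
print(shrunk)                              # small coefficients become exactly 0
```

A larger lambda zeroes out more coefficients, which is why choosing lambda by cross-validation, as done above with cv.glmnet, matters.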