Step 1: Install and load the required packages
Install
install.packages("mlbench")
## package 'mlbench' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Administrator\AppData\Local\Temp\RtmpuKt89c\downloaded_packages
install.packages("e1071") # for Naive Bayes and SVM
## package 'e1071' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Administrator\AppData\Local\Temp\RtmpuKt89c\downloaded_packages
install.packages("C50") # for the decision tree
## package 'C50' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Administrator\AppData\Local\Temp\RtmpuKt89c\downloaded_packages
install.packages("randomForest") # for the random forest
## package 'randomForest' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Administrator\AppData\Local\Temp\RtmpuKt89c\downloaded_packages
install.packages("caret")
## package 'caret' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Administrator\AppData\Local\Temp\RtmpuKt89c\downloaded_packages
install.packages("pROC") # for drawing ROC curves
## package 'pROC' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'pROC'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Program Files\R\R-4.5.0\library\00LOCK\pROC\libs\x64\pROC.dll to C:\Program
## Files\R\R-4.5.0\library\pROC\libs\x64\pROC.dll: Permission denied
## Warning: restored 'pROC'
##
## The downloaded binary packages are in
## C:\Users\Administrator\AppData\Local\Temp\RtmpuKt89c\downloaded_packages
install.packages("gmodels")
## package 'gmodels' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Administrator\AppData\Local\Temp\RtmpuKt89c\downloaded_packages
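The `cannot remove prior installation of package 'pROC'` warning above often appears on Windows when a package is reinstalled while it is already loaded in a running R session. One way to avoid needless reinstalls is to install only what is missing. The sketch below is an assumption-laden illustration, not part of the original workflow: `missing_packages` is a hypothetical helper name, and the actual install call is left commented out.

```r
# Hypothetical helper: return the subset of `pkgs` that is not installed.
# system.file(package = p) returns "" when package p cannot be found.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs,
                      function(p) nzchar(system.file(package = p)),
                      logical(1))
  pkgs[!installed]
}

pkgs <- c("mlbench", "e1071", "C50", "randomForest",
          "caret", "pROC", "gmodels")
to_install <- missing_packages(pkgs)
print(to_install)

# Uncomment to install only the missing packages:
# if (length(to_install) > 0) install.packages(to_install)
```

Running this at the top of a fresh session, before any `library()` call, keeps already-installed packages untouched.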
Load
library(mlbench)
library(e1071) # for Naive Bayes and SVM
library(C50) # for the decision tree
library(randomForest) # for the random forest
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
library(pROC) # for drawing ROC curves
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(gmodels)
##
## Attaching package: 'gmodels'
## The following object is masked from 'package:pROC':
##
## ci
Step 2: Load the data
Load the dataset
data(BreastCancer)
Inspect the data
head(BreastCancer)
## Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
## 1 1000025 5 1 1 1 2
## 2 1002945 5 4 4 5 7
## 3 1015425 3 1 1 1 2
## 4 1016277 6 8 8 1 3
## 5 1017023 4 1 1 3 2
## 6 1017122 8 10 10 8 7
## Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses Class
## 1 1 3 1 1 benign
## 2 10 3 2 1 benign
## 3 2 3 1 1 benign
## 4 4 3 7 1 benign
## 5 1 3 1 1 benign
## 6 10 9 7 1 malignant
str(BreastCancer)
## 'data.frame': 699 obs. of 11 variables:
## $ Id : chr "1000025" "1002945" "1015425" "1016277" ...
## $ Cl.thickness : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
## $ Cell.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
## $ Cell.shape : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
## $ Marg.adhesion : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
## $ Epith.c.size : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
## $ Bare.nuclei : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
## $ Bl.cromatin : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
## $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
## $ Class : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...
Q. Which function loaded the data, and what is the object called?
A. The data() function loaded the BreastCancer dataset, and the resulting
object is named BreastCancer.
Step 3: Preprocess the data
(1) Remove the ID column
bc <- BreastCancer[-1]
(2) Convert the factor predictors to numeric (missing values are removed in Step 4)
bc[-10] <- lapply(bc[-10], function(x) as.numeric(as.character(x)))
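The conversion goes through `as.character()` because calling `as.numeric()` directly on a factor returns its internal level codes, not the values printed as labels. A minimal sketch on a toy factor (the vector `f` is made up for illustration and is not taken from BreastCancer):

```r
# Toy factor whose level order differs from the order of the data values.
f <- factor(c("1", "10", "2"), levels = c("1", "2", "10"))

codes  <- as.numeric(f)                # internal level codes
values <- as.numeric(as.character(f))  # labels parsed as numbers

print(codes)   # 1 3 2
print(values)  # 1 10 2
```

The two results agree only when the levels happen to be exactly "1", "2", "3", ... in order, so the two-step conversion is the safer habit.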
Step 4: Split the data
set.seed(123)
train_idx <- sample(nrow(bc), floor(0.7 * nrow(bc)))
bc.train <- bc[train_idx, ]
bc.test <- bc[-train_idx, ]
Remove missing values
bc.train <- na.omit(bc.train)
bc.test <- na.omit(bc.test)
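The split-then-clean pattern above can be checked on a small example. The sketch below uses a made-up data frame `toy` (an assumption for illustration, not the BreastCancer data) to show that the two index sets do not overlap and that `na.omit()` drops only the rows containing `NA`.

```r
set.seed(123)

# Made-up data: 10 rows, exactly one missing value in x.
toy <- data.frame(
  x = c(1, 2, NA, 4, 5, 6, 7, 8, 9, 10),
  y = factor(rep(c("benign", "malignant"), 5))
)

# 70/30 split; floor() makes the integer sample size explicit.
n_train   <- floor(0.7 * nrow(toy))
idx       <- sample(nrow(toy), n_train)
toy.train <- toy[idx, ]
toy.test  <- toy[-idx, ]

# Drop incomplete rows from each part separately.
toy.train.clean <- na.omit(toy.train)
toy.test.clean  <- na.omit(toy.test)

cat("train:", nrow(toy.train), "->", nrow(toy.train.clean), "\n")
cat("test: ", nrow(toy.test),  "->", nrow(toy.test.clean),  "\n")
```

Because rows are dropped after the split, the final train/test proportions can drift slightly from 70/30; calling `na.omit()` before sampling would keep the ratio exact.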
Step 5: Build the models
Naive Bayes
model.nb <- naiveBayes(Class ~ ., data = bc.train)
pred.nb <- predict(model.nb, bc.test)
prob.nb <- predict(model.nb, bc.test, type = "raw")[, "malignant"]
cm.nb <- confusionMatrix(pred.nb, bc.test$Class, positive = "malignant")
roc.nb <- roc(bc.test$Class, prob.nb)
## Setting levels: control = benign, case = malignant
## Setting direction: controls < cases
Decision Tree
model.dt <- C5.0(Class ~ ., data = bc.train)
pred.dt <- predict(model.dt, bc.test, type = "class")
prob.dt <- predict(model.dt, bc.test, type = "prob")[, "malignant"]
cm.dt <- confusionMatrix(pred.dt, bc.test$Class, positive = "malignant")
roc.dt <- roc(bc.test$Class, prob.dt)
## Setting levels: control = benign, case = malignant
## Setting direction: controls < cases
Random Forest
model.rf <- randomForest(Class ~ ., data = bc.train)
pred.rf <- predict(model.rf, bc.test)
prob.rf <- predict(model.rf, bc.test, type = "prob")[, "malignant"]
cm.rf <- confusionMatrix(pred.rf, bc.test$Class, positive = "malignant")
roc.rf <- roc(bc.test$Class, prob.rf)
## Setting levels: control = benign, case = malignant
## Setting direction: controls < cases
SVM
model.svm <- svm(Class ~ ., data = bc.train, probability = TRUE)
pred.svm <- predict(model.svm, bc.test)
prob.svm <- attr(predict(model.svm, bc.test, probability = TRUE), "probabilities")[, "malignant"]
cm.svm <- confusionMatrix(pred.svm, bc.test$Class, positive = "malignant")
roc.svm <- roc(bc.test$Class, prob.svm)
## Setting levels: control = benign, case = malignant
## Setting direction: controls < cases
Step 6: Build a performance comparison table
results <- data.frame(
Model = c("Naive Bayes", "Decision Tree", "Random Forest", "SVM"),
Accuracy = c(cm.nb$overall["Accuracy"],
cm.dt$overall["Accuracy"],
cm.rf$overall["Accuracy"],
cm.svm$overall["Accuracy"]),
Sensitivity = c(cm.nb$byClass["Sensitivity"],
cm.dt$byClass["Sensitivity"],
cm.rf$byClass["Sensitivity"],
cm.svm$byClass["Sensitivity"]),
Specificity = c(cm.nb$byClass["Specificity"],
cm.dt$byClass["Specificity"],
cm.rf$byClass["Specificity"],
cm.svm$byClass["Specificity"]),
AUC = c(auc(roc.nb), auc(roc.dt), auc(roc.rf), auc(roc.svm))
)
print(results)
## Model Accuracy Sensitivity Specificity AUC
## 1 Naive Bayes 0.9420290 1.0000000 0.9189189 0.9801878
## 2 Decision Tree 0.9371981 0.9152542 0.9459459 0.9694228
## 3 Random Forest 0.9710145 1.0000000 0.9594595 0.9853413
## 4 SVM 0.9613527 0.9830508 0.9527027 0.9843106
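To make the columns concrete, the sketch below recomputes the Random Forest row by hand. The four counts are an inference, not output from the original run: they were back-calculated from the reported rates on the assumption that the cleaned test set holds 59 malignant and 148 benign cases (207 rows), which is the only split consistent with all four rows of the table.

```r
# Inferred confusion-matrix counts for Random Forest (assumption, see above).
# "malignant" is the positive class.
tp <- 59   # malignant predicted malignant
fn <- 0    # malignant predicted benign
tn <- 142  # benign predicted benign
fp <- 6    # benign predicted malignant

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # true positive rate
specificity <- tn / (tn + fp)   # true negative rate

print(round(c(Accuracy = accuracy,
              Sensitivity = sensitivity,
              Specificity = specificity), 7))
```

Sensitivity is the share of malignant cases that were caught, which is usually the metric to watch in a screening setting like this one.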
Step 7: Plot the ROC curves
plot(roc.nb, col = "red", main = "ROC Curves")
lines(roc.dt, col = "blue")
lines(roc.rf, col = "green")
lines(roc.svm, col = "purple")
legend("bottomright", legend = c("Naive Bayes", "Decision Tree", "Random Forest", "SVM"),
col = c("red", "blue", "green", "purple"), lty = 1)

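Q. Which model has the largest area under its curve in the plot?
A. Comparing AUC (area under the ROC curve) values, Random Forest is the
highest at about 0.985, so its curve encloses the largest area. The margin
over SVM (about 0.984) and Naive Bayes (about 0.980) is small.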
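AUC can also be read as the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign one. The sketch below computes it from ranks (the Mann-Whitney formulation) on made-up scores; `auc_rank` is a hypothetical helper written for illustration and is not how the values in the table were produced.

```r
# Hypothetical helper: AUC via the rank-sum (Mann-Whitney) formulation.
# is_case: logical vector, TRUE for the positive class
# score:   numeric vector, higher means "more likely positive"
auc_rank <- function(is_case, score) {
  r  <- rank(score)          # ties receive average ranks
  n1 <- sum(is_case)         # number of positives
  n0 <- sum(!is_case)        # number of negatives
  (sum(r[is_case]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Made-up example: two benign and two malignant cases.
is_case <- c(FALSE, FALSE, TRUE, TRUE)
score   <- c(0.10, 0.40, 0.35, 0.80)

print(auc_rank(is_case, score))  # 0.75
```

Here three of the four (malignant, benign) pairs are ordered correctly, which gives 3/4 = 0.75.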