Step 1: Install and load the required packages

Install

install.packages("mlbench")
## package 'mlbench' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Administrator\AppData\Local\Temp\RtmpuKt89c\downloaded_packages
install.packages("e1071")     # for Naive Bayes and SVM
install.packages("C50")     # for the decision tree
install.packages("randomForest")     # for the random forest
install.packages("caret")
install.packages("pROC")     # for plotting ROC curves
## package 'pROC' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'pROC'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Program Files\R\R-4.5.0\library\00LOCK\pROC\libs\x64\pROC.dll to C:\Program
## Files\R\R-4.5.0\library\pROC\libs\x64\pROC.dll: Permission denied
## Warning: restored 'pROC'
install.packages("gmodels")

Load

library(mlbench)
library(e1071)     # for Naive Bayes and SVM
library(C50)     # for the decision tree
library(randomForest)     # for the random forest
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
library(pROC)     # for plotting ROC curves
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(gmodels)
## 
## Attaching package: 'gmodels'
## The following object is masked from 'package:pROC':
## 
##     ci
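
The "cannot remove prior installation" warning for pROC in the install log typically appears on Windows when the package's DLL is still in use by a running R session. One way to sidestep it is to install only what is missing. The sketch below is a minimal illustration, not part of the original workflow: `install_if_missing` and its `installer` parameter are hypothetical names introduced here.

```r
# Hypothetical helper: install only the packages that are not yet available.
# `installer` defaults to install.packages; it is a parameter purely so the
# helper can be exercised without touching the network.
install_if_missing <- function(pkgs, installer = install.packages) {
  installed <- rownames(installed.packages())
  to_install <- setdiff(pkgs, installed)
  if (length(to_install) > 0) {
    installer(to_install)
  }
  invisible(to_install)
}

pkgs <- c("mlbench", "e1071", "C50", "randomForest", "caret", "pROC", "gmodels")

# Uncomment to install whatever is missing, then attach everything:
# install_if_missing(pkgs)
# invisible(lapply(pkgs, library, character.only = TRUE))
```

Because packages that are already installed are skipped, rerunning the notebook does not try to overwrite a package that is currently loaded.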

Step 2: Load the data

Load the dataset

data(BreastCancer)

Inspect the data

head(BreastCancer)
##        Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
## 1 1000025            5         1          1             1            2
## 2 1002945            5         4          4             5            7
## 3 1015425            3         1          1             1            2
## 4 1016277            6         8          8             1            3
## 5 1017023            4         1          1             3            2
## 6 1017122            8        10         10             8            7
##   Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses     Class
## 1           1           3               1       1    benign
## 2          10           3               2       1    benign
## 3           2           3               1       1    benign
## 4           4           3               7       1    benign
## 5           1           3               1       1    benign
## 6          10           9               7       1 malignant
str(BreastCancer)
## 'data.frame':    699 obs. of  11 variables:
##  $ Id             : chr  "1000025" "1002945" "1015425" "1016277" ...
##  $ Cl.thickness   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
##  $ Cell.size      : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
##  $ Cell.shape     : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
##  $ Marg.adhesion  : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
##  $ Epith.c.size   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
##  $ Bare.nuclei    : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
##  $ Bl.cromatin    : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
##  $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
##  $ Mitoses        : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
##  $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...

Q. Which function loaded the data, and what is the name of the resulting object?

A. The data() function loaded the BreastCancer dataset from the mlbench package, and the object is named BreastCancer.

Step 3: Preprocess the data

(1) Remove the ID column

bc <- BreastCancer[-1]

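(2) Convert factor columns to numeric

# Convert the nine predictors from factor to numeric via character,
# so the recorded values (not the level codes) are kept; Class stays a factor.
# Missing values are left as NA here and removed after the split in Step 4.
bc <- cbind(lapply(bc[-10], function(x) as.numeric(as.character(x))), bc[10])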
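
The detour through as.character() matters because calling as.numeric() directly on a factor returns the internal level codes rather than the printed values. A minimal sketch with a made-up factor (the vector `f` is hypothetical and not part of the dataset):

```r
# Hypothetical factor whose labels are numbers stored as text
f <- factor(c("10", "2", "5"))

# Direct conversion returns level codes (levels sort as "10", "2", "5")
codes <- as.numeric(f)

# Converting via character recovers the printed values
values <- as.numeric(as.character(f))

print(codes)   # 1 2 3
print(values)  # 10 2 5
```

In the str() output above, Mitoses has only 9 levels while the other predictors have 10, so its level codes may not all match the recorded values. That is the situation this conversion guards against.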

Step 4: Split the data

set.seed(123)
# Randomly pick 70% of the row indices for training; the rest form the test set
train_idx <- sample(nrow(bc), 0.7 * nrow(bc))
bc.train <- bc[train_idx, ]
bc.test <- bc[-train_idx, ]

Remove missing values

bc.train <- na.omit(bc.train)
bc.test <- na.omit(bc.test)
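
Before dropping rows, it can help to see how many values are missing and how many rows each partition keeps. The sketch below repeats the split-then-omit pattern on a small made-up data frame; `toy`, its column `x`, and the derived objects are hypothetical stand-ins for `bc`, not results from the real data.

```r
set.seed(123)

# Hypothetical toy data: one numeric predictor with two gaps, plus a class label
toy <- data.frame(
  x = c(1, NA, 3, 4, 5, NA, 7, 8, 9, 10),
  Class = factor(rep(c("benign", "malignant"), times = 5))
)

# Count missing values per column before deciding how to handle them
na_counts <- colSums(is.na(toy))
print(na_counts)

# Same pattern as above: 70/30 split first, then drop incomplete rows
idx <- sample(nrow(toy), floor(0.7 * nrow(toy)))
toy.train <- na.omit(toy[idx, ])
toy.test <- na.omit(toy[-idx, ])

# Rows kept in each partition after removing incomplete cases
print(c(train = nrow(toy.train), test = nrow(toy.test)))
```

Because na.omit() runs after the split, the final train and test sizes depend on where the incomplete rows happened to land, so the realized ratio can drift slightly from 70/30.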

Step 5: Build the models

Naive Bayes

model.nb <- naiveBayes(Class ~ ., data = bc.train)
pred.nb <- predict(model.nb, bc.test)
prob.nb <- predict(model.nb, bc.test, type = "raw")[,2]
cm.nb <- confusionMatrix(pred.nb, bc.test$Class, positive = "malignant")
roc.nb <- roc(bc.test$Class, prob.nb)
## Setting levels: control = benign, case = malignant
## Setting direction: controls < cases

Decision Tree

model.dt <- C5.0(Class ~ ., data = bc.train)
pred.dt <- predict(model.dt, bc.test, type = "class")
prob.dt <- predict(model.dt, bc.test, type = "prob")[,2]
cm.dt <- confusionMatrix(pred.dt, bc.test$Class, positive = "malignant")
roc.dt <- roc(bc.test$Class, prob.dt)
## Setting levels: control = benign, case = malignant
## Setting direction: controls < cases

Random Forest

model.rf <- randomForest(Class ~ ., data = bc.train)
pred.rf <- predict(model.rf, bc.test)
prob.rf <- predict(model.rf, bc.test, type = "prob")[,2]
cm.rf <- confusionMatrix(pred.rf, bc.test$Class, positive = "malignant")
roc.rf <- roc(bc.test$Class, prob.rf)
## Setting levels: control = benign, case = malignant
## Setting direction: controls < cases

SVM

model.svm <- svm(Class ~ ., data = bc.train, probability = TRUE)
pred.svm <- predict(model.svm, bc.test)
prob.svm <- attr(predict(model.svm, bc.test, probability = TRUE), "probabilities")[, "malignant"]
cm.svm <- confusionMatrix(pred.svm, bc.test$Class, positive = "malignant")
roc.svm <- roc(bc.test$Class, prob.svm)
## Setting levels: control = benign, case = malignant
## Setting direction: controls < cases

Step 6: Build a performance comparison table

results <- data.frame(
  Model = c("Naive Bayes", "Decision Tree", "Random Forest", "SVM"),
  Accuracy = c(cm.nb$overall["Accuracy"],
               cm.dt$overall["Accuracy"],
               cm.rf$overall["Accuracy"],
               cm.svm$overall["Accuracy"]),
  Sensitivity = c(cm.nb$byClass["Sensitivity"],
                  cm.dt$byClass["Sensitivity"],
                  cm.rf$byClass["Sensitivity"],
                  cm.svm$byClass["Sensitivity"]),
  Specificity = c(cm.nb$byClass["Specificity"],
                  cm.dt$byClass["Specificity"],
                  cm.rf$byClass["Specificity"],
                  cm.svm$byClass["Specificity"]),
  AUC = c(auc(roc.nb), auc(roc.dt), auc(roc.rf), auc(roc.svm))
)
print(results)
##           Model  Accuracy Sensitivity Specificity       AUC
## 1   Naive Bayes 0.9420290   1.0000000   0.9189189 0.9801878
## 2 Decision Tree 0.9371981   0.9152542   0.9459459 0.9694228
## 3 Random Forest 0.9710145   1.0000000   0.9594595 0.9853413
## 4           SVM 0.9613527   0.9830508   0.9527027 0.9843106
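
To see where these columns come from, the Naive Bayes row can be reproduced by hand from confusion-matrix counts. The counts below are not printed anywhere in this document. They are inferred from the reported rates, assuming the test set holds 59 malignant and 148 benign cases, which is the split consistent with every row of the table. "malignant" is the positive class, as in the confusionMatrix() calls.

```r
# Counts inferred from the reported Naive Bayes rates (an assumption, see above)
tp <- 59   # malignant predicted as malignant
fn <- 0    # malignant predicted as benign
fp <- 12   # benign predicted as malignant
tn <- 136  # benign predicted as benign

accuracy    <- (tp + tn) / (tp + fn + fp + tn)  # share of all cases classified correctly
sensitivity <- tp / (tp + fn)                   # share of malignant cases detected
specificity <- tn / (tn + fp)                   # share of benign cases cleared

metrics <- round(c(Accuracy = accuracy,
                   Sensitivity = sensitivity,
                   Specificity = specificity), 7)
print(metrics)
# Accuracy 0.9420290, Sensitivity 1.0000000, Specificity 0.9189189
```

Read this way, Naive Bayes and Random Forest miss no malignant case in this test set, while Decision Tree has the lowest sensitivity of the four models.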

Step 7: Plot the ROC curves

plot(roc.nb, col = "red", main = "ROC Curves")
lines(roc.dt, col = "blue")
lines(roc.rf, col = "green")
lines(roc.svm, col = "purple")
legend("bottomright", legend = c("Naive Bayes", "Decision Tree", "Random Forest", "SVM"),
       col = c("red", "blue", "green", "purple"), lty = 1)

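Q. Which model has the largest area under its curve in the plot?

A. Comparing the AUC (area under the ROC curve) values from the table in Step 6, Random Forest is highest at 0.9853, followed by SVM (0.9843), Naive Bayes (0.9802), and Decision Tree (0.9694), so the Random Forest curve encloses the largest area.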
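
AUC also has a direct interpretation: it is the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign case, with ties counted as one half. The base R sketch below computes it that way on made-up scores. `auc_by_pairs` and the toy vectors are hypothetical, and this is an illustration of the idea rather than the routine pROC uses internally.

```r
# Hypothetical helper: AUC as the share of (positive, negative) pairs in which
# the positive case scores higher, counting ties as half a win.
auc_by_pairs <- function(scores, labels, positive) {
  pos <- scores[labels == positive]
  neg <- scores[labels != positive]
  wins <- outer(pos, neg, ">")
  ties <- outer(pos, neg, "==")
  (sum(wins) + 0.5 * sum(ties)) / (length(pos) * length(neg))
}

# Made-up labels and predicted probabilities of "malignant"
toy_labels <- c("benign", "benign", "malignant", "malignant")
toy_scores <- c(0.10, 0.40, 0.35, 0.80)

toy_auc <- auc_by_pairs(toy_scores, toy_labels, positive = "malignant")
print(toy_auc)  # 0.75: three of the four pairs are ordered correctly
```

An AUC of 1 would mean every malignant case outscores every benign case, and 0.5 corresponds to random ordering.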