Machine Learning Summary Assignment

Step 1: Install and load the required packages

Install

install.packages("mlbench")

## package 'mlbench' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Administrator\AppData\Local\Temp\RtmpuKt89c\downloaded_packages

install.packages("e1071") # for Naive Bayes and SVM

## package 'e1071' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Administrator\AppData\Local\Temp\RtmpuKt89c\downloaded_packages

install.packages("C50")     # for the decision tree

## package 'C50' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Administrator\AppData\Local\Temp\RtmpuKt89c\downloaded_packages

install.packages("randomForest")     # for the random forest

## package 'randomForest' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Administrator\AppData\Local\Temp\RtmpuKt89c\downloaded_packages

install.packages("caret")

## package 'caret' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Administrator\AppData\Local\Temp\RtmpuKt89c\downloaded_packages

install.packages("pROC")     # for plotting ROC curves

## package 'pROC' successfully unpacked and MD5 sums checked

## Warning: cannot remove prior installation of package 'pROC'

## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Program Files\R\R-4.5.0\library\00LOCK\pROC\libs\x64\pROC.dll to C:\Program
## Files\R\R-4.5.0\library\pROC\libs\x64\pROC.dll: Permission denied

## Warning: restored 'pROC'

## 
## The downloaded binary packages are in
##  C:\Users\Administrator\AppData\Local\Temp\RtmpuKt89c\downloaded_packages

install.packages("gmodels")

## package 'gmodels' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Administrator\AppData\Local\Temp\RtmpuKt89c\downloaded_packages
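Calling install.packages() for every package on each run re-downloads packages that are already present, and the pROC warnings above show that reinstalling over an existing copy can fail. One option is to install only what is missing. The following is a minimal sketch, not part of the original assignment: missing_packages() is a hypothetical helper name, and the interactive() guard is an assumption added so that a non-interactive run never triggers a download.

```r
# Packages used in this assignment
pkgs <- c("mlbench", "e1071", "C50", "randomForest", "caret", "pROC", "gmodels")

# Hypothetical helper: return the names of packages that are not installed.
# requireNamespace() is base R and returns FALSE when a package is absent.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)

# Assumption: only install from an interactive session
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

With this pattern the library() calls stay as they are; only the install step becomes conditional.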

Load

library(mlbench)
library(e1071)     # for Naive Bayes and SVM
library(C50)     # for the decision tree
library(randomForest)     # for the random forest

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

library(caret)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:randomForest':
## 
##     margin

## Loading required package: lattice

library(pROC)     # for plotting ROC curves

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(gmodels)

## 
## Attaching package: 'gmodels'

## The following object is masked from 'package:pROC':
## 
##     ci

Step 2: Load the data

Load the dataset

data(BreastCancer)

Inspect the data

head(BreastCancer)

##        Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
## 1 1000025            5         1          1             1            2
## 2 1002945            5         4          4             5            7
## 3 1015425            3         1          1             1            2
## 4 1016277            6         8          8             1            3
## 5 1017023            4         1          1             3            2
## 6 1017122            8        10         10             8            7
##   Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses     Class
## 1           1           3               1       1    benign
## 2          10           3               2       1    benign
## 3           2           3               1       1    benign
## 4           4           3               7       1    benign
## 5           1           3               1       1    benign
## 6          10           9               7       1 malignant

str(BreastCancer)

## 'data.frame':    699 obs. of  11 variables:
##  $ Id             : chr  "1000025" "1002945" "1015425" "1016277" ...
##  $ Cl.thickness   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 5 5 3 6 4 8 1 2 2 4 ...
##  $ Cell.size      : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 1 1 2 ...
##  $ Cell.shape     : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 4 1 8 1 10 1 2 1 1 ...
##  $ Marg.adhesion  : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 5 1 1 3 8 1 1 1 1 ...
##  $ Epith.c.size   : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 2 7 2 3 2 7 2 2 2 2 ...
##  $ Bare.nuclei    : Factor w/ 10 levels "1","2","3","4",..: 1 10 2 4 1 10 10 1 1 1 ...
##  $ Bl.cromatin    : Factor w/ 10 levels "1","2","3","4",..: 3 3 3 3 3 9 3 3 1 2 ...
##  $ Normal.nucleoli: Factor w/ 10 levels "1","2","3","4",..: 1 2 1 7 1 7 1 1 1 1 ...
##  $ Mitoses        : Factor w/ 9 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 5 1 ...
##  $ Class          : Factor w/ 2 levels "benign","malignant": 1 1 1 1 1 2 1 1 1 1 ...

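Q. Which function was used to load the data, and what is the object's name?

A. The data() function loaded the BreastCancer dataset from the mlbench package, and the resulting object is named BreastCancer.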

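Step 3: Preprocess the data

(1) Remove the ID column

bc <- BreastCancer[-1]

(2) Convert the factor columns to numeric

# Convert the nine predictor columns (factors) to numeric. Going through
# as.character() uses the level labels rather than the internal level codes.
# Any missing values stay NA here and are removed in step 4.
bc[-10] <- lapply(bc[-10], function(x) as.numeric(as.character(x)))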
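The conversion goes through as.character() because as.numeric() applied directly to a factor returns the internal level codes, not the numbers spelled out by the labels. A small sketch on a toy factor (the values are made up for illustration and are not taken from BreastCancer):

```r
# Toy factor whose labels are numbers stored as text
f <- factor(c("10", "2", "5", "10"))

codes  <- as.numeric(f)                 # internal level codes
values <- as.numeric(as.character(f))   # the numbers the labels spell out

print(levels(f))   # levels are sorted as text, so "10" comes before "2"
print(codes)
print(values)
```

The two results differ whenever the level order does not match the numeric order of the labels, which is why the two-step conversion is the safer habit.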

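Step 4: Split the data

set.seed(123)  # fix the seed so the split is reproducible
# Sample 70% of the row indices for training; the remaining rows form the test set
train_idx <- sample(nrow(bc), floor(0.7 * nrow(bc)))
bc.train <- bc[train_idx, ]
bc.test <- bc[-train_idx, ]

Remove missing values

# Drop rows that contain NA from each subset
bc.train <- na.omit(bc.train)
bc.test <- na.omit(bc.test)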
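Because the rows with missing values are dropped after the split, the final training and test sets are slightly smaller than the nominal 70/30 sizes. The sketch below reproduces the same steps on a toy data frame (hypothetical values standing in for bc) and checks that every row ends up in exactly one place: training, test, or dropped.

```r
# Toy data frame standing in for bc (hypothetical values)
set.seed(123)
toy <- data.frame(
  x = c(1, 2, NA, 4, 5, 6, 7, NA, 9, 10),
  y = factor(rep(c("benign", "malignant"), 5))
)

# Same pattern as above: sample 70% of the row indices, then drop NA rows
idx <- sample(nrow(toy), floor(0.7 * nrow(toy)))
toy.train <- na.omit(toy[idx, ])
toy.test  <- na.omit(toy[-idx, ])

# Rows kept plus rows dropped for NA add up to the original row count
n_dropped <- sum(!complete.cases(toy))
nrow(toy.train) + nrow(toy.test) + n_dropped == nrow(toy)
```

An alternative, if an exact 70/30 split of complete rows matters, would be to call na.omit() on the full data frame before sampling; that changes which rows are drawn, so it would not reproduce the numbers reported in this assignment.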

Step 5: Build the models

Naive Bayes

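model.nb <- naiveBayes(Class ~ ., data = bc.train)
pred.nb <- predict(model.nb, bc.test)                     # predicted class labels
# Class probabilities; column 2 is "malignant" (the second factor level)
prob.nb <- predict(model.nb, bc.test, type = "raw")[,2]
# Confusion matrix with "malignant" treated as the positive class
cm.nb <- confusionMatrix(pred.nb, bc.test$Class, positive = "malignant")
roc.nb <- roc(bc.test$Class, prob.nb)                     # ROC curve from the probabilities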

## Setting levels: control = benign, case = malignant

## Setting direction: controls < cases

Decision Tree

# method = "class" is an rpart() argument that C5.0() does not use, so it is omitted
model.dt <- C5.0(Class ~ ., data = bc.train)
pred.dt <- predict(model.dt, bc.test, type = "class")
prob.dt <- predict(model.dt, bc.test, type = "prob")[,2]
cm.dt <- confusionMatrix(pred.dt, bc.test$Class, positive = "malignant")
roc.dt <- roc(bc.test$Class, prob.dt)

## Setting levels: control = benign, case = malignant

## Setting direction: controls < cases

Random Forest

model.rf <- randomForest(Class ~ ., data = bc.train)
pred.rf <- predict(model.rf, bc.test)
prob.rf <- predict(model.rf, bc.test, type = "prob")[,2]
cm.rf <- confusionMatrix(pred.rf, bc.test$Class, positive = "malignant")
roc.rf <- roc(bc.test$Class, prob.rf)

## Setting levels: control = benign, case = malignant

## Setting direction: controls < cases

SVM

model.svm <- svm(Class ~ ., data = bc.train, probability = TRUE)
pred.svm <- predict(model.svm, bc.test)
prob.svm <- attr(predict(model.svm, bc.test, probability = TRUE), "probabilities")[, "malignant"]
cm.svm <- confusionMatrix(pred.svm, bc.test$Class, positive = "malignant")
roc.svm <- roc(bc.test$Class, prob.svm)

## Setting levels: control = benign, case = malignant

## Setting direction: controls < cases

Step 6: Build the performance comparison table

results <- data.frame(
  Model = c("Naive Bayes", "Decision Tree", "Random Forest", "SVM"),
  Accuracy = c(cm.nb$overall["Accuracy"],
               cm.dt$overall["Accuracy"],
               cm.rf$overall["Accuracy"],
               cm.svm$overall["Accuracy"]),
  Sensitivity = c(cm.nb$byClass["Sensitivity"],
                  cm.dt$byClass["Sensitivity"],
                  cm.rf$byClass["Sensitivity"],
                  cm.svm$byClass["Sensitivity"]),
  Specificity = c(cm.nb$byClass["Specificity"],
                  cm.dt$byClass["Specificity"],
                  cm.rf$byClass["Specificity"],
                  cm.svm$byClass["Specificity"]),
  AUC = c(auc(roc.nb), auc(roc.dt), auc(roc.rf), auc(roc.svm))
)
print(results)

##           Model  Accuracy Sensitivity Specificity       AUC
## 1   Naive Bayes 0.9420290   1.0000000   0.9189189 0.9801878
## 2 Decision Tree 0.9371981   0.9152542   0.9459459 0.9694228
## 3 Random Forest 0.9710145   1.0000000   0.9594595 0.9853413
## 4           SVM 0.9613527   0.9830508   0.9527027 0.9843106
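To make the columns of this table concrete, the sketch below computes accuracy, sensitivity, and specificity by hand from four confusion-matrix counts. The counts are an assumption: they were chosen to be consistent with the Naive Bayes row above and were not read from the actual confusionMatrix() output.

```r
# Illustrative counts (assumed, consistent with the Naive Bayes row above),
# with "malignant" as the positive class
tp <- 59    # malignant cases predicted malignant
fn <- 0     # malignant cases predicted benign
tn <- 136   # benign cases predicted benign
fp <- 12    # benign cases predicted malignant

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
sensitivity <- tp / (tp + fn)                   # share of malignant cases that are caught
specificity <- tn / (tn + fp)                   # share of benign cases that are cleared

round(c(Accuracy = accuracy, Sensitivity = sensitivity, Specificity = specificity), 7)
```

Read this way, a sensitivity of 1 means no malignant case in the test set was missed, while the lower specificity reflects benign cases flagged as malignant.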

Step 7: Plot the ROC curves

plot(roc.nb, col = "red", main = "ROC Curves")
lines(roc.dt, col = "blue")
lines(roc.rf, col = "green")
lines(roc.svm, col = "purple")
legend("bottomright", legend = c("Naive Bayes", "Decision Tree", "Random Forest", "SVM"),
       col = c("red", "blue", "green", "purple"), lty = 1)
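The area under each of these curves can also be understood without plotting: AUC equals the share of (malignant, benign) pairs in which the malignant case receives the higher predicted probability, with ties counted as half. The sketch below illustrates that definition in base R; auc_by_pairs() is a hypothetical helper and the four scores are toy values, so this is not how pROC computes auc() internally.

```r
# Hypothetical helper: AUC as the share of (case, control) pairs in which
# the case gets the higher score; ties count as half a win
auc_by_pairs <- function(labels, scores, case = "malignant") {
  case_scores    <- scores[labels == case]
  control_scores <- scores[labels != case]
  # Matrix of score differences for every (case, control) pair
  diffs <- outer(case_scores, control_scores, FUN = "-")
  mean((diffs > 0) + 0.5 * (diffs == 0))
}

# Toy example: 2 benign and 2 malignant cases with made-up scores
labels <- c("benign", "benign", "malignant", "malignant")
scores <- c(0.10, 0.40, 0.35, 0.80)

auc_by_pairs(labels, scores)  # 3 of the 4 pairs are ordered correctly
```

A model whose curve hugs the top-left corner orders almost every such pair correctly, which is why its AUC is close to 1.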

20220124

2025-05-10

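Q. Which model has the largest area under its curve in the graph?

A. Comparing the AUC (area under the ROC curve) values, Random Forest is highest at 0.9853, narrowly ahead of SVM (0.9843), followed by Naive Bayes (0.9802) and Decision Tree (0.9694). Random Forest therefore has the largest area under its curve.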
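The same comparison can be made in code instead of by eye. The sketch below re-enters the AUC values printed in the step 6 table and sorts them; results_reported is a hypothetical name used so that the original results object is left untouched.

```r
# AUC values copied from the comparison table printed in step 6
results_reported <- data.frame(
  Model = c("Naive Bayes", "Decision Tree", "Random Forest", "SVM"),
  AUC   = c(0.9801878, 0.9694228, 0.9853413, 0.9843106),
  stringsAsFactors = FALSE
)

# Sort from highest to lowest AUC and take the top row
ranked <- results_reported[order(results_reported$AUC, decreasing = TRUE), ]
best <- ranked$Model[1]

print(ranked)
print(best)
```

The gap between Random Forest and SVM is about 0.001 on a single train/test split, so the ranking of those two models could plausibly change with a different seed.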