Classifying iris with randomForest

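This time we classify the iris data with a random forest. A random forest is an ensemble method: a collection of many decision trees, each built a little differently, whose predictions are combined into one answer.
Below, we use a random forest to classify the Species of the iris flowers.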
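To make the "many trees, one answer" idea concrete, here is a minimal sketch that asks a small forest for each tree's individual prediction. The object names (rf_demo, flowers, vote_counts), the seed, and the choice of 25 trees are illustrative assumptions, not part of the original analysis; predict.all is an argument of the package's predict method.

```r
library(randomForest)

set.seed(1)  # assumption: arbitrary seed, only for repeatability
# A deliberately small forest (25 trees) so the individual votes are easy to inspect
rf_demo <- randomForest(Species ~ ., data = iris, ntree = 25)

# Three example flowers, one of each species (rows 1, 51 and 101 of iris)
flowers <- iris[c(1, 51, 101), ]

# predict.all = TRUE also returns every tree's own prediction
p <- predict(rf_demo, newdata = flowers, predict.all = TRUE)

# One row per flower, one column per tree
dim(p$individual)

# Count the votes each flower received, per class (classes in rows, flowers in columns)
vote_counts <- apply(p$individual, 1, function(v) {
  table(factor(v, levels = levels(iris$Species)))
})
vote_counts

# The forest's final answer is the class with the most votes
p$aggregate
```

Each column of vote_counts sums to the number of trees, and the aggregated prediction is the class that collected the most votes.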

Loading the package

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

Exploring the data

First, let's get to know the data: the type of each column, how the values are distributed, and so on.

df <- iris

str(df)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

summary(df)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

sum(is.na(df))

## [1] 0

plot(df)

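We briefly checked the column types and summary statistics and confirmed that there are no missing values. Judging from the scatterplot matrix, the variables that separate Species most clearly are Petal.Width and Petal.Length.

Splitting into training and test sets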
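The visual impression can be backed with a quick numeric check. The sketch below (the names petal_length_range and petal_width_range are illustrative) computes the minimum and maximum of each petal measurement per species, then redraws the scatterplot matrix with one colour per species so the separation is easier to see.

```r
# Min (row 1) and max (row 2) of each petal measurement, one column per species
petal_length_range <- sapply(split(iris$Petal.Length, iris$Species), range)
petal_width_range  <- sapply(split(iris$Petal.Width,  iris$Species), range)

petal_length_range
petal_width_range

# Same scatterplot matrix as plot(df), but coloured by species
pairs(iris[, 1:4], col = iris$Species)
```

If the setosa maximum lies below the versicolor minimum for a variable, that variable alone separates setosa from the other two species; versicolor and virginica overlap more, which is where a classifier has to work.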

# test set
# needed to evaluate the model
# train/test sampling
training_sampling <- sort(sample(1:nrow(df),nrow(df)*0.7))
test_sampling <- setdiff(1:nrow(df), training_sampling)

# nrow, ncol
# nrow = number of row
# ncol = number of column

## training_set, test_set
training_set <- df[training_sampling,]
test_set <- df[test_sampling,]
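Because sample() draws at random, the split above changes on every run, and so do all the numbers that follow. A minimal sketch of a repeatable variant is shown here; the seed value 42 and the names n_train, train_idx, train_demo and test_demo are assumptions for illustration, and a seeded run will not reproduce the exact outputs printed in this post.

```r
set.seed(42)  # assumption: any fixed seed works; 42 is arbitrary

df <- iris

# 70% of the rows for training, rounded to a whole number of rows
n_train <- round(nrow(df) * 0.7)
train_idx <- sort(sample(seq_len(nrow(df)), n_train))

train_demo <- df[train_idx, ]
test_demo  <- df[-train_idx, ]   # negative indexing keeps the remaining rows

# Check the sizes and how evenly the species are represented in each part
nrow(train_demo)
nrow(test_demo)
table(train_demo$Species)
table(test_demo$Species)
```

Looking at the per-species counts is worthwhile because a purely random split does not guarantee that each species keeps its 1/3 share in both parts.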

Building the random forest model

Let's build the random forest model with the randomForest() function. Its basic form is:
- randomForest(dependent variable ~ independent variables, data = df)

rf_m <- randomForest(Species ~ Petal.Length + Petal.Width, data = training_set)

rf_m

## 
## Call:
##  randomForest(formula = Species ~ Petal.Length + Petal.Width,      data = training_set) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 2.86%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         36          0         0  0.00000000
## versicolor      0         32         1  0.03030303
## virginica       0          2        34  0.05555556

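The output shows that the model fits the training set very well: the out-of-bag (OOB) error estimate is 2.86%, i.e. 3 of the 105 training observations were misclassified. The OOB estimate evaluates each observation using only the trees that did not see it during training, so it is more honest than a plain training error, but it is still computed from the training set. Now let's apply the model to the test set.

Predicting the test set

Let's predict the test set with the predict() function. Its basic form is: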
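The numbers in the printed summary can also be read directly from the fitted object. The sketch below fits a separate illustrative model (rf_all, on all four predictors and the full iris data, with an arbitrary seed) rather than reusing rf_m, so its values will differ from the output above; confusion and err.rate are components of a randomForest object, and importance() is the package's accessor for variable importance.

```r
library(randomForest)

set.seed(42)  # assumption: arbitrary seed
# Illustrative model on all four measurements (not the rf_m model from the text)
rf_all <- randomForest(Species ~ ., data = iris)

# OOB confusion matrix, including the class.error column
rf_all$confusion

# err.rate has one row per tree; the last row is the OOB error of the full forest
oob_error <- tail(rf_all$err.rate[, "OOB"], 1)
oob_error

# Mean decrease in Gini impurity per predictor (larger = more useful for splitting)
importance(rf_all)
```

The importance table offers a model-based check on the earlier visual choice of the two petal variables.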
- predict(rf_m, newdata = df, type = "class")

rf_p <- predict(rf_m, newdata = test_set, type = "class")

rf_p

##          5          9         12         18         25         27         30 
##     setosa     setosa     setosa     setosa     setosa     setosa     setosa 
##         32         35         36         37         39         45         49 
##     setosa     setosa     setosa     setosa     setosa     setosa     setosa 
##         51         52         54         55         58         67         69 
## versicolor versicolor versicolor versicolor versicolor versicolor versicolor 
##         71         74         76         77         84         85         86 
##  virginica versicolor versicolor versicolor  virginica versicolor versicolor 
##         87         88         91        101        102        104        111 
## versicolor versicolor versicolor  virginica  virginica  virginica  virginica 
##        113        116        119        130        133        135        139 
##  virginica  virginica  virginica  virginica  virginica  virginica  virginica 
##        140        146        147 
##  virginica  virginica  virginica 
## Levels: setosa versicolor virginica
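Besides hard labels, predict() can return the share of trees voting for each class, which shows how confident the forest is about a given flower. This is a minimal sketch with an illustrative model (rf_prob_demo, trained on the full iris data with an arbitrary seed), so the rows shown were seen during training; type = "prob" is a documented option of the package's predict method.

```r
library(randomForest)

set.seed(42)  # assumption: arbitrary seed
# Illustrative model using the same two predictors as in the text
rf_prob_demo <- randomForest(Species ~ Petal.Length + Petal.Width, data = iris)

# Rows 1, 71 and 120 are a setosa, a versicolor and a virginica flower
probs <- predict(rf_prob_demo, newdata = iris[c(1, 71, 120), ], type = "prob")

# One row per flower, one column per class; each row sums to 1
probs
rowSums(probs)
```

Flowers near the versicolor/virginica boundary tend to get split votes, while setosa flowers are usually assigned almost unanimously.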

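The predict() call returns a predicted label for each test row. A list of labels gives no overall summary of how well the model did, though, so we cross-tabulate the predictions against the true species in a confusion matrix.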

table(rf_p, test_set$Species)

##             
## rf_p         setosa versicolor virginica
##   setosa         14          0         0
##   versicolor      0         15         0
##   virginica       0          2        14
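A confusion matrix can be reduced to summary figures with a few lines of arithmetic. The sketch below re-enters the counts from the table above by hand (the names cm, accuracy and recall are illustrative) so that it runs without the random split; with live objects, cm would simply be table(rf_p, test_set$Species).

```r
# Counts copied from the confusion matrix above:
# rows = predicted class, columns = true class (filled column by column)
classes <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(14,  0,  0,
                0, 15,  2,
                0,  0, 14),
             nrow = 3,
             dimnames = list(predicted = classes, actual = classes))

# Overall accuracy: correctly classified (the diagonal) over all test rows
accuracy <- sum(diag(cm)) / sum(cm)
accuracy

# Per-class recall: correct predictions over the true count of each class
recall <- diag(cm) / colSums(cm)
recall
```

Here 43 of the 45 test rows lie on the diagonal, and the only class with recall below 1 is versicolor, at 15 of 17.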

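In this way, we used a random forest to classify the Species of the iris data and assessed the model's accuracy with a confusion matrix. On this test set, 43 of the 45 observations were classified correctly (about 95.6%); the two errors are versicolor flowers predicted as virginica.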