Statistical Analytics - rxDTree

Sample Code


The data used here is one of the sample data sets in the Revolution folder, a CSV file named "mortDefaultSmall2009".

The data set has 10,000 rows and six variables (columns): 'creditScore', 'houseAge', 'yearsEmploy', 'ccDebt', 'year', and 'default'.

This walkthrough uses only the 2009 file, but sample data is available for the ten years from 2000 through 2009.

Data from the other years will be used in later parts, such as the one on merging data.

1. Importing the data


# (1-1) Set the data location. The file is one of the sample data sets that
# are created automatically when Revolution R is installed: the
# 'mortDefaultSmall2009' CSV file in the SampleData folder.

text_mort <- "C:/Revolution/R-Enterprise-6.1/R-2.14.2/library/RevoScaleR/SampleData/mortDefaultSmall2009.csv"

# (1-2) Import the data with rxImport

data_mort <- rxImport(inData = text_mort, outFile = "data_mort.xdf", overwrite = TRUE)
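
The introduction notes that sample files exist for 2000 through 2009. As a minimal sketch, the paths for all ten years could be built as follows. It assumes that the other files follow the same naming pattern as the 2009 file (mortDefaultSmall<year>.csv) and sit in the same folder; the names sample_dir, years, mort_files and found are introduced here for illustration only.

```r
# Folder taken from the path used in (1-1)
sample_dir <- "C:/Revolution/R-Enterprise-6.1/R-2.14.2/library/RevoScaleR/SampleData"

# Assumption: each year's file is named mortDefaultSmall<year>.csv
years <- 2000:2009
mort_files <- file.path(sample_dir, sprintf("mortDefaultSmall%d.csv", years))

# Report which of the expected files are actually present on this machine
found <- file.exists(mort_files)
print(data.frame(year = years, found = found))
```

Checking file.exists first makes a wrong installation path visible before any import is attempted.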

2. Drawing a decision tree with rxDTree


# Draw a decision tree with rxDTree

# (1-3) Decision tree without specifying maxDepth, which sets the node depth
# (default = 10)

mort_DT <- rxDTree(creditScore ~ yearsEmploy + houseAge + ccDebt, data = data_mort)

plot(rxAddInheritance(mort_DT))
text(rxAddInheritance(mort_DT))

[Figure: decision tree plot for mort_DT (default maxDepth)]


# (1-4) Decision tree with maxDepth set to 3

mort_DT1 <- rxDTree(creditScore ~ yearsEmploy + houseAge + ccDebt, data = data_mort, 
    maxDepth = 3)

plot(rxAddInheritance(mort_DT1))
text(rxAddInheritance(mort_DT1))

[Figure: decision tree plot for mort_DT1 (maxDepth = 3)]

Compared with the first decision tree, the tree with the smaller maxDepth is visibly shallower.
When the nodes extend too deep, the tree becomes hard to read as a summary of the data.
In that case, the tree should be pruned afterwards.
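
To put the effect of maxDepth in numbers: a binary tree limited to depth d can have at most 2^d terminal nodes and 2^d - 1 splits. The sketch below assumes that the root node counts as depth 0, as in rpart; max_leaves and max_splits are hypothetical helper names.

```r
# Upper bounds for a binary tree, assuming the root node is at depth 0
max_leaves <- function(depth) 2^depth
max_splits <- function(depth) 2^depth - 1

# The two settings used in this section
depths <- c(3, 10)
bounds <- data.frame(
  maxDepth   = depths,
  max_leaves = max_leaves(depths),
  max_splits = max_splits(depths)
)
print(bounds)
```

Under that assumption, maxDepth = 3 allows at most 8 terminal nodes, against up to 1024 for the default of 10.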

3. Finding the optimal number of splits with the cp table


# (1-6) Find the optimal number of splits with the cp table

mort_DT$cptable

##           CP nsplit rel error xerror    xstd
## 1  4.750e-04      0    1.0000  1.001 0.01404
## 2  4.128e-04      2    0.9991  1.014 0.01419
## 3  3.959e-04      8    0.9963  1.015 0.01421
## 4  3.669e-04      9    0.9960  1.016 0.01422
## 5  3.371e-04     11    0.9952  1.017 0.01423
## 6  3.321e-04     12    0.9949  1.017 0.01423
## 7  3.239e-04     15    0.9939  1.017 0.01423
## 8  2.641e-04     16    0.9936  1.018 0.01426
## 9  2.577e-04     17    0.9933  1.019 0.01426
## 10 2.468e-04     19    0.9928  1.019 0.01426
## 11 2.242e-04     20    0.9925  1.020 0.01426
## 12 2.238e-04     23    0.9919  1.020 0.01426
## 13 2.217e-04     27    0.9910  1.020 0.01426
## 14 2.190e-04     28    0.9907  1.020 0.01426
## 15 2.102e-04     36    0.9888  1.020 0.01426
## 16 1.988e-04     37    0.9886  1.020 0.01428
## 17 1.964e-04     39    0.9882  1.020 0.01428
## 18 1.956e-04     41    0.9878  1.020 0.01428
## 19 1.942e-04     46    0.9868  1.020 0.01428
## 20 1.808e-04     47    0.9866  1.020 0.01427
## 21 1.603e-04     50    0.9861  1.020 0.01427
## 22 1.595e-04     51    0.9859  1.021 0.01428
## 23 1.535e-04     57    0.9850  1.021 0.01428
## 24 1.394e-04     59    0.9847  1.022 0.01429
## 25 1.191e-04     60    0.9845  1.022 0.01429
## 26 1.175e-04     62    0.9843  1.022 0.01430
## 27 1.166e-04     65    0.9839  1.022 0.01430
## 28 1.136e-04     66    0.9838  1.022 0.01430
## 29 1.104e-04     67    0.9837  1.022 0.01430
## 30 1.074e-04     69    0.9835  1.022 0.01431
## 31 9.968e-05     70    0.9834  1.022 0.01431
## 32 9.964e-05     71    0.9833  1.022 0.01431
## 33 9.752e-05     72    0.9832  1.022 0.01431
## 34 9.570e-05     73    0.9831  1.022 0.01431
## 35 8.750e-05     74    0.9830  1.022 0.01430
## 36 7.547e-05     75    0.9829  1.022 0.01430
## 37 6.565e-05     76    0.9828  1.022 0.01430
## 38 6.259e-05     77    0.9828  1.022 0.01430
## 39 6.184e-05     78    0.9827  1.022 0.01430
## 40 6.149e-05     81    0.9825  1.022 0.01430
## 41 6.050e-05     83    0.9824  1.022 0.01430
## 42 5.860e-05     85    0.9823  1.022 0.01430
## 43 5.456e-05     86    0.9822  1.022 0.01430
## 44 5.417e-05     88    0.9821  1.022 0.01430
## 45 4.997e-05     90    0.9820  1.022 0.01430
## 46 4.766e-05     91    0.9819  1.022 0.01430
## 47 3.866e-05     92    0.9819  1.022 0.01430
## 48 3.863e-05     93    0.9819  1.022 0.01430
## 49 3.491e-05     96    0.9817  1.022 0.01430
## 50 3.175e-05     99    0.9816  1.022 0.01430
## 51 3.049e-05    100    0.9816  1.022 0.01430
## 52 2.944e-05    102    0.9815  1.022 0.01430
## 53 2.522e-05    103    0.9815  1.022 0.01430
## 54 2.469e-05    104    0.9815  1.022 0.01430
## 55 2.403e-05    105    0.9815  1.022 0.01430
## 56 2.242e-05    106    0.9814  1.022 0.01430
## 57 2.161e-05    107    0.9814  1.022 0.01430
## 58 1.735e-05    108    0.9814  1.022 0.01430
## 59 1.381e-05    109    0.9814  1.022 0.01430
## 60 1.221e-05    110    0.9814  1.022 0.01430
## 61 1.218e-05    111    0.9814  1.022 0.01430
## 62 3.843e-06    112    0.9813  1.022 0.01430
## 63 1.483e-06    113    0.9813  1.022 0.01430
## 64 1.246e-06    114    0.9813  1.022 0.01430
## 65 1.027e-06    115    0.9813  1.022 0.01430
## 66 3.936e-07    116    0.9813  1.022 0.01430
## 67 5.091e-08    117    0.9813  1.022 0.01430
## 68 0.000e+00    118    0.9813  1.022 0.01430

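In the table above, the usual approach is to follow the cross-validation error (xerror) as nsplit increases: the number of splits at which xerror stops improving, typically its minimum, is taken as the best tree size.

For this data, xerror never decreases. It is lowest (1.001) at nsplit = 0, rises to about 1.017 by roughly 10 splits, and then stays almost flat up to 1.022. Cross-validation therefore gives no evidence that any split improves on the root node, and splits beyond about 10 change the error very little.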

plotcp(rxAddInheritance(mort_DT))

[Figure: plotcp output for mort_DT]

The plot above shows the same pattern: from around nsplit = 10 onward, the error values are nearly the same.
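
A common way to read a cp table is to take the row with the smallest xerror, or the simplest tree whose xerror lies within one standard error (xstd) of that minimum. The sketch below applies both rules to the first six rows copied from the table above; the remaining rows all have a larger xerror, so they would not change the outcome. The object names are for illustration only.

```r
# First six rows of the cp table printed above
cp_tab <- data.frame(
  CP     = c(4.750e-04, 4.128e-04, 3.959e-04, 3.669e-04, 3.371e-04, 3.321e-04),
  nsplit = c(0, 2, 8, 9, 11, 12),
  xerror = c(1.001, 1.014, 1.015, 1.016, 1.017, 1.017),
  xstd   = c(0.01404, 0.01419, 0.01421, 0.01422, 0.01423, 0.01423)
)

# Rule 1: row with the smallest cross-validation error
best <- cp_tab[which.min(cp_tab$xerror), ]

# Rule 2 (one-standard-error rule): smallest tree whose xerror is within
# one xstd of the minimum
threshold <- best$xerror + best$xstd
within_se <- cp_tab[cp_tab$xerror <= threshold, ]
one_se <- within_se[which.min(within_se$nsplit), ]

print(best)
print(one_se)
```

For these rows both rules select nsplit = 0, because xerror is lowest at the root. Any larger tree chosen from this table should be read with that in mind.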

4. Pruning the tree


prune.mort_DT1 <- prune.rxDTree(mort_DT, cp = 0.001)
plot(rxAddInheritance(prune.mort_DT1))

## Error: fit is not a tree, just a root


prune.mort_DT2 <- prune.rxDTree(mort_DT, cp = 1e-04)
plot(rxAddInheritance(prune.mort_DT2))
text(rxAddInheritance(prune.mort_DT2))

[Figure: decision tree plot for prune.mort_DT2 (cp = 1e-04)]

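The first call, prune.mort_DT1, raises an error and no plot is drawn, because only a single node (the root) remains after pruning.

When cp is moved closer to 0, the plot above is produced.

The tree still looks complex, but it has been pruned somewhat compared with the first, unpruned tree (mort_DT) drawn in section 2.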
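
Both pruning results can be read off the cp table from section 3. The sketch below assumes that prune.rxDTree follows the rpart convention, in which pruning at a given cp leaves the tree of the first table row whose CP is less than or equal to that value. Only a few rows are copied here, and pruned_nsplit is a hypothetical helper.

```r
# Rows 1, 2, 29, 30, 31 and 32 of the cp table from section 3
cp_rows <- data.frame(
  CP     = c(4.750e-04, 4.128e-04, 1.104e-04, 1.074e-04, 9.968e-05, 9.964e-05),
  nsplit = c(0, 2, 67, 69, 70, 71)
)

# Number of splits left after pruning at `cp` (rpart convention, assumed here).
# CP must be sorted in decreasing order, as in the printed table.
pruned_nsplit <- function(tab, cp) {
  tab$nsplit[which(tab$CP <= cp)[1]]
}

print(pruned_nsplit(cp_rows, 0.001))  # cp used for prune.mort_DT1
print(pruned_nsplit(cp_rows, 1e-04))  # cp used for prune.mort_DT2
```

Under this assumption, cp = 0.001 exceeds the largest CP in the table (4.750e-04), so no split survives and only the root is left, which matches the error message. With cp = 1e-04, 70 of the original 118 splits remain.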

Appendix

rxDTree

Usage

rxDTree(formula, data, outFile = NULL, outColName = ".rxNode", writeModelVars = FALSE,
    overwrite = FALSE, pweights = NULL, fweights = NULL, method = NULL, parms = NULL,
    cost = NULL, minSplit = max(20, sqrt(numObs)), minBucket = round(minSplit/3),
    maxDepth = 10, cp = 0, maxCompete = 0, maxSurrogate = 0, useSurrogate = 2,
    surrogateStyle = 0, xVal = 2, maxNumBins = NULL, maxUnorderedLevels = 32,
    removeMissings = FALSE, pruneCp = 0, rowSelection = NULL, transforms = NULL,
    transformObjects = NULL, transformFunc = NULL, transformVars = NULL,
    transformPackages = NULL, transformEnvir = NULL,
    blocksPerRead = rxGetOption("blocksPerRead"),
    reportProgress = rxGetOption("reportProgress"), verbose = 0,
    computeContext = rxGetOption("computeContext"),
    xdfCompressionLevel = rxGetOption("xdfCompressionLevel"), ...)
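
Two of the defaults in this signature depend on the number of observations. For the 10,000-row data set used in this document they work out as follows, taking the expressions exactly as written in the usage above; num_obs, min_split and min_bucket are illustrative names.

```r
num_obs <- 10000  # rows in mortDefaultSmall2009

# Defaults as written in the usage:
#   minSplit  = max(20, sqrt(numObs))
#   minBucket = round(minSplit/3)
min_split  <- max(20, sqrt(num_obs))
min_bucket <- round(min_split / 3)

print(c(minSplit = min_split, minBucket = min_bucket))
```

If these parameters carry the same meaning as their rpart counterparts, a node needs at least 100 observations before a split is attempted, and every terminal node holds at least 33.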

Hankuk University of Foreign Studies. Dept of Statistics. Daewoo Choi Lab. Yeeseul Han.
e-mail : han.lolove17@gmail.com