Regression with R package gbm

0. 개요

gbm은 분류(classification)와 회기분석(regression)을 지원하는 매우 강력한 알고리즘이다. R 패키지인 gbm을 이용하여 독립변수와 종속변수간의 관계를 살펴보는 회귀분석이다.
* 데이터셋: cars
* 독립변수 : cylinder
* 종속변수 : displacement + power + weight + acceleration + year

1. 데이터 다운로드

download.file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv", destfile = 'cars.csv')
cars <- read.csv('cars_20mpg.csv')

2. 데이터 살펴보기

아래 코드는 결측치의 유무를 살표보고 결측치를 각 컬럼의 평균값으로 채우는 코드이다.
결측치를 처리하지 않아도 gbm()은 결측치를 제외하고 훈련을 수행한다. 따라서 아래 코드를 굳이 수행할 필요는 없지만, 성능을 좀 더 높이기 위해서는 결측치를 수행하는 것이 좋다.

colSums(is.na(cars))   # 컬럼별로 결측값의 개수 확인

##          name       economy     cylinders  displacement         power 
##             0             8             0             0             6 
##        weight  acceleration          year economy_20mpg 
##             0             0             0             8

colMeans(cars[, 2:9], na.rm = T) # 각 컬럼의 평균값 확인

##       economy     cylinders  displacement         power        weight 
##    23.5145729     5.4753695   194.7795567   105.0825000  2979.4137931 
##  acceleration          year economy_20mpg 
##    15.5197044    75.9211823     0.5979899

#각 컬럼의 결측치를 각 컬럼의 평균값으로 넣는다
cars$economy[is.na(cars$economy)] <- mean(cars$economy, na.rm = T)
cars$power[is.na(cars$power)] <- mean(cars$power, na.rm = T)
cars$economy_20mpg[is.na(cars$economy_20mpg)] <- mean(cars$economy_20mpg, na.rm = T)

3. 데이터 분리.

아래 코드는 데이터셋을 train과 validation dataset으로 분리하는 코드이다.
그런데, gbm 패키지는 데이터를 분리할 필요가 없기 때문에(함수에서 자동 분류)하기 때문에 여기서는 주석 처리한다.

# rows <- sample(1:nrow(cars1), size = nrow(cars1)*0.3, replace = FALSE)
# train <- cars1[-rows,]
# valid <- cars1[rows,]

4. build and train the model

모델을 훈련시키는 gbm() 함수를 수행시키기 전에 함수에 사용되는 파라메터를 반드시 확인해야 한다.
특히, distribution, n.trees, train.fraction, class.stratify.cv, verbose를 확인한다.
나머지는 기본값을 써도 무방하다.

1. distribution 파라메타: 적절한 값이 들어가지 않으면, 수행이 되기는 하지만, 결과값이 모두 nan으로 나타남.

distribution 파라메터는 이 훈련에 사용할 분포(distribution) 이름 또는 list를 말한다.
distribution을 지정하지 않으면, gbm이 종속변수(여기서는 cylinders)의 유형에 따라 스스로 지정한다. 따라서 잘 모르겠으면 이 항목을 주석 처리한다.
distribution을 지정하지 않을 경우, 종속변수가
- 이항 분류이면: “bernoulli”
- factor형(다항분류)이면: “multinomial”
- class ‘Surv’ 이면: “coxph”
- 그 외는: gaussian
distribution을 지정할 경우는 종속변수의 유형에 따라 다음과 같이 지정해야 한다.
- 회귀(regression) 인 경우: “gaussian” (출력은 squared error) 또는 “laplace” (출력은 absolute loss) 또는 “tdist” (출력은 t-distribution loss)
- 이항분류인 경우: “bernoulli” (출력은 0-1의 logistic regression) 또는 “huberized” (출력은 0-1의 huberized hinge loss), “adaboost” (출력은 0-1의 AdaBoost exponential loss), “poisson” (출력은 count), “coxph” (출력은 right censored observations)
- “quantile” regression인 경우: distribution = list(name = “quantile”,alpha = 0.25): alpha는 추정할 Quantile
- “t-dist” regression인 경우: distribution = list(name = “tdist”,df = DF) : DF 자유도는 기본 4.
- “pairwise” regression인 경우: distrubution = list(name=“pairwise”,group=…,metric=…,max.rank=…)
- 그룹은 인스턴스가 속한 그룹 (일반적으로 정보 검색 응용 프로그램의 쿼리)을 공동으로 나타내는 데이터의 열 이름을 가진 문자형 벡터임.

2. 기타 파라메타

n.trees: tree 수. 때때로 많이 필요하나, Random Forest와 달리 과적합이 일어날 수 있으므로 성능 모니터링하면서 조절.
interaction.depth: the maximum depth of each tree. 대체적으로 1이 잘 작동함. 10보다 클 필요는 없음
shrinkage: learning rate or step-size reduction; 0.001 to 0.1 usually work, but a smaller learning rate typically requires more trees. 알고리즘이 기울기 하강을 얼마나 빨리 진행하는지 제어. 값이 작을수록 과적합의 가능성이 줄어들지만 최적의 적합을 찾는 시간도 늘어남.
train.fraction: train과 valid 데이텃셋의 비율. 1이면 valid는 nan으로 나타남.
cv.folds: cross-validation fold 개수. Cv.folds> 1이면 gbm은 일반적인 맞춤 외에 교차 검증을 수행하고 cv.error에 반환 된 일반화 오류의 추정치를 계산함.
verbose: 모델 수행 시 진행 상황과 성능 지표를 보여줄 것인지 말 것인지.(T/F)
class.stratify.cv: mulitnomial, binomial에서만 사용. 교차 유효성 검사를 클래스별로 계층화해야하는지 여부를 나타내는 논리값. 교차 유효성 검사를 계층화하는 목적은 훈련 세트에 모든 클래스가 포함되어 있지 않은 상황을 피하는 데 도움이 됨.
n.cores = NULL # will use all cores by default
weights: distribution이 pairwise인 경우에만 사용함. 양수.

library(gbm)
# 여러번 수행시키면 gbm은 이전 훈련을 기억하고 있다. 따라서 새롭게 수행시키기 위해서 seed()함수를 사용한다.
set.seed(123)

cars_gbm <- gbm(
  formula = cylinders ~ displacement + power + weight + acceleration + year,
  distribution = "gaussian", 
  data = cars,
  #weights,
  var.monotone = NULL,
  n.trees = 100,
  interaction.depth = 3, 
  n.minobsinnode = 10,
  shrinkage = 0.1,       
  bag.fraction = 0.5,
  train.fraction = 0.7,
  cv.folds = 2,
  keep.data = TRUE,
  verbose = TRUE,
  n.cores = NULL
)

## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        2.4930          2.2501     0.1000    0.5672
##      2        2.0295          1.8223     0.1000    0.4606
##      3        1.6556          1.4806     0.1000    0.3527
##      4        1.3537          1.2177     0.1000    0.3084
##      5        1.1059          0.9970     0.1000    0.2515
##      6        0.9057          0.8157     0.1000    0.1895
##      7        0.7467          0.6692     0.1000    0.1503
##      8        0.6219          0.5536     0.1000    0.1244
##      9        0.5158          0.4529     0.1000    0.1001
##     10        0.4321          0.3760     0.1000    0.0888
##     20        0.1053          0.0943     0.1000    0.0099
##     40        0.0525          0.0546     0.1000    0.0004
##     60        0.0433          0.0573     0.1000   -0.0004
##     80        0.0374          0.0570     0.1000   -0.0004
##    100        0.0335          0.0594     0.1000   -0.0003

5. 성능 확인

위와 같이 훈련시키고 나면 gbm object(여기서는 cars_gbm)가 만들어지는데, 이 오브젝트의 의미는 다음과 같음. cars_gbm$…
* iniiF’절편(intercept)’, 나무가 조정하는 초기 예측 값
* fit: 회귀 척도 함수에 적합치를 포함하는 벡터입니다 (예 : 베르누이의 로그 홀수 스케일, 포아송의 로그 스케일)
* train.error: 훈련 데이터에서 평가된 각 boosting 반복에 대한 손실 함수 값을 포함하는 적합 트리 수와 길이가 같은 벡터.

print(cars_gbm)

## gbm(formula = cylinders ~ displacement + power + weight + acceleration + 
##     year, distribution = "gaussian", data = cars, var.monotone = NULL, 
##     n.trees = 100, interaction.depth = 3, n.minobsinnode = 10, 
##     shrinkage = 0.1, bag.fraction = 0.5, train.fraction = 0.7, 
##     cv.folds = 2, keep.data = TRUE, verbose = TRUE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 100 iterations were performed.
## The best cross-validation iteration was 44.
## The best test-set iteration was 37.
## There were 5 predictors of which 5 had non-zero influence.

summary(
  cars_gbm, 
  cBars = 10,
  method = relative.influence, # also can use permutation.test.gbm
  las = 2
)

##                       var    rel.inf
## displacement displacement 94.5556554
## power               power  2.5115902
## weight             weight  2.0266214
## acceleration acceleration  0.6987471
## year                 year  0.2073859

아래 gbm.perf에 사용된 파라메터 중 method는 최적의 부스팅 반복 횟수를 추정하는 데 사용되는 방법으로 다음과 같은 옵션이 있음.
* method = “OOB” OOB(Out-of-Bag) 추정치 계산
* method = “test” uses the test (or validation) dataset to compute an out-of-sample estimate
* method = “cv” extracts the optimal number of iterations using cross-validation if gbm was called with cv.folds > 1.

gbm.perf(cars_gbm, plot.it = T, oobag.curve = T, method = 'OOB')

## [1] 26
## attr(,"smoother")
## Call:
## loess(formula = object$oobag.improve ~ x, enp.target = min(max(4, 
##     length(x)/10), 50))
## 
## Number of Observations: 100 
## Equivalent Number of Parameters: 8.32 
## Residual Standard Error: 0.01668

# find index for n trees with minimum CV error
min_MSE <- which.min(cars_gbm$cv.error)

# get MSE and compute RMSE
sqrt(cars_gbm$cv.error[min_MSE])

## [1] 0.3586259