이 글은 ’The R Journal Vol. 9/1, June 2017’에 실린 pdp: An R Package for Constructing Partial Dependence Plots을 바탕으로 작성(번역 + 약간의 내용 추가)되었습니다.

Abstract

Neural networks, Random forests, Support vector machines과 같은 복잡한 비모수 모델은 예측 분석에서 그 어느 때보다 많이 사용되며, 특히 기존의 선형 통계 기법에 부과되는 엄격한 선형적 가정(다중선형회귀에서 선형성, 등분산성, 정규성)을 준수하지 않는 대규모 관측 데이터베이스를 다룰 때 더욱 자주 사용된다. 안타깝게도 이러한 모델의 결과를 이해하고 경영진에게 설명하는 것은 어려울 수 있다. Partial dependence plot은 간단한 해결책을 제공한다. Partial dependence plot은 예측 함수의 저차원 그래픽 렌더링을 의미하므로 결과(outcome)과 관심있는 예측 변수(predictors) 간의 관계를 보다 쉽게 이해할 수 있게 해준다. 이러한 그림은 블랙 박스 모델의 출력(output)을 설명하는 데 특히 유용하다. 이 글에서는 Partial dependence plot을 구성하기 위한 일반적인 R 패키지인 pdp를 소개한다.

Introduction

Harrison and Rubinfeld (1978)는 R 사용자들에게 널리 알려진 Boston housing data를 최초로 분석한 사람들이다. 그들의 목표 중 하나는 1970년 인구조사로부터 보스턴 교외에 있는 n = 506개 인구 조사 구역의 주택 가치의 중간 값에 대한 데이터를 사용하여 주택 가치 방정식을 찾는 것이었다. 데이터는 선형성, 정규성 및 상수 분산과 같은 많은 고전적 가정을 위반힌다. 그럼에도 불구하고, Harrison과 Rubinfeld는 변환, 유의성 테스트 및 그리드 서치를 조합하여 합리적인 적합 모델을 찾을 수 있었다(R2 = 0.81). 그 시간과 노력에 대한 보상의 일부는 식 (1)에 재현된 해석 가능한 예측 방정식이었다.

\[\begin{aligned} \widehat{log(MV)} = &9.76 + 0.0063RM^2 + 8.98 \times 10^{-5} AGE - 0.19log(DIS) + 0.096log(RAD) \\ &- 4.20 \times 10^{-4}TAX - 0.031 PTRATIO + 0.36(B - 0.63)^2 - 0.37 log(LSTAT) \\ &- 0012CRIM + 8.03 \times 10^{-5} ZN + 2.41 \times 10^{-4} INDUS + 0.088CHAS - 0.0064NOX^2 \quad \quad (1) \end{aligned}\]

지도 학습 알고리즘을 사용하면 일반적으로 더 높은 정확도로 몇 초 안에 자동으로 데이터를 적합시킬 수 있다. 그러나 이러한 알고리즘은 일반적으로 식 (1)과 같은 단순한 예측 공식을 생성하지 않기 때문에 해석력이 떨이진다. 이러한 모델은 여전히 데이터에 대한 통찰력을 제공할 수 있지만 단순한 방정식의 형태는 아니다. 예측 변수의 중요성을 정량화하는 것은 “빅 데이터” 분석에서 필수적인 작업이 되었으며, 트리 기반 방법과 같이 많은 지도 학습 알고리즘이 트레이닝 데이터의 모든 예측 변수에 변수 중요도 점수를 자연스럽게 할당할 수 있다.

예측 변수의 중요도를 결정하는 것은 모든 지도 학습 문제에 있어 중요한 작업이지만, “중요” 예측 변수들(features)의 하위 집합이 확인되면 이들(또는 하위 집합)과 반응변수(response)의 관계를 평가할 필요가 있는 경우가 많다. 이것은 여러 가지 방법으로 이루어질 수 있지만, 기계 학습에서는 종종 Partial Dependence Plot(PDP)을 구성함으로써 달성된다. 자세한 내용은 Friedman(2001)을 참조하면 된다(고 하는데 이 논문에는 자세히 설명 되어 있지 않은 것 같고, Elements of Statistical Learning을 참고하는 것이 더 나아 보인다.). PDP는 예측 변수들의 하위 집합(일반적으로 1-3)과 반응 변수 사이의 관계를 시각화하는 데 도움이 되는 동시에 모형의 다른 예측 변수의 평균 효과를 고려한다. 그것들은 Random forest와 Support vector machine과 같은 블랙박스 모델에 특히 효과적이다.

$x=\{x_1, x_2, \cdots, x_p\}$가 예측 함수가 $\hat{f(x)}$인 모델의 예측 변수들을 나타낸다고 하자. $x$를 관심있는 집합 $z_s$와 그 compliment($z_c$)로 분할하면($z_s$와 $z_c$는 mutually exclustive set이고, $z_s$와 $z_c$의 union이 $x$가 됨) $z_s$에 대한 반응변수의 partial dependence는 다음과 같이 정의된다.

\[f_s(z_s) = E_{z_c}[\hat{f}(z_s, z_c)]=\int\hat{f}(z_s,z_c)p_c(z_c)dz_c \quad \quad (2)\]

where $p_c(z_c)$ is the marginal probability density of $z_c$: $p_c(z_c)= \int p(x) dz_c$

식 (2)는 training data로부터 다음과 같이 추정될 수 있다.

\[\bar{f_s}(z_s)=\frac{1}{n}\sum_{i=1}^n\hat{f}(z_s, z_{i, c}) \quad \quad (3)\]

where $z_{i,c}(i=1,2, ... , n)$ are the values of $z_c$ that occur in the training sample.

즉, 모형의 다른 모든 예측 변수의 효과를 평균화하는 것이다.

식을 보면 뭔가 복잡해 보이지만 실제로 PDP를 그리는 과정은 간단하다.

간단한 설명을 위해 $z_s=x_1$이라 하자. $x_1$은 우리가 관심이 있는 어떤 변수 하나를 의미한다. 이 변수에 대해 $\{x_{11}, x_{12}, \cdots, x_{1k}\}$라는 유일한 상수 값들을 생각해볼 수 있다. $x_1$에 대한 반응변수의 partial dependece는 다음 과정을 통해 구할 수 있다.

Algorithm 1: A simple algorithm for constructing the partial dependence of the response on a single predictor $x_1$.

$1.$ For $i \in \{1,2, \cdots, k\}:$

(a) training data를 복사하고 $x_1$의 원래 값들을 상수 $x_{1i}$로 바꾼다.

(b) (a)의 데이터로 predicted value들을 구한다.

(c) $\bar{f_1}(x_{1i})$를 구하기 위해 예측값의 평균을 계산한다.

$2.$ $i=1,2,\cdots,k$에 대해 $\{ x_{1i}, \bar{f_1}(x_{1i}) \}$를 그린다.

Boston data를 이용해서 간단한 계산을 해보자.

# Fit a random forest to the boston housing data
library(pdp)
library(randomForest)
data(boston)  # load the boston housing data
set.seed(101)  # for reproducibility
boston.rf <- randomForest(cmedv ~ ., data = boston)

partial.lstat <- partial(boston.rf, pred.var = "lstat")
head(partial.lstat)

##    lstat     yhat
## 1 1.7300 31.14616
## 2 2.4548 31.14439
## 3 3.1796 31.10290
## 4 3.9044 30.62720
## 5 4.6292 28.96845
## 6 5.3540 27.00983

lstat.grid <- seq(min(boston$lstat), max(boston$lstat), length.out = round(nrow(boston)*0.1))
head(lstat.grid)

## [1] 1.7300 2.4548 3.1796 3.9044 4.6292 5.3540

identical(partial.lstat$lstat, lstat.grid)

## [1] TRUE

partial() 함수를 이용하면 지정한 변수(여기서는 lstat)에 대해 partial dependence를 계산하여 데이터 프레임으로 반환한다.
본문의 설명에 나온 $\{x_{11}, x_{12}, \cdots, x_{1k}\}$는 해당 변수의 최솟값과 최댓값을 경계값으로 하여 등간격으로 만든 sequence이다. k는 training data의 10%정도로 구성하는 것 같다.

nrow(boston)

## [1] 506

nrow(partial.lstat)

## [1] 51

train.temp <- boston
train.temp["lstat"] <- rep(lstat.grid[1])
pred.temp <- predict(boston.rf, newdata = train.temp)
mean(pred.temp)

## [1] 31.14616

partial.lstat[1,]

##   lstat     yhat
## 1  1.73 31.14616

yhat.lstat <- c()
for (i in 1:length(lstat.grid)){
  train.temp <- boston
  train.temp["lstat"] <- rep(lstat.grid[i])
  pred.temp <- predict(boston.rf, newdata = train.temp)
  yhat.lstat[i] <- mean(pred.temp)
}

identical(partial.lstat$yhat, yhat.lstat)

## [1] TRUE

partial.lstat1 <- cbind(lstat.grid, yhat.lstat)

# Using randomForest's partialPlot function
par(mfrow=c(1,2))
partialPlot(boston.rf, pred.data = boston, x.var = "lstat", main="partialPlot function")
plot(lstat.grid, yhat.lstat, type="l", main="User defined function")

Algorithm 1은 grid를 어떻게 정하냐에 따라 computationally intensive해질 수 있다. 다행히도 알고리즘은 꽤 쉽게 병렬화할 수 있다(Section 2.4에서 이 내용을 다룸). 또한 두 개 이상의 feature의 더 큰 하위 집합으로 쉽게 확장할 수 있다.

기존에 있는 R 패키지들에서는 PDP를 제한적으로만 그릴 수 있다. randomForest 패키지의 partialPlot() 함수는 ‘randomForest’ class를 가진 객체에만 적용할 수 있고, gbm 패키지의 plot() 함수도 ‘gbm’ 객체에만 적용할 수 있다. PDP는 트리 기반 알고리즘들 뿐만아니라 GAM이나 NN같은 지도학습 알고리즘이기만 하면 그릴 수 있다.

plotmo((Milborrow, 2017b))패키지는 pdp 패키지의 대안 중 하나이다. Milborrow에 따르면 plotmo 패키지는 “a poor man’s partial dependence plot”을 만든다. 특히, 다른 예측 변수를이 상수로 고정된 상태에서(연속형 변수는 중위수로 고정되고, 범주형 변수는 첫 번째 수준으로 고정됨), 한 개 또는 두 개의 예측 변수의 값이 변할 때 모델의 반응값을 그린다. 한 번에 두 개 변수까지 그릴 수 있고, PDP 보다는 덜 정확하지만 빠르게 그래프를 그릴 수 있다. 상호작용 항이 없는 additive model에 대해서는 PDP와 형태가 동일하다. 3.3.0 버전부터는 PDP를 지원하지만 기본 값으로 지정되어 있는 것은 아니다. plotmo 패키지와 pdp 패키지의 가장 큰 차이점은 plotmo 패키지에서는 Algorithm 1에서 step 1의 (a)~(c)를 적용하지 않고, 모든 데이터를 한 번에 누적하기 때문에 predict 함수의 내부 호출 횟수가 줄어든다는 것이다. 그렇다면 왜 pdp 패키지를 사용해야 할까?

As will be discussed in the upcoming sections, pdp:

• contains only a few functions with relatively few arguments;
• does not produce a plot by default;
• can be used more efficiently with "gbm" objects (see Section 2.4);
• produces graphics based on lattice (Sarkar, 2008), which are more flexible than base R graphics;
• defaults to using false color level plots for multivariate displays (see Section 2.2);
• contains options to mitigate the risks associated with extrapolation (see Section 2.4);
• has the option to display progress bars (see Section 2.4);
• has the option to construct PDPs in parallel (see Section 2.4);
• is extremely flexible in the types of PDPs that can be produced (see Section 2.6),

PDP는 상당한 상호작용이 존재하는 경우 오도될 수 있다. (Goldstein et al., 2015) 이를 극복하기 위해 Goldstein, Kapelner, Bleich, and Pitkin은 individual conditional expectation (ICE) plots이라는 개념을 만들었다. (ICEbox 패키지에서 사용 가능) ICE plot은 각각의 관측치에 대해서 반응변수와 관심있는 예측변수의 추정된 관계를 보여준다. 결과적으로 관심있는 예측변수에 대한 PDP는 모든 관측치에 걸쳐 해당하는 ICE 곡선을 평균하여 얻을 수 있다. Section 2.6에서 pdp 패키지를 이용하여 ICE 곡선을 어떻게 그릴 수 있는지 보여줄 것이다. ICEbox 패키지로 단일 예측 변수에 대한 PDP를 그릴 수 있는데 ?ICEbox::plot.ice를 통해 예제를 확인해 볼 수 있다. 색상으로 다른 예측변수들에 대한 정보를 효과적으로 나타낼 수 있기는 하지만, ICEbox 패키지를 사용하면 한 번의 한 변수에 대해서만 그래프를 그릴 수 있다. ICEbox 패키지를 이용하면 centered ICE(c-ICE) plot과 derivative ICE(d-ICE) plot도 그릴 수 있다. c-ICE plot은 관측치 간의 모델링된 관계에서의 이질성(heterogeneity)을 시각화 할 수 있게 해주고, d-ICE plot은 상호작용 효과를 탐색할 수 있게 해준다.

적합 모형에 기초한 반응 변수와 예측 변수 간의 관계를 시각화하기 위한 많은 다른 기법이 있다. 예를들어 car 패키지에는 partial-residual plot(선형성을 확인하는데 사용됨)과 marginal-model plot을 그리는 함수가 내장되어 있다. Effect displays는 effects 패키지를 통해서 가능한데, plotmo 패키지의 marginal model plot과 비슷하게 다른 예측변수들이 고정된 상태에서 parametric model의 관심있는 예측변수에 대한 그래프를 제공한다. 하지만 이런 패키지들은 간단한 parametric model(linear, generalized linear model)을 위해 만들어진 것이다. Black box 모델들에 대해서는 plotmo, ICEbox, pdp 패키지가 더 유용하다.

Constructing PDPs in R

pdp 패키지에서 중요하게 사용되는 두 가지 함수는 다음과 같다:

partial
plotPartial

partial 함수는 예측변수 값들의 그리드에 대해 적합된 모델로부터 partial dependence를 계산한다. 적합된 모델과 예측변수들은 각각 object와 pred.var 인자자(arguments)를 사용하여 지정된다 - 이 두 개만 필수적인 인자이다. 만약 plot = FALSE(디폴트 값)이면 partial 함수는 “data.frame” 클래스로부터 상속된, “partial” 클래스의 객체를 반환한다. partial 함수는 기본적으로 plotPartial에 의해 인식되는 추가 클래스가 포함된 데이터 프레임을 반환한다. 데이터 프레임의 열들은 pred.var 인자에 지정한 순서대로 나오고, yhat으로 이름이 붙는 맨 마지막 열은 partial dependence fucntion $\bar{f_s}(z_s)$의 값들을 포함한다. 만약 plot = TRUE이면 partial 함수는 내부적으로 plotParital 함수를 호출하고, PDP를 lattice plot(즉, “trellis” 객체)의 형태로 반환한다. NOTE: partial 함수는 plot = FALSE 로 지정하여 결과를 저장한 후 사용하는 것이 좋다. 이렇게 해야 더 유연하게 그래프를 그릴 수 있고, 기본 plot이 충분하지 않을 때 partial 함수를 다시 호출하는 낭비를 막을 수 있다.

plotPartial 함수를 사용하면 여러가지 plotting 옵션들을 사용할 수 있다. plotPartial 함수는 단순히 “partial” 클래스의 객체를 손쉽게 plotting 할 수 있도록 하는 것이다. ggplot2 패키지 같은 다른 시각화 패키지로도 partial 함수의 결과를 시각화 할 수 있다.

NOTE: pdp 패키지는 lattice 패키지에 의존하고 있는데 lattice 패키지는 grid 패키지를 기반으로 하고 있다. grid 그래픽스는 기본 R 그래픽스와는 조금 다르게 작동하는데 다음 두 가지를 알고 있으면 좋다.

lattice의 함수는 “trellis” 객체를 반환하지만 표시하지는 않는다; print method를 통해 표시된다. 하지만 R의 automatic printing rule때문에 커맨드 라인에 이 함수를 사용했을 때 결과는 자동적으로 프린트된다. 만약 plotPartial 함수가 source나 for 또는 while 같은 반복문에 사용되면 명시적으로 print 문을 사용해야 그래프를 그려준다. 이것은 partial 함수에서 plot = TRUE 옵션을 사용하는 것과 같다.
par를 통해 그래픽 파라미터들을 지정하는 것은 lattice plot에 영향을 주지 않는다. 대신 lattice는 그래픽 파라미터들을 지정하기 위한 trellis.par.set 함수를 제공한다. 여러 개의 lattice plot을 그릴 때는 latticeExtra 패키지나 gridExtra 패키지를 사용하는 것이 좋다.

pdp 패키지에서는 gridExtra 패키지의 grid.arrange 함수를 import하여 여러 개의 그래픽 객체들을 단일 plot에 그리는 기능을 지원하고 있다.

pdp 패키지에서는 아래 테이블에 모델들을 지원한다. 이 모델들에 대해서는 partial 함수의 pred.fun 인자나 type 인자(“regression” 또는 “classification”)를 따로 지정해주지 않아도 된다. 아래 테이블에 제시되지 않은 nnet 패키지나 사용자가 만든 예측 함수를 사용하는 경우에는 이 두 가지 인자들을 지정해주어야 한다.

Table 1: Models specifically supported by the pdp package. Note: for some of these cases, the user may still need to supply additional arguments in the call to partial.

partial 함수는 caret 패키지의 “train” 클래스를 지원하는데 이것은 caret 패키지에서 지원하는 모델이면 partial 함수를 사용할 수 있다는 것이다. caret 패키지에서 지원하는 모델의 리스트는 다음 링크를 통해 확인할 수 있다. http://topepo.github.io/caret/available-models.html

partial 함수에서 중요한 또 다른 인자는 train이다. 만약 train = NULL(디폴트) 이면 partial은 적합된 모델 객체로부터 training data를 뽑아내려고 할 것이다. training data의 복사본을 저장하는 객체들(“BinaryTree”, “RandomForest”, and “train” 클래스를 가진 객체들)의 경우 이것은 간단하다. 그렇지 않으면 partial은 개체에 저장된 call을 추출하여(가능한 경우) partial이 호출된 동일한 환경에서 training data를 평가하는 데 사용한다. 이것은 training data가 모델을 적합한 후이지만 partial 함수를 호출하기 전에 변경된 경우 문제가 될 수 있다. 따라서 train 인자를 명시적으로 지정해주는 것이 좋다. 만야 train = NULL 이면 training data는 적합된 모델에서 추출될 수 없다. “ksvm”이나 " xgb.Booster" 객체에 partial 함수를 사용 할 때 사용자들은 다음과 같은 에러 메시지를 볼 수 있다.

Error: The training data could not be extracted from object. Please supply the raw training data using the train argument in the call to partial.

적용 사례를 제시하기 위해 Boston housing data를 교정한 버전인 pdp::boston 데이터를 사용한다.

data(boston, package = "pdp") # load the (corrected) Boston housing data
library(randomForest) # for randomForest, partialPlot, and varImpPlot functions
set.seed(101) # for reproducibility
boston.rf <- randomForest(cmedv ~ ., data = boston, importance = TRUE)
boston.rf

## 
## Call:
##  randomForest(formula = cmedv ~ ., data = boston, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 9.08313
##                     % Var explained: 89.21

The model fit is reasonable, with an out-of-bag (pseudo) $R^2$ of 0.89.

varImpPlot(boston.rf) # Figure 1

the percentage of lower status of the population (lstat)과 the average number of rooms per dwelling (rm)이 the median value of owner-occupied homes (cmedv)와 밀접한 관련이 있는 변수라는 것을 알 수 있다.
그렇다면 이런 관계의 근원은 무엇일까?
이에 답하기 위해 lstat과 rm에 대한 cmedv의 partial dependnece를 살펴볼 수 있다.

Single predictor PDPs

randomForest 패키지는 단일 예측변수에 대한 PDP를 그릴 수 있는 partialPlot 함수를 제공한다.

partialPlot(boston.rf, pred.data = boston, x.var = "lstat")

pdp 패키지의 partial 함수에 plot = TRUE 옵션을 사용하면 같은 그래프를 그릴 수 있다. 유일한 차이는 pdp 패키지는 그래프를 그리는데 lattice 패키지를 사용한다는 것이다.

plot을 커스터마이징 하기 위해 plot = FALSE 옵션을 사용하여 데이터 프레임으로 결과를 저장한 후 plotPartial 함수를 사용할 수 있다.

This is illustrated in the example below which increases the line width(lwd = 2), adds a LOESS smooth(smooth = TRUE), and customizes the y-axis label(ylab = expression(f(lstat))).

Note: to encourage writing more readable code, the pipe operator %>% provided by the magrittr package (Bache and Wickham, 2014) is exported whenever pdp is loaded.

library(pdp) # for partial, plotPartial, and grid.arrange functions
p1 <- partial(boston.rf, pred.var = "lstat", plot = TRUE) # Figure 2 (left)
# Figure 2 (right)
p2 <- boston.rf %>% # the %>% operator is read as "and then"
  partial(pred.var = "lstat") %>%
  plotPartial(smooth = TRUE, lwd = 2, ylab = expression(f(lstat)))
grid.arrange(p1, p2, ncol = 2)

Figure 2: Partial dependence of cmedv on lstat based on a random forest. Left: Default plot. Right: Customized plot obtained using the plotPartial function.

Multi-predictor PDPs

partial 함수를 사용하는 것의 이점은 다음의 3가지로 나뉜다:

(1) random forest 뿐만 아니라 다른 여러가지 예측 모델에 적용할 수 있는 함수이다.

(2) it will allow for any number of predictors to be used (e.g., multivariate displays)

Q. 예측변수 두 개까지만 되는거 아닌가?

(3) foreach 패키지에 의해 지원되는 병렬 백앤드를 이용할 수 있다.

For example, the following code chunk uses the random forest model to assess the joint effect of lstat and rm on cmedv. The grid.arrange function is used to display three PDPs, which make use of various plotPartial options3, on the same graph. The results are displayed in Figure 3.

# Compute partial dependence data for lstat and rm
pd <- partial(boston.rf, pred.var = c("lstat", "rm"))

# Default PDP
pdp1 <- plotPartial(pd)

# Add contour lines and use a different color palette
rwb <- colorRampPalette(c("red", "white", "blue"))
pdp2 <- plotPartial(pd, contour = TRUE, col.regions = rwb)

# 3-D surface
pdp3 <- plotPartial(pd, levelplot = FALSE, zlab = "cmedv", drape = TRUE,
                    colorkey = TRUE, screen = list(z = -20, x = -60))

# Figure 3
grid.arrange(pdp1, pdp2, pdp3, ncol = 3)

Figure 3: Partial dependence of cmedv on lstat and rm based on a random forest. Left: Default plot. Middle: With contour lines and a different color palette. Right: Using a 3-D surface.

Note: the default color map for level plots is the color blind-friendly matplotlib (Hunter, 2007) ’viridis’ color map provided by the viridis package (Garnier, 2017).

Avoiding extrapolation

training data 외의 영역에서 PDP로 부터 결론을 도출하는 것은 현명하지 않다. 여기서는 PDP에서 외삽(extrapolation)의 위험을 완화시키기 위한 두 가지 방법을 제시한다: rug displays와 convex hulls. Rug displays는 축에 추가되는 일차원 plot이다. partial과 plotPartial함수 모두 rug 옵션을 가지고 있고, TRUE로 설정되었을 때, 수평 및 수직 축에 예측 변수에 대한 분포의 deciles(최소값 및 최대값 포함)를 표시한다.

partial(boston.rf, pred.var = "lstat", plot = TRUE, rug = TRUE)

2차원 이상인 경우에는 convex hull(여러 개의 점이 주어졌을 때, 모든 점들을 포함하는 최소 크기의 볼록 다각형)을 그리는 것이 더 유익하다. convex hull은 모형이 훈련된 예측 변수 공간의 영역을 개괄적으로 보여준다. chull = TRUR 일 때, $z_s$의 처음 두 차원(pred.var에 지정된 처음 두 개의 예측변수)의 convex hull이 계산된다. 예를 들어 partial 함수를 호출할 때 chull = TRUE로 지정하면 처음 두 변수의 convex hull에 해당하는 영역만 그려진다. 이 영역 밖에서 PDP를 해석하려고 하는 것은 잘못된 생각이다.

partial(boston.rf, pred.var = c("lstat", "rm"), plot = TRUE, chull = TRUE)

convex hull

Addressing computational concerns

PDP를 구축하는 것은 계산적으로 비쌀 수 있다. 큰 데이터 세트에서 컴퓨팅 부담을 덜어주기 위해 몇 가지 전략을 사용할 수 있다. 예를 들어, 각각의 고유한 rm 값을 사용하여 cmedv의 partial dependence를 계산할 필요가 없다. partial dependence를 계산하는 포인트 수를 줄이더라도 매우 합리적인 결과를 얻을 수 있다. 현재 옵션은 관심 변수 범위에서 동일한 간격의 값을 가진 그리드를 사용하는 것이다. 그리드에 사용되는 값의 수는 partial 함수의 grid.resolution 옵션을 사용하여 조절할 수 있다. 또는, 사용자가 지정한 값의 그리드(예: 특정한 분위수 값들)를 pred.grid 인수를 통해 제공할 수 있다.

# Figure 5
grid.arrange(
  partial(boston.rf, "rm", plot = TRUE),
  partial(boston.rf, "rm", grid.resolution = 30, plot = TRUE),
  partial(boston.rf, "rm", pred.grid = data.frame(rm = 3:9), plot = TRUE),
  ncol = 3
)

Figure 5: Partial dependence of cmedv on rm. Left: Default plot. Middle: Using a reduced grid size. Right: Using a user-specified grid.

partial 함수는 R의 기본 루프보다는 plyr 패키지에 의존하는데 이것은 progress bar를 사용하거나 partial 함수를 병렬적으로 실행하는 것을 쉽게 해준다. 사실, partial 함수는 foreach 패키지에 의해 지원되는 모든 병렬 백앤드를 사용할 수 있다. 이 기능을 사용하려면 먼저 doMC나 doParallel 같은 병렬 백앤드를 로드하고 등록해야 한다.

예를 들기 위해 Breiman과 Friedman(1985)에 설명된 로스엔젤레스 오존 오염 데이터를 사용할 것이다. 1976년 로스엔젤레스 분지에서 330일 동안 오존 농도(오존)와 기상학적 변수 8개를 일별 측정 자료를 포함하고 있다. 다음 코드 청크는 데이터를 R에 로드한다.

ozone <- read.csv(paste0("http://statweb.stanford.edu/~tibs/ElemStatLearn/", "datasets/LAozone.data"), header = TRUE)
str(ozone)

## 'data.frame':    330 obs. of  10 variables:
##  $ ozone   : int  3 5 5 6 4 4 6 7 4 6 ...
##  $ vh      : int  5710 5700 5760 5720 5790 5790 5700 5700 5770 5720 ...
##  $ wind    : int  4 3 3 4 6 3 3 3 8 3 ...
##  $ humidity: int  28 37 51 69 19 25 73 59 27 44 ...
##  $ temp    : int  40 45 54 35 45 55 41 44 54 51 ...
##  $ ibh     : int  2693 590 1450 1568 2631 554 2083 2654 5000 111 ...
##  $ dpg     : int  -25 -24 25 15 -33 -28 23 -2 -19 9 ...
##  $ ibt     : int  87 128 139 121 123 182 114 91 92 173 ...
##  $ vis     : int  250 100 60 60 100 250 120 120 120 150 ...
##  $ doy     : int  3 4 5 6 7 8 9 10 11 12 ...

다음으로, Friedman(1991)에서 소개된 multivariate adaptive regression splines(MARS) 알고리즘을 사용해 오존 농도를 8개 기상 변수와 day of the year(doy) 변수의 비선형 함수로 모델링한다. 여기서는 3원 상호작용까지 허용한다.

library(earth) # for earth function (i.e., MARS algorithm)
ozone.mars <- earth(ozone ~ ., data = ozone, degree = 3)
summary(ozone.mars)

## Call: earth(formula=ozone~., data=ozone, degree=3)
## 
##                                    coefficients
## (Intercept)                          11.7516490
## h(58-temp)                           -0.1510377
## h(temp-58)                            0.4883275
## h(ibh-1105)                          -0.0009825
## h(200-vis)                            0.0176604
## h(59-doy)                            -0.1102233
## h(doy-59)                            -0.0161285
## h(5730-vh) * h(temp-58)              -0.0137530
## h(vh-5730) * h(temp-58)               0.0014756
## h(55-humidity) * h(temp-58)          -0.0216973
## h(temp-58) * h(dpg-52)               -0.0174308
## h(temp-58) * h(52-dpg)                0.0038824
## h(temp-69) * h(doy-59)               -0.0031742
## h(1105-ibh) * h(21-dpg)              -0.0000928
## h(wind-6) * h(temp-58) * h(52-dpg)   -0.0032786
## 
## Selected 15 of 21 terms, and 8 of 9 predictors
## Termination condition: Reached nk 21
## Importance: temp, ibh, humidity, dpg, doy, vis, wind, vh, ibt-unused
## Number of terms at each degree of interaction: 1 6 7 1
## GCV 13.62966    RSS 3569.979    GRSq 0.7882793    RSq 0.8309301

다음 세 변수의 상호작용을 고려렿려해보자.

• wind: wind speed (mph) at Los Angeles International Airport (LAX) • temp: temperature (oF) at Sandburg Air Force Base • dpg: the pressure gradient (mm Hg) from LAX to Dagget, CA

이 상호작용을 이해하기 위해 PDP를 사용할 수 있따. 하지만, 세 개의 연속형 변수들 사이의 partial dependence는 계산 비용이 크기 때문에 partial을 병렬로 실행할 것이다. 병렬 백앤드를 설정하는 것은 간단하다. 이는 아래 코드를 통해 확인할 수 있는데 윈도우나 유닉스 같은 시스템 모두에서 doParallel 패키지를 이용하여 partial 함수를 병렬로 실행할 수 있다.

library(doParallel) # load the parallel backend
cl <- makeCluster(4) # use 4 workers
registerDoParallel(cl) # register the parallel backend

이제 partial을 병렬로 실행하기 위해 우리가 해야 할 일은 parallel = TRUE, paropts 옵션을 주는 것이다.

Note: it is considered good practice to shut down the workers by calling stopCluster when finished.

partial(ozone.mars, pred.var = c("wind", "temp", "dpg"), plot = TRUE,
        chull = TRUE, parallel = TRUE, paropts = list(.packages = "earth")) # Figure 6

## Warning: <anonymous>: ... may be used in an incorrect context: '.fun(piece, ...)'

## Warning: <anonymous>: ... may be used in an incorrect context: '.fun(piece, ...)'

stopCluster(cl) # good practice

# Add a label to the colorkey
lattice::trellis.focus("legend", side = "right", clipp.off = TRUE, highlight = FALSE)
grid::grid.text("ozone", x = 0.2, y = 1.05, hjust = 0.5, vjust = 1)
lattice::trellis.unfocus()

Figure 6: Partial dependence of ozone on wind, temp, and dpg. Since dpg is continuous, it is first converted to a shingle; in this case, four groups with 10% overlap.

두 개 이상의 변수를 사용할 때, plotPartial은 격자 구조물(trellis)을 만든다는 것을 알 필요가 있다. pred.var에 지정된 첫 두 개의 변수는 각각 수평축과 수직축에 사용되고 다른 변수들은 패널을 정의한다. 패널 변수들이 연속형이면 eual count 알고리즘(?lattice::equal.count 참고)을 이용하여 shingles(A shingle is a special Trellis data structure that consists of a numeric vector along with intervals that define the “levels” of the shingle. The intervals may be allowed to overlap.)가 먼저 생성된다. 따라서 가능하면 고차원 디스플레이에서 패널을 정의할 때 범주형 변수를 사용하는 것이 더 효과적일 것이다.

Q. 이 부분 무슨 말인지 잘 모르겠음

Classification problems

분류문제의 경우 partail; dependence function은 logit과 유사한 스케일에 있다(T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics. Springer-Verlag, 2009. [p370]). 범주형 반응변수의 K 개의 수준이 있을 때 각각의 클래스에 대해서 다음을 계산한다:

\[f_k(x)=log[p_k(x)] - \frac{1}{K}\sum_{k=1}^K log[p_k(x)], \quad \quad k=1,2,\cdots,K\]

여기서 $p_k(x)$는 k번째 클래스에 대한 예측 확률이다. $f_k(x)$를 그리는 것은 k번째 클래스에 대한 log-odds가 어떻게 예측 변수들의 서로 다른 부분 집합에 의존하는지 이해할 수 있도록 도와준다.

이를 보이기 위해, dataset 패키지에 있는 iris 데이터를 사용한다. iris 데이터는 세 가지 종(setosa, versicolor, virginica)의 붓꽃(iris)에 대한 데이터이다. 각 종별로 50개씩 포함되어 있고, 꽃의 sepal length(꽃받침 길이), sepal width(꽃받침 너비), petal length(꽃잎 길이), petal width(꽃잎 너비)가 센티미터 단위로 측정되어 있다. e1071 패키지의 svm 함수를 이용하여 Gaussian radial basis 함수를 kernel로 사용한 SVM 모델을 적합한다.

str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

library(e1071) # for svm function
iris.svm <- svm(Species ~ ., data = iris, kernel = "radial", gamma = 0.75,
                cost = 0.25, probability = TRUE)

Note: the partial function has to be able to extract the predicted probabilities for each class, so it is necessary to set probability = TRUE in the call to svm.

그런 다음 각각의 클래스에 대해 Petal.Width와 Petal.Length에 대한 Species의 partial dependence plot을 그린다.

pd <- NULL
for (i in 1:3) {
  tmp <- partial(iris.svm, pred.var = c("Petal.Width", "Petal.Length"),
                 which.class = i, grid.resolution = 101, progress = "text")
  pd <- rbind(pd, cbind(tmp, Species = levels(iris$Species)[i]))
}

# Figure 7
library(ggplot2)
ggplot(pd, aes(x = Petal.Width, y = Petal.Length, z = yhat, fill = yhat)) +
  geom_tile() +
  geom_contour(color = "white", alpha = 0.5) +
  scale_fill_distiller(name = "Centered\nlogit", palette = "Spectral") +
  theme_bw() +
  facet_grid(~ Species)

Figure 7: Partial dependence of Species on Petal.Width and Petal.Length for the iris data.

User-defined prediction functions

PDP는 기본적으로 평균적인 예측이다. 이것은 Algorithm 1의 step1. (c)를 보면 명백하다. 따라서 Goldstein 등(2015)에서 지적한 바와 같이 강한 이질성은 반응 변수와 예측 변수 간의 모델링된 관계의 복잡성을 숨길 수 있다. 이 때문에 Goldstein, Kapelner, Bleich, and Pitkin이 ICE plot을 만든 동기이다.

partial 함수에서 Algorithm 1의 step1. (c)에서 mean 대신에 median이나 trimmed mean 같은 다른 함수를 사용하거나 PDP를 확률 척도의 분류 문제에 대한 PDP를 얻을 수 있다. ICE 곡선을 그리는 것도 가능하다. 이런 유연성은 partial 함수의 pred.fun 인자 때문에 가능하다(0.4.0 버전 부터). 이 인자는 ’object’와 ’newdata’라는 두 개의 인자가 필요한 예측 함수를 허용한다. 제공된 예측 함수는 단일 예측 또는 예측 벡터를 반환해야 한다. 모든 예측의 평균을 반환하면 전통적인 PDP가 된다.

예측 벡터(즉, 각 관측치에 대해 하나씩)를 반환하면 일련의 ICE 곡선이 생성된다. <- 이 부분 잘 모르겠음

‘pred.fun’ 인수를 사용하면 확률 척도의 분류 문제에 대한 PDP를 얻을 수 있다. 우리는 모든 관측치에 걸쳐 평균을 낸 관심있는 예측 클래스의 확률을 계산하는 함수만 작성하면 된다. 아래 함수는 section 2.5에서 사용한 iris 예제에 대한 SVM 모델에서 Setosa 클래스에 대한 예측 확률을 추출하는데 사용될 수 있다.

pred.prob <- function(object, newdata) { # see ?predict.svm
  pred <- predict(object, newdata, probability = TRUE)
  prob.setosa <- attr(pred, which = "probabilities")[, "setosa"]
  mean(prob.setosa)
}

그런 다음 partial 함수의 pred.fun 인자에 이 함수(pred.prob)를 넣어주면 된다. 아래 코드는 확률 척도에서 Petal.Width와 Petal.Length에 대한 PDP를 얻기 위해 pred.prob를 사용한다.

# PDPs for Petal.Width and Petal.Length on the probability scale
pdp.pw <- partial(iris.svm, pred.var = "Petal.Width", pred.fun = pred.prob,
                  plot = TRUE)
pdp.pl <- partial(iris.svm, pred.var = "Petal.Length", pred.fun = pred.prob,
                  plot = TRUE)
pdp.pw.pl <- partial(iris.svm, pred.var = c("Petal.Width", "Petal.Length"),
                     pred.fun = pred.prob, plot = TRUE)

# Figure 8
grid.arrange(pdp.pw, pdp.pl, pdp.pw.pl, ncol = 3)

Figure 8: Partial dependence of Species on Petal.Width and Petal.Length plotted on the probability scale; in this case, the probability of belonging to the setosa species.

회귀 문제를 위한 기본적인 예측 함수는 아래와 같다.

pred.fun <- function(object, newdata) {
  mean(predict(object, newdata), na.rm = TRUE)
}

이것은 Algorithm 1에서 step 1 (c)와 대응된다. ICE 곡선을 그리고 싶다고 하자. 이를 위해서는 예측 벡터(newdata에서 각 관측치에 대해 하나씩(즉, pred.fun에서 mean 함수에 대한 호출을 제거))를 반환하는 예측 함수를 사용해야 한다.

# Use partial to obtain ICE curves
pred.ice <- function(object, newdata) predict(object, newdata)

Note: when the function supplied to pred.fun returns multiple predictions, the data frame returned by partial includes an additional column, yhat.id, that indicates which curve a point belongs to; in the following code chunk, there will be one curve for each observation in boston.

rm.ice <- partial(boston.rf, pred.var = "rm", pred.fun = pred.ice)

# Figure 9
plotPartial(rm.ice, rug = TRUE, train = boston, alpha = 0.3)

Figure 9: ICE curves depicting the relationship between cmedv and rm for the Boston housing example. Each curve corresponds to a different observation.

Figure 9의 곡선은 적합된 모델의 이질성(heterogeneity)를 나타낸다(즉, 몇 개의 곡선은 상반되는 관계를 보여줌). 이러한 이질성은 c-ICE 곡선을 사용하여 쉽게 발견할 수 있다; Goldstein et al. (2015)의 Equation (4)를 참고. dplyr 패키지를 사용하면 c-ICE 곡선을 얻기 위해 ’partial’의 출력을 후처리하는 것은 다소 간단하다(longitudinal data에 대한 raw change score 구성과 유사하다).

# Post-process rm.ice to obtain c-ICE curves
library(dplyr) # for group_by and mutate functions
rm.ice <- rm.ice %>%
  group_by(yhat.id) %>% # perform next operation within each yhat.id
  mutate(yhat.centered = yhat - first(yhat)) # so each curve starts at yhat = 0

PDP가 단지 대응되는 ICE 곡선의 평균이기 때문에 하나의 플랏에 PDP와 ICE 곡선을 함께 표시하는 것은 간단하다. ggplot2 패키지의 stat_summary 함수를 사용하여 ICE 곡선을 평균 낼 수 있다.

The code snippet below plots the ICE curves and c-ICE curves, along with their averages, for the predictor rm in the Boston housing example. The results are displayed in Figure 10.

# ICE curves with their average
p1 <- ggplot(rm.ice, aes(rm, yhat)) +
  geom_line(aes(group = yhat.id), alpha = 0.2) +
  stat_summary(fun.y = mean, geom = "line", col = "red", size = 1)

# c-ICE curves with their average
p2 <- ggplot(rm.ice, aes(rm, yhat.centered)) +
  geom_line(aes(group = yhat.id), alpha = 0.2) +
  stat_summary(fun.y = mean, geom = "line", col = "red", size = 1)

# Figure 10
grid.arrange(p1, p2, ncol = 2)

Figure 10: ICE curves (black curves) and their average (red curve) depicting the relationship between cmedv and rm for the Boston housing example. Left: Uncentered (here the red curve is just the traditional PDP). Right: Centered.

Using partial with the XGBoost library

마지막으로 최근에 널리 사용되는 기계학습 도구인 XGBoost에 PDP를 적용하는 예시를 들 것이다. XGBoost는 eXtreme Gradient Boosting의 약자이며 R의 xgboost 패키지는 Kaggle competitions의 우승자들이 많이 사용한 것으로 알려져 있다. 이 패키지는 gbm 패키지보다 몇 배는 빠른 것으로 나타난다. gbm 패키지와는 다르게 xgboost 패키지에는 PDP를 그리기 위한 내장함수가 없는데 pdp 패키지를 이용하면 xgboost 패키지의 결과물로도 PDP를 그릴 수 있다.

예를 들기 위해 Bostn housing 데이터를 사용한다. xgboost의 모델을 튜닝하기 위해 caret 패키지를 통해 10-fold cross-validation을 사용하였다.

library(doParallel) # load the parallel backend
cl <- makeCluster(4) # use 4 workers
registerDoParallel(cl) # register the parallel backend

# Tune an XGBoost model using 10-fold cross-validation
library(caret) # functions related to classification and regression training
set.seed(202) # for reproducibility
boston.xgb <- train(x = data.matrix(subset(boston, select = -cmedv)),
                    y = boston$cmedv, method = "xgbTree", metric = "Rsquared",
                    trControl = trainControl(method = "cv", number = 10),
                    tuneLength = 10)
print(boston.xgb$bestTune)

##     nrounds max_depth eta gamma colsample_bytree min_child_weight
## 801      50         5 0.3     0              0.6                1
##     subsample
## 801       0.5

The optimal model had a cross-validated $R^2$ of 0.902 (use print(boston.xgb$bestTune) to view the optimum tuning parameters). The next snippet of code computes the partial dependence of cmedv on both rm and lstat, individually and together. The results are displayed in Figure 11.

# PDPs for lstat and rm
pdp.lstat <- partial(boston.xgb, pred.var = "lstat", plot = TRUE, rug = TRUE)
pdp.rm <- partial(boston.xgb, pred.var = "rm", plot = TRUE, rug = TRUE)
pdp.lstat.rm <- partial(boston.xgb, pred.var = c("lstat", "rm"),
plot = TRUE, chull = TRUE)

# Figure 11
grid.arrange(pdp.lstat, pdp.rm, pdp.lstat.rm, ncol = 3)

Figure 11: PDPs for the top two most important variables in the Boston housing data using xgboost. Compare this to the random forest results displayed in Figures 4-5.

The train function creates objects of class “train”, whereas the xgboost function creates objects of class “xgb.Booster”. Since train defaults to storing a copy of the training data as part of the “train” object, there is no need to supply it in the call to partial in this example. However, this is not the case when using the xgboost package directly. To illustrate, we fit the same model using the xgboost function with the optimum tuning parameters found previously using caret.

library(xgboost) # for xgboost function
set.seed(203) # for reproducibility
boston.xgb <- xgboost(data = data.matrix(subset(boston, select = -cmedv)),
                      label = boston$cmedv, objective = "reg:linear",
                      nrounds = 50, max_depth = 5, eta = 0.3, gamma = 0,
                      colsample_bytree = 0.6, min_child_weight = 1,
                      subsample = 0.5)

## [1]  train-rmse:17.048424 
## [2]  train-rmse:12.442534 
## [3]  train-rmse:9.236254 
## [4]  train-rmse:6.976089 
## [5]  train-rmse:5.544257 
## [6]  train-rmse:4.317732 
## [7]  train-rmse:3.652128 
## [8]  train-rmse:3.216508 
## [9]  train-rmse:2.848051 
## [10] train-rmse:2.588684 
## [11] train-rmse:2.428651 
## [12] train-rmse:2.310961 
## [13] train-rmse:2.233749 
## [14] train-rmse:2.166348 
## [15] train-rmse:1.970309 
## [16] train-rmse:1.905676 
## [17] train-rmse:1.831939 
## [18] train-rmse:1.723449 
## [19] train-rmse:1.664477 
## [20] train-rmse:1.624353 
## [21] train-rmse:1.583477 
## [22] train-rmse:1.556112 
## [23] train-rmse:1.511577 
## [24] train-rmse:1.457161 
## [25] train-rmse:1.409556 
## [26] train-rmse:1.381177 
## [27] train-rmse:1.342655 
## [28] train-rmse:1.317291 
## [29] train-rmse:1.295480 
## [30] train-rmse:1.266985 
## [31] train-rmse:1.243663 
## [32] train-rmse:1.215457 
## [33] train-rmse:1.202552 
## [34] train-rmse:1.166089 
## [35] train-rmse:1.133692 
## [36] train-rmse:1.115225 
## [37] train-rmse:1.080788 
## [38] train-rmse:1.050715 
## [39] train-rmse:1.019804 
## [40] train-rmse:0.997321 
## [41] train-rmse:0.969996 
## [42] train-rmse:0.928180 
## [43] train-rmse:0.905208 
## [44] train-rmse:0.884432 
## [45] train-rmse:0.866685 
## [46] train-rmse:0.848486 
## [47] train-rmse:0.832738 
## [48] train-rmse:0.818310 
## [49] train-rmse:0.796464 
## [50] train-rmse:0.785468

stopCluster(cl) # good practice

Summary

PDP는 낮은 차수의 예측변수들의 부분 집합에 대한 반응변수의 다른 예측변수들의 평균적인 효과를 고려한 의존성을 시각적으로 확인하는데 사용될 수 있다. 이 글에서는 pdp 패키지를 활용하여 R의 여러가지 블랙박스 모델에 대한 PDP를 어떻게 구성하는지 보였다. 또한 관련된 다른 R 패키지들에서도 간단하게 소개하였다. 외삽(extrapolation)과 긴 실행 시간을 피하기 위한 제안들은 예를 통해 논의되고 입증되었다.

pdp의 발전 방향을 다양하게 생각해 볼 수 있다. 예를 들어, censored response를 가진 조건부 랜덤 포리스트와 같이 블랙 박스 생존 모델을 위한 PDP를 구성하는 기능이 유용할 수 있다. 또한 예측 변수 간의 상호작용 강도를 평가하기 위해 partial dependence 기반의 H-statistic(Friedman and Popescu, 2008)를 구현하는 것이 바람직할 것이다.

pdp: An R Package for Constructing Partial Dependence Plots

stat17_hb @ korea.ac.kr