[14-03-12]

1. 단순 선형 회귀(Simple Linear Regression)

1.1 모델 생성

lm()함수를 사용해 선형 회귀를 만들 수 있다.
인자로 y ~ x같은 formula와 data를 받는다.

data(cars)
head(cars)

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

m <- lm(dist ~ speed, cars)
m

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##      -17.58         3.93

cars데이터에서 speed(차량의 속도)를 독립변수, dist(제동거리)를 종속변수로 하여 선형 회귀 분석을 실시했다.

1.2 선형회귀 결과 추출

회귀 계수

절편과 기울기를 보여주는 함수는 coef()이다.

coef(m)

## (Intercept)       speed 
##     -17.579       3.932

예측값(Fitted Values)

각 speed값에 대한 dist의 예측값은 fitted()로 구할 수 있다.
(책에서는 예측값이라고 했지만, 사실 적합값이라고 하는게 더 정확한 표현일 것이다.)

fitted(m)

##      1      2      3      4      5      6      7      8      9     10 
## -1.849 -1.849  9.948  9.948 13.880 17.813 21.745 21.745 21.745 25.677 
##     11     12     13     14     15     16     17     18     19     20 
## 25.677 29.610 29.610 29.610 29.610 33.542 33.542 33.542 33.542 37.475 
##     21     22     23     24     25     26     27     28     29     30 
## 37.475 37.475 37.475 41.407 41.407 41.407 45.339 45.339 49.272 49.272 
##     31     32     33     34     35     36     37     38     39     40 
## 49.272 53.204 53.204 53.204 53.204 57.137 57.137 57.137 61.069 61.069 
##     41     42     43     44     45     46     47     48     49     50 
## 61.069 61.069 61.069 68.934 72.866 76.799 76.799 76.799 76.799 80.731

잔차(Residuals)

예측값과 실제 dist간의 차이에 해당하는 잔차는 residuals()로 구한다.

residuals(m)

##        1        2        3        4        5        6        7        8 
##   3.8495  11.8495  -5.9478  12.0522   2.1198  -7.8126  -3.7450   4.2550 
##        9       10       11       12       13       14       15       16 
##  12.2550  -8.6774   2.3226 -15.6098  -9.6098  -5.6098  -1.6098  -7.5422 
##       17       18       19       20       21       22       23       24 
##   0.4578   0.4578  12.4578 -11.4746  -1.4746  22.5254  42.5254 -21.4070 
##       25       26       27       28       29       30       31       32 
## -15.4070  12.5930 -13.3394  -5.3394 -17.2719  -9.2719   0.7281 -11.2043 
##       33       34       35       36       37       38       39       40 
##   2.7957  22.7957  30.7957 -21.1367 -11.1367  10.8633 -29.0691 -13.0691 
##       41       42       43       44       45       46       47       48 
##  -9.0691  -5.0691   2.9309  -2.9339 -18.8663  -6.7987  15.2013  16.2013 
##       49       50 
##  43.2013   4.2689

잔차 = 실제값-예측값(기댓값, 적합값) 이므로 fitted(m) + residuals(m) = dist와 같다.

fitted(m) + residuals(m)

##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##   2  10   4  22  16  10  18  26  34  17  28  14  20  24  28  26  34  34 
##  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
##  46  26  36  60  80  20  26  54  32  40  32  40  50  42  56  76  84  36 
##  37  38  39  40  41  42  43  44  45  46  47  48  49  50 
##  46  68  32  48  52  56  64  66  54  70  92  93 120  85

계수의 신뢰구간

계수는 정규분포를 따른다. 따라서 t분포를 사용한 신뢰구간을 confint()를 사용해 구할 수 있다.

confint(m)

##               2.5 % 97.5 %
## (Intercept) -31.168 -3.990
## speed         3.097  4.768

잔차 제곱 합

잔차 제곱 합인 SSE는 다음과 같이 deviance()로 구할 수 있다.
회귀분석에서 잔차제곱합은 매우 중요한 역할을 한다.

deviance(m)

## [1] 11354

1.3 예측과 신뢰구간

lm()을 통해 만들어진 모델은 predict()를 사용해 예측할 수 있다.
predict()는 generic function이다. 주어진 모델에 따라 내부적으로 predict.glm(), predict.lm(), predict.nls()등의 함수를 부르게 된다.
predict()는 인자로 모델, 그리고 예측을 수행할 새로운 데이터를 받는다

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##      -17.58         3.93

predict(m, newdata = data.frame(speed = 3))

##      1 
## -5.782

기존의 데이터가 아닌 새로운 데이터가 생겼을 때 값을 예측하는 목적이다.

절편과 기울기를 고려한 예측값의 신뢰구간은 다음과 같다.

predict(m, newdata = data.frame(speed = c(3)), interval = "confidence")

##      fit    lwr   upr
## 1 -5.782 -17.03 5.463

confidence는 신뢰구간을 구한다는 걸 뜻하며, prediction을 쓰면 예측구간을 구할 수 있다.

predict(m, newdata = data.frame(speed = c(3)), interval = "prediction")

##      fit    lwr   upr
## 1 -5.782 -38.69 27.12

예측구간의 길이가 신뢰구간의 길이보다 더 길다. 회귀분석의 이론을 보면 알 수 있다.

1.4 모형 평가

summary()를 사용해 선형 회귀의 결과를 손쉽게 평가할 수 있다.

summary(lm(dist ~ speed, data = cars))

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29.07  -9.53  -2.27   9.21  43.20 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.579      6.758   -2.60    0.012 *  
## speed          3.932      0.416    9.46  1.5e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.4 on 48 degrees of freedom
## Multiple R-squared:  0.651,  Adjusted R-squared:  0.644 
## F-statistic: 89.6 on 1 and 48 DF,  p-value: 1.49e-12

어떤 모형을 사용해 선형회귀를 수행했는지와 잔차, 회귀계수, 그리고 회귀모형의 평가의 내용을 담고 있다.

1.5 ANOVA 및 모델간의 비교

summary()가 보여주는 f 통계량은 anova)함수를 사용해 직접 구할 수 있다.

anova(lm(dist ~ speed, data = cars))

## Analysis of Variance Table
## 
## Response: dist
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## speed      1  21185   21185    89.6 1.5e-12 ***
## Residuals 48  11354     237                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

table을 보면 Sum sq라는 부분이 제곱합을 나타내며, speed열은 SSR, Residuals는 SSE, SSR + SSE = SST로 쉽게 구할 수 있다.

완전 모형과 축소 모형을 직접 비교해 볼 수 있다.

full <- lm(dist ~ speed, data = cars)
reduced <- lm(dist ~ 1, data = cars)
full

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##      -17.58         3.93

reduced

## 
## Call:
## lm(formula = dist ~ 1, data = cars)
## 
## Coefficients:
## (Intercept)  
##          43

여기서 1은 절편(intercept)를 표현하기 위해 사용된 것이다.

anova로 두 모델을 비교할 수 있다.

anova(reduced, full)

## Analysis of Variance Table
## 
## Model 1: dist ~ 1
## Model 2: dist ~ speed
##   Res.Df   RSS Df Sum of Sq    F  Pr(>F)    
## 1     49 32539                              
## 2     48 11354  1     21185 89.6 1.5e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F통계량은 두 모델의 SSR과 자유도를 통해 구할 수 있으며, 두 모델의 차이가 있는지를 검정하기 위한 통계량이다.

1.6 모델 평가 플롯

단순히 plot(m)을 통해 쉽게 차트를 그려볼 수 있다.

par(mfrow = c(2, 2))
plot(m)

plot of chunk unnamed-chunk-15

첫번째 플롯은 잔차 플롯으로, dist의 적합값과 잔차를 그린 플롯이다. 잔차의 등분산성과 독립성을 검정하기 위한 플롯이다.
두번째 플롯은 정규 플롯으로, 잔차의 정규성을 검정하기 위한 플롯이다.
세번째 플롯은 표준화 잔차 플롯으로, 잔차플롯과 비슷하다.
네번째 플롯은 지레-잔차 플롯으로, X값과 Y값의 특이값을 찾아내는데 유용한 플롯이다.

plot()은 generic function이믈 인자로 선형 회귀 모델을 주면 plot.lm()이 자동으로 호출된다. plot.lm()이 그리는 차트는 위의 4가지 외에도 2가지가 더있다.

par(mfrow = c(1, 2))
plot(m, which = c(4, 6))

plot of chunk unnamed-chunk-16

첫번째 플롯은 관찰값별 Cook's Distance(Cook의 거리)를 구한 것으로, 이상치를 판별하는데 사용된다. 두번째 플롯은 Cook의 거리와 지레값을 플롯한 것으로, X공간과 Y공간의 이상치를 동시에 판별하는데 사용된다.

1.7 회귀 직선의 시각화

plot()을 이용해 산점도를 그리고 그 위에 abline()으로 회귀직선을 그린다.

with(cars, plot(speed, dist))
abline(coef(lm(dist ~ speed, data = cars)))

plot of chunk unnamed-chunk-17

추정값의 신뢰대를 포함하여 그리는 방법은 다음과 같다.
- 먼저 speed의 최소값부터 최대값까지 일정한 간격을 두어 데이터를 만든다.
- 이 값의 범위에 대해 신뢰구간을 구한다. 신뢰구간은 predict()를 이용해 구한다.
- 마지막으로 예측값과 상한, 하한이 저장된 행렬을 데이터에 그려 차트를 완성한다. matplot()을 사용하면 된다.

speed <- seq(min(cars$speed), max(cars$speed), 0.1)
ys <- predict(lm(dist ~ speed, data = cars), newdata = data.frame(speed = speed), 
    interval = "confidence")
matplot(speed, ys, type = "l")

plot of chunk unnamed-chunk-18

matplot타입은 (“p”, points; “l”, lines; “b”, both; “n”, none; or “h”, high-density)이다.

2. 중선형회귀(Multiple Linear Regression)

2.1 모델 생성 및 평가

설명변수(독립변수, 예측변수)를 “+"로 연결해 나열하면서 모형을 생성한다.

m <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
summary(m)

## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
##     data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8282 -0.2199  0.0187  0.1971  0.8457 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.8560     0.2508    7.40  9.9e-12 ***
## Sepal.Width    0.6508     0.0666    9.77  < 2e-16 ***
## Petal.Length   0.7091     0.0567   12.50  < 2e-16 ***
## Petal.Width   -0.5565     0.1275   -4.36  2.4e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.315 on 146 degrees of freedom
## Multiple R-squared:  0.859,  Adjusted R-squared:  0.856 
## F-statistic:  296 on 3 and 146 DF,  p-value: <2e-16

2.2 범주형 변수

Species같은 범주형 변수를 설명변수에 포함시킨다.

m <- lm(Sepal.Length ~ ., data = iris)
summary(m)

## 
## Call:
## lm(formula = Sepal.Length ~ ., data = iris)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.794 -0.219  0.009  0.203  0.731 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.1713     0.2798    7.76  1.4e-12 ***
## Sepal.Width         0.4959     0.0861    5.76  4.9e-08 ***
## Petal.Length        0.8292     0.0685   12.10  < 2e-16 ***
## Petal.Width        -0.3152     0.1512   -2.08   0.0389 *  
## Speciesversicolor  -0.7236     0.2402   -3.01   0.0031 ** 
## Speciesvirginica   -1.0235     0.3337   -3.07   0.0026 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.307 on 144 degrees of freedom
## Multiple R-squared:  0.867,  Adjusted R-squared:  0.863 
## F-statistic:  188 on 5 and 144 DF,  p-value: <2e-16

Species를 추가한 모델에는 두개의 계수가 더 보인다. setosa가 없는 이유는 Species를 2개의 가변수를 사용해 표편할 때, 두개 값 모두 0이기 때문이다.

anova()를 사용해 분산분석 결과를 살펴보자.

anova(m)

## Analysis of Variance Table
## 
## Response: Sepal.Length
##               Df Sum Sq Mean Sq F value  Pr(>F)    
## Sepal.Width    1    1.4     1.4   15.00 0.00016 ***
## Petal.Length   1   84.4    84.4  896.81 < 2e-16 ***
## Petal.Width    1    1.9     1.9   20.01 1.6e-05 ***
## Species        2    0.9     0.4    4.72 0.01033 *  
## Residuals    144   13.6     0.1                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary()와는 다르게 Species가 하나의 설명변수로 묶여서 표시된다.

2.3 중선형회귀모형의 시각화

2차원 이상의 데이터는 시각화하기 쉽지 않으므로, 설명변수로 Species와 Sepal.Width만 사용해 시각화를 해보자.

with(iris, plot(Sepal.Width, Sepal.Length, cex = 0.7, pch = as.numeric(Species)))
legend("topright", legend = levels(iris$Species), pch = 1:3, bg = "white")

plot of chunk unnamed-chunk-22

levels()는 범주형 변수의 각 수준의 이름을 반환한다.

이제 회귀직선을 그려보자.

m <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
abline(2.25, 0.8, lty = 1)

## Error: plot.new has not been called yet

abline(2.25 + 1.45, 0.8, lty = 2)

## Error: plot.new has not been called yet

abline(2.25 + 1.94, 0.8, lty = 3)

## Error: plot.new has not been called yet

abline()의 첫번째 인자는 절편, 두번째 인자는 기울기이며 lty는 선의 유형을 지정한다.

2.4 표현식을 위한 I()의 사용

I()안에 원하는 수식을 집어넣어서, 종속변수와 독립변수의 관계를 1차식이 아닌 다른 형태로 표현할 수 있다.

x <- 1:1000
y <- x^2 + 3 * x + 5 + rnorm(1000)
lm(y ~ I(x^2) + x)

## 
## Call:
## lm(formula = y ~ I(x^2) + x)
## 
## Coefficients:
## (Intercept)       I(x^2)            x  
##        4.94         1.00         3.00

lm(y ~ x^2 + x)

## 
## Call:
## lm(formula = y ~ x^2 + x)
## 
## Coefficients:
## (Intercept)            x  
##     -167162         1004

rnorm()은 오차항이 N(0,1)을 따르기 때문에 지정한 것이다.
I(x^2)는 실제로 x^2로 해석되지만, x^2는 상호작용으로 해석되기 때문에 두 결과가 다르다.

다른 예시를 살펴보자.

x1 <- 1:1000
x2 <- 3 * x1
y <- 3 * (x1 + x2) + rnorm(1000)
lm(y ~ I(x1 + x2))

## 
## Call:
## lm(formula = y ~ I(x1 + x2))
## 
## Coefficients:
## (Intercept)   I(x1 + x2)  
##     -0.0188       3.0000

2.5 변수의 변환

변수를 변환한후 formula에 곧바로 넣으면 된다.

x <- 101:200
y <- exp(3 * x + rnorm(100))
lm(log(y) ~ x)

## 
## Call:
## lm(formula = log(y) ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      -0.345        3.001

설명변수에 로그를 취하는 예를 보자.

x <- 101:200
y <- log(x) + rnorm(100)
lm(y ~ log(x))

## 
## Call:
## lm(formula = y ~ log(x))
## 
## Coefficients:
## (Intercept)       log(x)  
##       -1.19         1.20

2.6 상호 작용

lm()의 formula로 다음과 같이 표현한다.
- y ~ x1 + x2 + x1:x2
- y ~ x1 * x2
3개 이상의 변수 A, B, C가 있다면 어떻게 될까?
- A + B + C + A:B + A:C + B:C + A:B:C
- A * B * C
- 최대 2개까지 상호작용한다면 (A + B + C)²
4개 이상의 변수 A, B, C, D가 있다면
- A * B * C
- 최대 3개까지 상호작용한다면 (A + B + C + D)³

데이터셋에 적용해보자.

data(Orange)
with(Orange, plot(Tree, circumference, xlab = "tree", ylab = "circumference"))

plot of chunk unnamed-chunk-28

상호작용 그림(interaction plot)을 그려보자.

with(Orange, interaction.plot(age, Tree, circumference))

plot of chunk unnamed-chunk-29

lm()을 사용해 다중선형회귀를 수행하기 전에 Tree열을 순서가 없는 범주형 변수로 바꿔줘야한다.

Orange[, "fTree"] <- factor(Orange[, "Tree"], ordered = FALSE)
m <- lm(circumference ~ fTree * age, data = Orange)
summary(m)

## 
## Call:
## lm(formula = circumference ~ fTree * age, data = Orange)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18.06  -6.64  -1.48   8.07  16.65 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.92e+01   8.46e+00    2.27  0.03206 *  
## fTree1       5.23e+00   1.20e+01    0.44  0.66544    
## fTree5      -1.04e+01   1.20e+01   -0.87  0.39086    
## fTree2       7.57e-01   1.20e+01    0.06  0.95002    
## fTree4      -4.57e+00   1.20e+01   -0.38  0.70590    
## age          8.11e-02   8.12e-03    9.99  3.3e-10 ***
## fTree1:age   3.66e-04   1.15e-02    0.03  0.97485    
## fTree5:age   2.99e-02   1.15e-02    2.61  0.01523 *  
## fTree2:age   4.40e-02   1.15e-02    3.83  0.00077 ***
## fTree4:age   5.41e-02   1.15e-02    4.71  7.9e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.4 on 25 degrees of freedom
## Multiple R-squared:  0.976,  Adjusted R-squared:  0.967 
## F-statistic:  112 on 9 and 25 DF,  p-value: <2e-16

anova(m)

## Analysis of Variance Table
## 
## Response: circumference
##           Df Sum Sq Mean Sq F value  Pr(>F)    
## fTree      4  11841    2960   27.30 8.4e-09 ***
## age        1  93772   93772  864.73 < 2e-16 ***
## fTree:age  4   4043    1011    9.32 9.4e-05 ***
## Residuals 25   2711     108                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

상호작용이 있을때, model.matrix는 상호작용을 고려한 열들을 추가로 데이터에 담고 있다.

mm <- model.matrix(m)
head(mm)

##   (Intercept) fTree1 fTree5 fTree2 fTree4  age fTree1:age fTree5:age
## 1           1      1      0      0      0  118        118          0
## 2           1      1      0      0      0  484        484          0
## 3           1      1      0      0      0  664        664          0
## 4           1      1      0      0      0 1004       1004          0
## 5           1      1      0      0      0 1231       1231          0
## 6           1      1      0      0      0 1372       1372          0
##   fTree2:age fTree4:age
## 1          0          0
## 2          0          0
## 3          0          0
## 4          0          0
## 5          0          0
## 6          0          0

3. 이상치(outlier)

이상치 검출은 잔차, 특히 외면 스튜던트화 잔차를 사용한다.
rstudent()를 사용해 잔차를 구한다.

m <- lm(circumference ~ age + I(age), data = Orange)
rstudent(m)

##          1          2          3          4          5          6 
##  6.372e-05 -4.736e-01 -5.474e-02 -4.051e-01 -1.250e+00 -9.461e-01 
##          7          8          9         10         11         12 
## -1.885e+00  1.318e-01 -3.259e-03  9.738e-01  1.360e+00  9.960e-01 
##         13         14         15         16         17         18 
##  1.745e+00  7.284e-01  6.372e-05 -7.774e-01 -5.647e-01 -7.043e-01 
##         19         20         21         22         23         24 
## -1.481e+00 -1.080e+00 -2.144e+00  8.788e-02 -3.019e-01  1.018e+00 
##         25         26         27         28         29         30 
##  1.882e+00  1.311e+00  2.045e+00  1.226e+00  6.372e-05 -8.653e-01 
##         31         32         33         34         35 
## -3.088e-01  1.697e-02 -2.897e-01  4.323e-01 -4.040e-01

이상치는 car::outlierTest()를 사용해 간단하게 구할 수 있다.

library(car)

## Warning: package 'car' was built under R version 3.0.3

outlierTest(m)

## 
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##    rstudent unadjusted p-value Bonferonni p
## 21   -2.144            0.03976           NA

23번 데이터가 가장 큰 rstudent 값을 가지고 있지만 이상치는 발견되지 않았다. 이상치를 임의로 추가해서 어떻게 결과가 달라지는지 살펴보자.

data(Orange)
Orange <- rbind(Orange, data.frame(Tree = as.factor(c(6, 6, 6)), age = c(118, 
    484, 664), circumference = c(177, 50, 30)))
tail(Orange)

##    Tree  age circumference
## 33    5 1231           142
## 34    5 1372           174
## 35    5 1582           177
## 36    6  118           177
## 37    6  484            50
## 38    6  664            30

m <- lm(circumference ~ age + I(age^2), data = Orange)
outlierTest(m)

##    rstudent unadjusted p-value Bonferonni p
## 36    5.538          3.429e-06    0.0001303

36번째 데이터가 이상치로 검출되었다.

4. 변수 선택

4.1 단계적 변수 선택

중선형 회귀 모형에서 설명변수를 선택하는 방법은 다음과 같다.
- 전진 선택법은 절편만 있는 모델에서 기준 통계치를 가장 많이 개선시키는 변수를 차례로 추가하는 방법이다.
- 변수 소거법은 모든 변수가 포함된 모델에서 기준 통계치에 가장 도움이 되지 않는 변수를 하나씩 제거하는 방법이다.
- 단계적 방법은 모든 변수가 포함된 모델에서 출발한다. 그리고 기준 통계치에 가장 도움이 안되는 변수를 제거하거나 빠져있는 변수 중에서 기준 통계치를 가장 개선시키는 변수를 추가한다. 이런 추가와 제거를 반복하는 방법니다.
step(모델, direction="선택방법”“)을 사용해 변수 선택을 수행한다.
- 전진 선택법은 forward, 변수 소거법은 backward, 단계적 방법은 both이다.

library(mlbench)
data(BostonHousing)
head(BostonHousing)

##      crim zn indus chas   nox    rm  age   dis rad tax ptratio     b lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.090   1 296    15.3 396.9  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.967   2 242    17.8 396.9  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.967   2 242    17.8 392.8  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.062   3 222    18.7 394.6  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.062   3 222    18.7 396.9  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.062   3 222    18.7 394.1  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7


m <- lm(medv ~ ., data = BostonHousing)
m2 <- step(m, direction = "both")

## Start:  AIC=1590
## medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
##     tax + ptratio + b + lstat
## 
##           Df Sum of Sq   RSS  AIC
## - age      1         0 11079 1588
## - indus    1         3 11081 1588
## <none>                 11079 1590
## - chas     1       219 11298 1598
## - tax      1       242 11321 1599
## - crim     1       243 11322 1599
## - zn       1       257 11336 1599
## - b        1       271 11349 1600
## - rad      1       479 11558 1609
## - nox      1       487 11566 1609
## - ptratio  1      1194 12273 1639
## - dis      1      1232 12311 1641
## - rm       1      1871 12950 1667
## - lstat    1      2411 13490 1687
## 
## Step:  AIC=1588
## medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax + 
##     ptratio + b + lstat
## 
##           Df Sum of Sq   RSS  AIC
## - indus    1         3 11081 1586
## <none>                 11079 1588
## + age      1         0 11079 1590
## - chas     1       220 11299 1596
## - tax      1       242 11321 1597
## - crim     1       243 11322 1597
## - zn       1       260 11339 1597
## - b        1       272 11351 1598
## - rad      1       481 11560 1607
## - nox      1       521 11600 1609
## - ptratio  1      1200 12279 1638
## - dis      1      1352 12431 1644
## - rm       1      1960 13038 1668
## - lstat    1      2719 13798 1697
## 
## Step:  AIC=1586
## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
##     b + lstat
## 
##           Df Sum of Sq   RSS  AIC
## <none>                 11081 1586
## + indus    1         3 11079 1588
## + age      1         0 11081 1588
## - chas     1       227 11309 1594
## - crim     1       245 11327 1595
## - zn       1       258 11339 1595
## - b        1       271 11352 1596
## - tax      1       274 11355 1596
## - rad      1       501 11582 1606
## - nox      1       542 11623 1608
## - ptratio  1      1206 12288 1636
## - dis      1      1449 12530 1646
## - rm       1      1964 13045 1666
## - lstat    1      2723 13805 1695

step()의 첫번째 출력결과는 다음과 같다 <1>

가장 첫줄은 모델을 나타내며, 그때의 AIC값이 1589.64이다.
그 뒤에 나열된 결과는 각 변수를 삭제 했을때 (”- 설명변수명"의 형식으로 표시) AIC의 변화를 표현하고 있다.
AIC가 가장 작아지는 age를 제거한후 step()함수는 다시 한번 단계적 방법을 수행한다. 두번째 결과는 다음과 같다 <2> 변수에서 age가 제거된 모델이 나타나며, AIC값이 나타난다.
이번에는 각 변수를 제거하는 것 뿐아니라, 제거된 변수를 추가했을 경우까지 검토한다.

이런식으로 단계적으로 반복하여 최종모델을 결정 지은 것이다.

formula(m2)

## medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
##     b + lstat

4.2 모든 경우에 대한 비교

N개의 설명변수가 있다면 2^N개의 회귀 모델을 만들 수 있고, 이 들을 모두 비교해 볼 수도 있다.
leaps::regsubsets()을 사용한다.

library(leaps)
m <- regsubsets(medv ~ ., data = BostonHousing)
summary(m)

## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = BostonHousing)
## 13 Variables  (and intercept)
##         Forced in Forced out
## crim        FALSE      FALSE
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas1       FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## b           FALSE      FALSE
## lstat       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          crim zn  indus chas1 nox rm  age dis rad tax ptratio b   lstat
## 1  ( 1 ) " "  " " " "   " "   " " " " " " " " " " " " " "     " " "*"  
## 2  ( 1 ) " "  " " " "   " "   " " "*" " " " " " " " " " "     " " "*"  
## 3  ( 1 ) " "  " " " "   " "   " " "*" " " " " " " " " "*"     " " "*"  
## 4  ( 1 ) " "  " " " "   " "   " " "*" " " "*" " " " " "*"     " " "*"  
## 5  ( 1 ) " "  " " " "   " "   "*" "*" " " "*" " " " " "*"     " " "*"  
## 6  ( 1 ) " "  " " " "   "*"   "*" "*" " " "*" " " " " "*"     " " "*"  
## 7  ( 1 ) " "  " " " "   "*"   "*" "*" " " "*" " " " " "*"     "*" "*"  
## 8  ( 1 ) " "  "*" " "   "*"   "*" "*" " " "*" " " " " "*"     "*" "*"

제일 하단에 1,2,3,…,8로 표시된 부분은 변수의 갯수고, “*"는 변수가 해당 갯수만큼 사용되었을때 최적의 모델에 포함된 변수들을 표시한다.
예를 들어 첫줄에는, 변수를 하나만 포함시킨다면 lstat를 포함한 모델이 가장 좋다는 의미이다. 두번째 줄에서는, 변수를 두개만 포함시킨다면 rm, lstat을 포함한 모델이 가장 우수하다는 것을 알 수 있다.

regsubsets()의 결과에 summary()를 적용하면 BIC, Adjusted R squared 등을 알 수 있다.

str(summary(regsubsets(medv ~ ., data = BostonHousing)))

## List of 8
##  $ which : logi [1:8, 1:14] TRUE TRUE TRUE TRUE TRUE TRUE ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:8] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:14] "(Intercept)" "crim" "zn" "indus" ...
##  $ rsq   : num [1:8] 0.544 0.639 0.679 0.69 0.708 ...
##  $ rss   : num [1:8] 19472 15439 13728 13229 12469 ...
##  $ adjr2 : num [1:8] 0.543 0.637 0.677 0.688 0.705 ...
##  $ cp    : num [1:8] 362.8 185.6 111.6 91.5 59.8 ...
##  $ bic   : num [1:8] -385 -496 -549 -562 -586 ...
##  $ outmat: chr [1:8, 1:13] " " " " " " " " ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:8] "1  ( 1 )" "2  ( 1 )" "3  ( 1 )" "4  ( 1 )" ...
##   .. ..$ : chr [1:13] "crim" "zn" "indus" "chas1" ...
##  $ obj   :List of 28
##   ..$ np       : int 14
##   ..$ nrbar    : int 91
##   ..$ d        : num [1:14] 5.06e+02 2.49e+02 2.14e+03 3.18e+01 3.81e+06 ...
##   ..$ rbar     : num [1:91] 6.2846 3.795 0.0692 356.674 408.2372 ...
##   ..$ thetab   : num [1:14] 22.5328 9.1021 0.4888 4.6322 0.0229 ...
##   ..$ first    : int 2
##   ..$ last     : int 14
##   ..$ vorder   : int [1:14] 1 7 9 5 13 11 12 4 6 8 ...
##   ..$ tol      : num [1:14] 1.12e-08 8.45e-08 1.01e-07 4.15e-09 6.07e-06 ...
##   ..$ rss      : num [1:14] 42716 22062 21549 20866 18868 ...
##   ..$ bound    : num [1:14] 42716 19472 15439 13728 13229 ...
##   ..$ nvmax    : int 9
##   ..$ ress     : num [1:9, 1] 42716 19472 15439 13728 13229 ...
##   ..$ ir       : int 9
##   ..$ nbest    : int 1
##   ..$ lopt     : int [1:45, 1] 1 1 14 1 14 7 1 14 7 12 ...
##   ..$ il       : int 45
##   ..$ ier      : int 0
##   ..$ xnames   : chr [1:14] "(Intercept)" "crim" "zn" "indus" ...
##   ..$ method   : chr "exhaustive"
##   ..$ force.in : Named logi [1:14] TRUE FALSE FALSE FALSE FALSE FALSE ...
##   .. ..- attr(*, "names")= chr [1:14] "" "crim" "zn" "indus" ...
##   ..$ force.out: Named logi [1:14] FALSE FALSE FALSE FALSE FALSE FALSE ...
##   .. ..- attr(*, "names")= chr [1:14] "" "crim" "zn" "indus" ...
##   ..$ sserr    : num 11079
##   ..$ intercept: logi TRUE
##   ..$ lindep   : logi [1:14] FALSE FALSE FALSE FALSE FALSE FALSE ...
##   ..$ nullrss  : num 42716
##   ..$ nn       : int 506
##   ..$ call     : language regsubsets.formula(medv ~ ., data = BostonHousing)
##   ..- attr(*, "class")= chr "regsubsets"
##  - attr(*, "class")= chr "summary.regsubsets"

summary(regsubsets(medv ~ ., data = BostonHousing))$bic

## [1] -385.1 -496.3 -549.5 -562.0 -585.7 -593.0 -598.2 -600.2

summary(regsubsets(medv ~ ., data = BostonHousing))$adjr2

## [1] 0.5432 0.6371 0.6767 0.6878 0.7052 0.7124 0.7183 0.7222

plot()을 사용해 시각화 해볼 수도 있다.

plot(regsubsets(medv ~ ., data = BostonHousing), scale = "adjr2")

plot of chunk unnamed-chunk-39

scale인자에 "bic”(BIC), “Cp”(Cp), “adjr2”(Adjusted R squared), “r2”(R squared)를 줘서 그려볼 수 있다.