The attached “fat.csv” dataset (Penrose et al., 1985) contains 252 observations and 16 variables, described below.
The percentage of body fat is a measure used to assess a person’s health and is obtained through an underwater weighing technique. Build a formula to predict an individual’s body fat from the variables in the dataset using (1) the best subset approach based on adjusted R^2, Mallows’ Cp, and BIC; (2) minimizing the test MSE with the validation set method; (3) minimizing the test MSE with LOOCV; and (4) minimizing the test MSE with 10-fold CV.
# Best subset Selection
fat = read.csv('/Users/statstics/Downloads/fat.csv', header = T) # macOS file path
names(fat)
## [1] "brozek" "age" "weight" "height" "adipos" "free" "neck"
## [8] "chest" "abdom" "hip" "thigh" "knee" "ankle" "biceps"
## [15] "forearm" "wrist"
dim(fat)
## [1] 252 16
str(fat)
## 'data.frame': 252 obs. of 16 variables:
## $ brozek : num 12.6 6.9 24.6 10.9 27.8 20.6 19 12.8 5.1 12 ...
## $ age : int 23 22 22 26 24 24 26 25 25 23 ...
## $ weight : num 154 173 154 185 184 ...
## $ height : num 67.8 72.2 66.2 72.2 71.2 ...
## $ adipos : num 23.7 23.4 24.7 24.9 25.6 26.5 26.2 23.6 24.6 25.8 ...
## $ free : num 135 161 116 165 133 ...
## $ neck : num 36.2 38.5 34 37.4 34.4 39 36.4 37.8 38.1 42.1 ...
## $ chest : num 93.1 93.6 95.8 101.8 97.3 ...
## $ abdom : num 85.2 83 87.9 86.4 100 94.4 90.7 88.5 82.5 88.6 ...
## $ hip : num 94.5 98.7 99.2 101.2 101.9 ...
## $ thigh : num 59 58.7 59.6 60.1 63.2 66 58.4 60 62.9 63.1 ...
## $ knee : num 37.3 37.3 38.9 37.3 42.2 42 38.3 39.4 38.3 41.7 ...
## $ ankle : num 21.9 23.4 24 22.8 24 25.6 22.9 23.2 23.8 25 ...
## $ biceps : num 32 30.5 28.8 32.4 32.2 35.7 31.9 30.5 35.9 35.6 ...
## $ forearm: num 27.4 28.9 25.2 29.4 27.7 30.6 27.8 29 31.1 30 ...
## $ wrist : num 17.1 18.2 16.6 18.2 17.7 18.8 17.7 18.8 18.2 19.2 ...
summary(fat)
## brozek age weight height
## Min. : 0.00 Min. :22.00 Min. :118.5 Min. :29.50
## 1st Qu.:12.80 1st Qu.:35.75 1st Qu.:159.0 1st Qu.:68.25
## Median :19.00 Median :43.00 Median :176.5 Median :70.00
## Mean :18.94 Mean :44.88 Mean :178.9 Mean :70.15
## 3rd Qu.:24.60 3rd Qu.:54.00 3rd Qu.:197.0 3rd Qu.:72.25
## Max. :45.10 Max. :81.00 Max. :363.1 Max. :77.75
## adipos free neck chest
## Min. :18.10 Min. :105.9 Min. :31.10 Min. : 79.30
## 1st Qu.:23.10 1st Qu.:131.3 1st Qu.:36.40 1st Qu.: 94.35
## Median :25.05 Median :141.6 Median :38.00 Median : 99.65
## Mean :25.44 Mean :143.7 Mean :37.99 Mean :100.82
## 3rd Qu.:27.32 3rd Qu.:153.9 3rd Qu.:39.42 3rd Qu.:105.38
## Max. :48.90 Max. :240.5 Max. :51.20 Max. :136.20
## abdom hip thigh knee
## Min. : 69.40 Min. : 85.0 Min. :47.20 Min. :33.00
## 1st Qu.: 84.58 1st Qu.: 95.5 1st Qu.:56.00 1st Qu.:36.98
## Median : 90.95 Median : 99.3 Median :59.00 Median :38.50
## Mean : 92.56 Mean : 99.9 Mean :59.41 Mean :38.59
## 3rd Qu.: 99.33 3rd Qu.:103.5 3rd Qu.:62.35 3rd Qu.:39.92
## Max. :148.10 Max. :147.7 Max. :87.30 Max. :49.10
## ankle biceps forearm wrist
## Min. :19.1 Min. :24.80 Min. :21.00 Min. :15.80
## 1st Qu.:22.0 1st Qu.:30.20 1st Qu.:27.30 1st Qu.:17.60
## Median :22.8 Median :32.05 Median :28.70 Median :18.30
## Mean :23.1 Mean :32.27 Mean :28.66 Mean :18.23
## 3rd Qu.:24.0 3rd Qu.:34.33 3rd Qu.:30.00 3rd Qu.:18.80
## Max. :33.9 Max. :45.00 Max. :34.90 Max. :21.40
sum(is.na(fat$brozek))
## [1] 0
fat=na.omit(fat)
dim(fat)
## [1] 252 16
sum(is.na(fat))
## [1] 0
The data consist of 252 observations and 16 variables, and fat.csv was confirmed to contain no missing values.
library(leaps)
regfit.full=regsubsets(brozek~.,data=fat)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(brozek ~ ., data = fat)
## 15 Variables (and intercept)
## Forced in Forced out
## age FALSE FALSE
## weight FALSE FALSE
## height FALSE FALSE
## adipos FALSE FALSE
## free FALSE FALSE
## neck FALSE FALSE
## chest FALSE FALSE
## abdom FALSE FALSE
## hip FALSE FALSE
## thigh FALSE FALSE
## knee FALSE FALSE
## ankle FALSE FALSE
## biceps FALSE FALSE
## forearm FALSE FALSE
## wrist FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## age weight height adipos free neck chest abdom hip thigh knee
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " "
## 2 ( 1 ) " " "*" " " " " "*" " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " "*" " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " "*" " " " " "*" " " " " " "
## 5 ( 1 ) " " "*" " " "*" "*" " " " " "*" " " " " " "
## 6 ( 1 ) " " "*" " " "*" "*" " " " " "*" " " "*" " "
## 7 ( 1 ) " " "*" " " "*" "*" " " "*" "*" " " "*" " "
## 8 ( 1 ) " " "*" " " "*" "*" " " "*" "*" " " "*" " "
## ankle biceps forearm wrist
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " "*" " "
## 4 ( 1 ) " " " " "*" " "
## 5 ( 1 ) " " " " "*" " "
## 6 ( 1 ) " " " " "*" " "
## 7 ( 1 ) " " " " "*" " "
## 8 ( 1 ) "*" " " "*" " "
In the full-model regression, every predictor was considered for the response brozek (body fat). Among the predictors, the two that enter first are weight and free. However, the default nvmax in regsubsets() is 8, so we increase it to examine all subset sizes; since the model has 15 predictors, we rerun the function with nvmax = 15.
regfit.full=regsubsets(brozek~.,data=fat,nvmax=15)
reg.summary=summary(regfit.full)
reg.summary
## Subset selection object
## Call: regsubsets.formula(brozek ~ ., data = fat, nvmax = 15)
## 15 Variables (and intercept)
## Forced in Forced out
## age FALSE FALSE
## weight FALSE FALSE
## height FALSE FALSE
## adipos FALSE FALSE
## free FALSE FALSE
## neck FALSE FALSE
## chest FALSE FALSE
## abdom FALSE FALSE
## hip FALSE FALSE
## thigh FALSE FALSE
## knee FALSE FALSE
## ankle FALSE FALSE
## biceps FALSE FALSE
## forearm FALSE FALSE
## wrist FALSE FALSE
## 1 subsets of each size up to 15
## Selection Algorithm: exhaustive
## age weight height adipos free neck chest abdom hip thigh knee
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " "
## 2 ( 1 ) " " "*" " " " " "*" " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " "*" " " " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " "*" " " " " "*" " " " " " "
## 5 ( 1 ) " " "*" " " "*" "*" " " " " "*" " " " " " "
## 6 ( 1 ) " " "*" " " "*" "*" " " " " "*" " " "*" " "
## 7 ( 1 ) " " "*" " " "*" "*" " " "*" "*" " " "*" " "
## 8 ( 1 ) " " "*" " " "*" "*" " " "*" "*" " " "*" " "
## 9 ( 1 ) " " "*" " " "*" "*" " " "*" "*" " " "*" " "
## 10 ( 1 ) " " "*" " " "*" "*" " " "*" "*" " " "*" " "
## 11 ( 1 ) " " "*" "*" "*" "*" " " "*" "*" " " "*" " "
## 12 ( 1 ) " " "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " " "*" "*"
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## ankle biceps forearm wrist
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " "*" " "
## 4 ( 1 ) " " " " "*" " "
## 5 ( 1 ) " " " " "*" " "
## 6 ( 1 ) " " " " "*" " "
## 7 ( 1 ) " " " " "*" " "
## 8 ( 1 ) "*" " " "*" " "
## 9 ( 1 ) "*" "*" "*" " "
## 10 ( 1 ) "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*"
## 14 ( 1 ) "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*"
The result matches the earlier run with the default nvmax = 8: the two strongest predictors in the full model are weight and free.
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
reg.summary$rsq
## [1] 0.6621178 0.9580617 0.9608161 0.9621765 0.9649421 0.9667635 0.9682022
## [8] 0.9688390 0.9693710 0.9696405 0.9698293 0.9699951 0.9700291 0.9700390
## [15] 0.9700398
In terms of how well the predictors explain the response brozek, the R^2 statistic increases monotonically as variables are added. However, the R^2 with only two variables is already nearly the same as with all variables included, so by the law of parsimony two variables was judged to be the optimal number.
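As a quick sanity check (not part of the original regsubsets output), the size-2 model can be refit with lm() to confirm that its R^2 matches reg.summary$rsq[2]; this assumes, as the output above shows, that the size-2 subset is weight and free.
fit2 <- lm(brozek ~ weight + free, data = fat)   # size-2 model identified by best subset selection
summary(fit2)$r.squared                          # should be close to 0.958 (reg.summary$rsq[2])
summary(fit2)$adj.r.squared                      # adjusted R^2 of the same model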
# RSS
par(mfrow=c(2,2))
plot(reg.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")
# Adjusted RSq
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
which.max(reg.summary$adjr2)
## [1] 12
points(12,reg.summary$adjr2[12], col="red",cex=2,pch=20)
# Cp
plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type='l')
which.min(reg.summary$cp)
## [1] 10
points(10,reg.summary$cp[10],col="red",cex=2,pch=20)
# BIC
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type='l')
which.min(reg.summary$bic)
## [1] 7
points(7,reg.summary$bic[7],col="red",cex=2,pch=20)
coef(regfit.full,6)
## (Intercept) weight adipos free abdom thigh
## 4.5727334 0.3697963 -0.3880614 -0.5137781 0.1629766 0.1317591
## forearm
## 0.3138644
Adjusted R^2 is highest with 12 variables, while the Cp and BIC statistics are lowest with 10 and 7 variables, respectively.
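A small sketch of how the model chosen by each criterion could be inspected, using the sizes identified above (12 for adjusted R^2, 10 for Cp, 7 for BIC):
coef(regfit.full, which.max(reg.summary$adjr2))  # 12-variable model (adjusted R^2)
coef(regfit.full, which.min(reg.summary$cp))     # 10-variable model (Cp)
coef(regfit.full, which.min(reg.summary$bic))    # 7-variable model (BIC)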
#regfit.full plot
par(mfrow=c(2,2))
plot(regfit.full, scale = "bic", main = "BIC")
plot(regfit.full, scale = "adjr2", main = "Adjusted R^2")
plot(regfit.full, scale = "Cp", main = "Mellow's Cp")
plot(regfit.full, scale = "r2", main = "r2")
In all four plots, the variables shaded black across the most rows were weight and free. We therefore look up their regression coefficients.
coef(regfit.full,2)
## (Intercept) weight free
## 19.6535898 0.4229089 -0.5314993
If we denote the response brozek by Y and the predictors weight by X1 and free by X2, the fitted regression equation is
Y = 19.6535898 + 0.4229089 X1 - 0.5314993 X2,
with an additive random error term in the underlying model.
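To illustrate how the fitted equation would be used for prediction, a minimal sketch follows; the measurement values are hypothetical, chosen only to be of plausible magnitude given the summary statistics above.
fit2 <- lm(brozek ~ weight + free, data = fat)       # same 2-variable model, fit with lm()
new_person <- data.frame(weight = 180, free = 145)   # hypothetical measurements (illustration only)
predict(fit2, newdata = new_person)                  # predicted body fat percentage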
# Validation set approach (manual split)
set.seed(20144607)
train=sample(c(TRUE,FALSE), nrow(fat), replace = TRUE)
test=(!train)
library(leaps)
regfit.best <- regsubsets(brozek ~ ., data = fat[train, ], nvmax = 15)
test.mat <- model.matrix(brozek~., data = fat[test, ])
val.error <- rep(NA, 15)
These lines are used when constructing the training and test sets by hand; a sketch of the corresponding error loop for this single split is shown below.
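A minimal sketch of the validation-error loop for this one manual split, using the objects created above (it is the same inner loop that the automated version below repeats over 20 random splits):
for (i in 1:15) {
  coefi <- coef(regfit.best, id = i)                 # coefficients of the size-i model
  pred <- test.mat[, names(coefi)] %*% coefi         # predictions on the held-out test set
  val.error[i] <- mean((fat$brozek[test] - pred)^2)  # test MSE for each model size
}
which.min(val.error)                                 # size with the lowest test MSE for this split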
Next, the procedure is automated and repeated over 20 random splits.
for (j in 1:20) {
  set.seed(20144607 * j)
  # draw a new random training/test split for each repetition
  train <- sample(c(TRUE, FALSE), nrow(fat), replace = TRUE)
  test <- (!train)
  regfit.best <- regsubsets(brozek ~ ., data = fat[train, ], nvmax = 15)
  test.mat <- model.matrix(brozek ~ ., data = fat[test, ])
  val.error <- rep(NA, 15)
  for (i in 1:15) {
    coefi <- coef(regfit.best, id = i)
    pred <- test.mat[, names(coefi)] %*% coefi
    val.error[i] <- mean((fat$brozek[test] - pred)^2)
  }
  # plot the test-MSE curve for the first split, then overlay the remaining splits
  if (j < 2) {
    plot(val.error, main = "Validation set method", type = "l",
         ylim = c(0, max(val.error) * 1.5), xlab = "index")
  } else {
    lines(val.error, type = "l", col = j)
  }
}
With the validation set approach, the test MSE was lowest with two variables and fluctuated irregularly beyond that. Choosing two variables is therefore the most stable option: it follows the law of parsimony while also minimizing the test MSE.
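Once a size is chosen by validation, the usual next step is to refit best subset selection on the full data set to report the final coefficients; a brief sketch (the result should match the weight/free model shown earlier):
regfit.final <- regsubsets(brozek ~ ., data = fat, nvmax = 2)
coef(regfit.final, 2)   # expected to recover the intercept/weight/free coefficients reported above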
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])   # recover the model formula from the regsubsets call
  mat <- model.matrix(form, newdata)     # build the design matrix for the new data
  coefi <- coef(object, id = id)         # coefficients of the size-id model
  xvars <- names(coefi)
  mat[, xvars] %*% coefi                 # predicted values
}
k=nrow(fat)
set.seed(20144607)
cv.errors <- matrix(NA, k, 15, dimnames=list(NULL, paste(1:15)))
library(leaps)
for (j in 1:k) {
  # leave out observation j and fit best subset selection on the remaining 251 observations
  best.fit <- regsubsets(brozek ~ ., data = fat[-j, ], nvmax = 15)
  for (i in 1:15) {
    pred <- predict(best.fit, fat[j, ], id = i)
    cv.errors[j, i] <- mean((fat$brozek[j] - pred)^2)
  }
}
mean.cv.errors <- apply(cv.errors, 2, mean)
plot(c(1:15),mean.cv.errors,main = "LOOCV method",type="o",xlab="Index",col="darkcyan")
With LOOCV there is no randomness, so a single averaged error curve is produced. The MSE drops sharply at two variables, and adding more variables changes it only slightly; by the law of parsimony, two variables is again the optimal number.
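A short follow-up sketch that reports the size minimizing the LOOCV estimate and compares it with the error at two variables (the size favored by parsimony):
which.min(mean.cv.errors)   # model size with the smallest LOOCV error
min(mean.cv.errors)         # the corresponding error
mean.cv.errors[2]           # LOOCV error of the 2-variable (weight + free) model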
# 10-Fold Cross-Validation for linear model
k=10
library(RColorBrewer)
library(leaps)
set.seed(20144607)
# assign each observation to one of the 10 folds (once, before the loop)
folds <- sample(1:k, nrow(fat), replace = TRUE)
kfcv.errors <- matrix(NA, k, 15, dimnames = list(NULL, paste(1:15)))
for (j in 1:k) {
  # fit best subset selection on the 9 training folds, then evaluate on fold j
  best.fit <- regsubsets(brozek ~ ., data = fat[folds != j, ], nvmax = 15)
  for (i in 1:15) {
    pred <- predict(best.fit, fat[folds == j, ], id = i)
    kfcv.errors[j, i] <- mean((fat$brozek[folds == j] - pred)^2)
  }
  # plot the error curve for the first fold, then overlay the other folds
  if (j < 2) {
    plot(kfcv.errors[j, ], type = "o", main = "10-Fold Cross-Validation for linear model",
         ylim = c(0, max(kfcv.errors[j, ]) * 2), ylab = "kfcv.error", xlab = "index")
  } else {
    lines(kfcv.errors[j, ], type = "o", col = brewer.pal(9, "Set1")[(j - 1) %% 9 + 1])
  }
}
Applying 10-fold CV, all ten fold-wise error curves drop most sharply at two variables and show similar test MSE thereafter. There is more randomness than with LOOCV, but the curves are very similar in shape, so the variance across folds is small. Taken together, the model with two variables was judged to be the optimal model.
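As a hedged wrap-up, assuming the per-fold errors are stored in the kfcv.errors matrix built above, the fold-averaged CV error can be summarized to confirm the chosen size:
mean.kfcv.errors <- apply(kfcv.errors, 2, mean)   # average test MSE over the 10 folds
which.min(mean.kfcv.errors)                       # size with the smallest 10-fold CV error
mean.kfcv.errors[2]                               # CV error of the 2-variable (weight + free) model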