約500の町の心臓病患者の人口割合(heart.disease)(%)と喫煙割合(smoking)(%),自転車通勤割合(biking)(%)について調べたデータ
出典:Scribbr,‘Multiple Linear Regression | A Quick Guide (Examples)’
d <- read.csv('https://stats.dip.jp/01_ds/data/heart.data.csv')[, -1]
library(DT)
datatable(round(d, 1))
# カラーパレット
COL <- c(rgb(255, 0, 0, 255, max = 255), # 赤
rgb( 0, 0, 255, 255, max = 255), # 青
rgb( 0, 155, 0, 255, max = 255)) # 緑
#install.packages('psych')
library(psych)
pairs.panels(d)
\[ \begin{align} Y_i=&\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i, \epsilon_i \sim \mathbf{N}(0, \sigma^2)\\\\ ここで,&Y_i: 目的変数(心臓病罹患率 heart.disease [%])\\ &x_{i1}: 説明変数1(自転車通勤割合 biking [%])\\ &x_{i2}: 説明変数2(喫煙割合 smoking [%])\\ &\beta_0, \beta_1, \beta_2: 偏回帰係数\\ &\epsilon_i:誤差項\\ &\sigma^2:誤差分散 \end{align} \]
set.seed(1) # 乱数シード
# インデックス番号を無作為抽出する(90%を訓練用,残り10%を試験用)。
ii <- sample(1:nrow(d), size = round(nrow(d)*0.9))
d.tr <- d[ ii, ] # 訓練データ
d.te <- d[-ii, ] # 試験データ
# 心臓病患者率を目的変数,自転車通勤率と喫煙率を説明変数にした
# 回帰モデルにフィッティングする。
fit0 <- lm(heart.disease ~ biking + smoking, data = d.tr)
# 試験データを使って予測
pred0 <- predict(fit0, newdata = d.te)
# 精度計算
get.accuracy <- function(yhat, y, digits = 2)
{
d <- data.frame(MBE = mean(yhat - y),
MAE = mean(abs(yhat - y)),
MAPE = mean(abs((yhat - y) / y)) * 100,
RMSE = sqrt(mean((yhat - y)^2)))
return(round(d, digits))
}
(a <- get.accuracy(pred0, d.te$heart.disease))
## MBE MAE MAPE RMSE
## 1 0.07 0.5 8.68 0.62
#summary(fit)
#install.packages('sjPlot')
library(sjPlot)
tab_model(fit0, show.stat = T, show.aic = T)
heart.disease | ||||
---|---|---|---|---|
Predictors | Estimates | CI | Statistic | p |
(Intercept) | 15.05 | 14.88 – 15.22 | 177.24 | <0.001 |
biking | -0.20 | -0.20 – -0.20 | -137.49 | <0.001 |
smoking | 0.18 | 0.17 – 0.18 | 46.42 | <0.001 |
Observations | 448 | |||
R2 / R2 adjusted | 0.979 / 0.979 | |||
AIC | 902.026 |
plot_model(fit0, show.values = T, show.intercept = T, width = 0.1)
回帰モデルはハイパーパラメータが存在しないので,正確な予測精度を推定する だけの利用となる。
オプションnumberはfold数
#install.packages('caret')
library(caret)
fit1 <- train(heart.disease ~ biking + smoking, data = d, method = 'lm',
trControl = trainControl(method = 'LOOCV'))
fit1
## Linear Regression
##
## 498 samples
## 2 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 497, 497, 497, 497, 497, 497, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.6559766 0.9793718 0.5182802
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
fit.cv5 <- train(heart.disease ~ biking + smoking, data = d,
method = 'lm',
trControl = trainControl(method = 'cv', number = 5))
fit.cv5
## Linear Regression
##
## 498 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 398, 398, 399, 398, 399
## Resampling results:
##
## RMSE Rsquared MAE
## 0.6550929 0.9796439 0.5195836
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
fit.cv10 <- train(heart.disease ~ biking + smoking, data = d,
method = 'lm',
trControl = trainControl(method = 'cv', number = 10))
fit.cv10
## Linear Regression
##
## 498 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 450, 448, 449, 448, 449, 447, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.6513574 0.9795974 0.5175201
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
fit.r3cv10 <- train(heart.disease ~ biking + smoking, data = d,
method = 'lm',
trControl = trainControl(method = 'repeatedcv',
number = 10, repeats = 3))
fit.r3cv10
## Linear Regression
##
## 498 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 446, 449, 448, 450, 446, 448, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.6524242 0.9800117 0.518022
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
縮小回帰(Shrinkage Method)のハイパーパラメータ推定を交差検証法を用いて 行う。
縮小回帰の目的関数では回帰係数の大きさが罰則項として存在する。lambda(\(\lambda\))は正則化係数で回帰係数の大きさに対する罰則(penalty)を調整する係数である。 \[ \underset{\mathbf{\beta}\in \mathbb{R}^{p+1}}{\min} \Bigg[\frac{1}{2N}\sum_{i=1}^N(y_i-\mathbf{X}\mathbf{\beta})^2+\lambda P_\alpha(\mathbf{\beta})\Bigg] \] ここで, \[ P_\alpha (\mathbf{\beta}) = (1-\alpha)\frac{1}{2}||\mathbf{\beta}||_{L_2}^2 + \alpha ||\mathbf{\beta}||_{L_1} \] は,回帰係数への罰則(penalty)項である。オプションalpha(\(\alpha\))は,0のとき,Ridge回帰,1のときLasso回帰,0超1未満のときElastic Netとなる。
myGrid <- expand.grid(alpha = seq(0, 1, 0.1),
lambda = seq(0.0, 1.0, length = 20))
fit.en <- train(heart.disease ~ biking + smoking, data = d.tr,
method = 'glmnet',
tuneGrid = myGrid,
trControl = trainControl(method = 'repeatedcv',
number = 5, repeats = 10,
allowParallel = T))
# 最適パラメータ
fit.en$bestTune
## alpha lambda
## 161 0.8 0
plot(fit.en)
pred.en <- predict(fit.en, newdata = d.te)
(a.en <- get.accuracy(pred.en, d.te$heart.disease))
## MBE MAE MAPE RMSE
## 1 0.07 0.5 9.12 0.62
library(glmnet)
head(d.tr)
## biking smoking heart.disease
## 324 34.53005 16.781663 10.729181
## 167 44.69822 10.002589 7.449413
## 129 17.64814 27.192564 15.138968
## 418 62.02105 23.883050 6.959502
## 471 60.58736 8.173587 4.938059
## 299 51.69270 19.199617 6.952055
x <- as.matrix(d.tr[, -3])
y <- as.matrix(d.tr[, 3])
fit <- cv.glmnet(x, y, alpha = 1, nfolds = 5) # alpha = 1 (Lasso)
fit
##
## Call: cv.glmnet(x = x, y = y, nfolds = 5, alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.02122 58 0.4371 0.02625 2
## 1se 0.11324 40 0.4633 0.02378 2
plot(fit)
pred <- predict(fit, newx = as.matrix(d.te[, -3]))
# glmnetパッケージによる予測精度
(a <- get.accuracy(pred, d.te$heart.disease))
## MBE MAE MAPE RMSE
## 1 0.06 0.54 11.33 0.67
# cf. caratパッケージによる予測精度
a.en
## MBE MAE MAPE RMSE
## 1 0.07 0.5 9.12 0.62
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error
d = pd.read_csv('https://stats.dip.jp/01_ds/data/heart.data.csv')
d.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 498 entries, 0 to 497
## Data columns (total 4 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Unnamed: 0 498 non-null int64
## 1 biking 498 non-null float64
## 2 smoking 498 non-null float64
## 3 heart.disease 498 non-null float64
## dtypes: float64(3), int64(1)
## memory usage: 15.7 KB
x = d[['biking', 'smoking']]
y = d['heart.disease']
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size = 0.1, random_state = 0)
scaler = StandardScaler()
model = LassoCV(alphas = 10 ** np.arange(-6, 0, 0.1), cv = 5)
scaler.fit(x_tr)
StandardScaler()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
StandardScaler()
model.fit(scaler.transform(x_tr), y_tr)
LassoCV(alphas=array([1.00000000e-06, 1.25892541e-06, 1.58489319e-06, 1.99526231e-06, 2.51188643e-06, 3.16227766e-06, 3.98107171e-06, 5.01187234e-06, 6.30957344e-06, 7.94328235e-06, 1.00000000e-05, 1.25892541e-05, 1.58489319e-05, 1.99526231e-05, 2.51188643e-05, 3.16227766e-05, 3.98107171e-05, 5.01187234e-05, 6.30957344e-05, 7.94328235e-05, 1.00000000e-04, 1.25892541e-0... 3.98107171e-03, 5.01187234e-03, 6.30957344e-03, 7.94328235e-03, 1.00000000e-02, 1.25892541e-02, 1.58489319e-02, 1.99526231e-02, 2.51188643e-02, 3.16227766e-02, 3.98107171e-02, 5.01187234e-02, 6.30957344e-02, 7.94328235e-02, 1.00000000e-01, 1.25892541e-01, 1.58489319e-01, 1.99526231e-01, 2.51188643e-01, 3.16227766e-01, 3.98107171e-01, 5.01187234e-01, 6.30957344e-01, 7.94328235e-01]), cv=5)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LassoCV(alphas=array([1.00000000e-06, 1.25892541e-06, 1.58489319e-06, 1.99526231e-06, 2.51188643e-06, 3.16227766e-06, 3.98107171e-06, 5.01187234e-06, 6.30957344e-06, 7.94328235e-06, 1.00000000e-05, 1.25892541e-05, 1.58489319e-05, 1.99526231e-05, 2.51188643e-05, 3.16227766e-05, 3.98107171e-05, 5.01187234e-05, 6.30957344e-05, 7.94328235e-05, 1.00000000e-04, 1.25892541e-0... 3.98107171e-03, 5.01187234e-03, 6.30957344e-03, 7.94328235e-03, 1.00000000e-02, 1.25892541e-02, 1.58489319e-02, 1.99526231e-02, 2.51188643e-02, 3.16227766e-02, 3.98107171e-02, 5.01187234e-02, 6.30957344e-02, 7.94328235e-02, 1.00000000e-01, 1.25892541e-01, 1.58489319e-01, 1.99526231e-01, 2.51188643e-01, 3.16227766e-01, 3.98107171e-01, 5.01187234e-01, 6.30957344e-01, 7.94328235e-01]), cv=5)
yhat = model.predict(scaler.transform(x_te))
mse = mean_squared_error(y_te, yhat)
print(mse)
## 0.4081642127844029
次のデータを使用して, 縮小回帰(Shrinkage Method)のハイパーパラメータ(\(\alpha\),\(\lambda\)) 推定を交差検証法を用いて行え。 なお,目的変数は「燃費」とする。
d <- read.csv('https://stats.dip.jp/01_ds/data/car_mileage.csv')
library(DT)
datatable(d)