パッケージにあらかじめ組み込まれているcarsというデータを使う。

data(cars) #データを読み込む
head(cars) #データの内容確認

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

cars$speed<-cars$speed*1.6 #マイルをkmに直す。
cars$dist<-cars$dist*0.3 #フィートをmに直す。
head(cars) #データの内容確認

##   speed dist
## 1   6.4  0.6
## 2   6.4  3.0
## 3  11.2  1.2
## 4  11.2  6.6
## 5  12.8  4.8
## 6  14.4  3.0

plot(cars) #散布図を描く

cor(cars) #cor()で相関係数を出す。

##           speed      dist
## speed 1.0000000 0.8068949
## dist  0.8068949 1.0000000

speedとdistの間の相関係数は約0.8069。

回帰分析を行う。関数はlm。

目的変数~説明変数　という書き方をする。

dist~speed　でdist=a*speed+bという回帰式を示す。

cars.lm<-lm(dist~speed, data=cars) #回帰分析
summary(cars.lm) #結果の表示

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7207 -2.8576 -0.6816  2.7644 12.9604 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.27373    2.02753  -2.601   0.0123 *  
## speed        0.73733    0.07791   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.614 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Residuals:

残差。データから求められる実際の値と理論式から求められる値との差。

Coefficients:

回帰式の係数 Estimateの欄が推定値。 dist=aspeed+bにおいて、 dist=0.73733speed-5.27373となることを示している。

Std. Errorは係数の誤差。 t valueは推定値の検定統計量。

Signif. codes:

Pr(>|t|)の右隣に示されている*の数がどういう意味かを示している。

0.0123 となっているので、(Intercept)の信頼度は、有意水準で ’’ 0.05 ‘ 要するに有意水準5%で帰無仮説「係数は0である」を棄却する。係数が0となるということは回帰式の切片が0であることを示すが、それを棄却するので、切片Interceptの値には意味があるということ。

1.49e-12 ***となっているので、speedの信頼度は、有意水準で ‘0 '***' 要するに有意水準がほぼ0%に近い値においても、帰無仮説「係数は0である」を棄却する。係数が0となるということは回帰式の回帰係数が0であることを示すが、それを棄却するので、回帰係数speedの値には意味があるということ。

Residual standard error:

残差の標準誤差。残差の二乗和を残差の自由度で割った平方根。

Multiple R-squared:

決定係数(coefficient of determination)、寄与率(contribution ration)

Adjusted R-squared:

調整済み決定係数。標本数と説明変数の数を考慮した係数。

ともに1に近いほど、データがモデルに当てはまっている。

F-statistic:

F統計量。帰無仮説「決定係数が0である」を分散分析の手法で検定している。p-value:1.49e-12なので、帰無仮説を5%有意水準においても、それ以下の水準においても棄却できる。

plot(cars)
abline(cars.lm) #散布図に回帰直線を書き加える。

confidence_intervals <- confint(cars.lm)
confidence_intervals

##                  2.5 %     97.5 %
## (Intercept) -9.3503549 -1.1971021
## speed        0.5806808  0.8939725

confin()で回帰係数の信頼区間が出力される。 dist=aspeed+bとして、予測された回帰式は dist=0.73733speed-5.27373 だったが、そのaとbの信頼区間は0.5806808<a<0.8939725 -9.3503549<b<-1.1971021 ということ。

#全ての値の予測値と信頼区間
predict<-predict(cars.lm, interval="confidence")
predict

##          fit       lwr       upr
## 1  -0.554838 -3.698863  2.589187
## 2  -0.554838 -3.698863  2.589187
## 3   2.984330  0.503693  5.464967
## 4   2.984330  0.503693  5.464967
## 5   4.164053  1.892258  6.435847
## 6   5.343775  3.271536  7.416014
## 7   6.523498  4.638575  8.408420
## 8   6.523498  4.638575  8.408420
## 9   6.523498  4.638575  8.408420
## 10  7.703220  5.989358  9.417083
## 11  7.703220  5.989358  9.417083
## 12  8.882943  7.318541 10.447345
## 13  8.882943  7.318541 10.447345
## 14  8.882943  7.318541 10.447345
## 15  8.882943  7.318541 10.447345
## 16 10.062666  8.619401 11.505931
## 17 10.062666  8.619401 11.505931
## 18 10.062666  8.619401 11.505931
## 19 10.062666  8.619401 11.505931
## 20 11.242388  9.884335 12.600442
## 21 11.242388  9.884335 12.600442
## 22 11.242388  9.884335 12.600442
## 23 11.242388  9.884335 12.600442
## 24 12.422111 11.106346 13.737876
## 25 12.422111 11.106346 13.737876
## 26 12.422111 11.106346 13.737876
## 27 13.601834 12.281303 14.922364
## 28 13.601834 12.281303 14.922364
## 29 14.781556 13.409696 16.153416
## 30 14.781556 13.409696 16.153416
## 31 14.781556 13.409696 16.153416
## 32 15.961279 14.496413 17.426144
## 33 15.961279 14.496413 17.426144
## 34 15.961279 14.496413 17.426144
## 35 15.961279 14.496413 17.426144
## 36 17.141001 15.548740 18.733263
## 37 17.141001 15.548740 18.733263
## 38 17.141001 15.548740 18.733263
## 39 18.320724 16.574186 20.067263
## 40 18.320724 16.574186 20.067263
## 41 18.320724 16.574186 20.067263
## 42 18.320724 16.574186 20.067263
## 43 18.320724 16.574186 20.067263
## 44 20.680169 18.568890 22.791448
## 45 21.859892 19.546993 24.172791
## 46 23.039615 20.516296 25.562933
## 47 23.039615 20.516296 25.562933
## 48 23.039615 20.516296 25.562933
## 49 23.039615 20.516296 25.562933
## 50 24.219337 21.478825 26.959850

#残差
residual<-residuals(cars.lm)
residual

##          1          2          3          4          5          6          7 
##  1.1548380  3.5548380 -1.7843299  3.6156701  0.6359474 -2.3437752 -1.1234978 
##          8          9         10         11         12         13         14 
##  1.2765022  3.6765022 -2.6032204  0.6967796 -4.6829431 -2.8829431 -1.6829431 
##         15         16         17         18         19         20         21 
## -0.4829431 -2.2626657  0.1373343  0.1373343  3.7373343 -3.4423883 -0.4423883 
##         22         23         24         25         26         27         28 
##  6.7576117 12.7576117 -6.4221109 -4.6221109  3.7778891 -4.0018336 -1.6018336 
##         29         30         31         32         33         34         35 
## -5.1815562 -2.7815562  0.2184438 -3.3612788  0.8387212  6.8387212  9.2387212 
##         36         37         38         39         40         41         42 
## -6.3410015 -3.3410015  3.2589985 -8.7207241 -3.9207241 -2.7207241 -1.5207241 
##         43         44         45         46         47         48         49 
##  0.8792759 -0.8801693 -5.6598920 -2.0396146  4.5603854  4.8603854 12.9603854 
##         50 
##  1.2806628

#
par(mfrow=c(2,2))
plot(cars.lm)

Residuals vs Fitted

予測値(Fitted values)と残差の散布図。

Nomal QQ

データが正規分布に従うと点が直線上に並ぶ。

Scale-Location

予測値と残差の平方根

Residuals vs Leverage

梃子値(Leverage)と標準化した残差の散布図。点線がクック距離というもの。梃子値もクック距離もモデルの当てはまりの良さを示す。

決定係数の計算

# 元のデータから計算
sr<-sum((cars.lm$fitted-mean(cars.lm$fitted))^2)
se<-sum((cars$dist-cars.lm$fitted)^2)
st<-sum((cars$dist-mean(cars$dist))^2)
sr/st #決定係数

## [1] 0.6510794

#先ほどのsummary(cars.lm)の値から取り出した場合
summary(cars.lm)$r.squared

## [1] 0.6510794

#元のデータから計算と #先ほどのsummary(cars.lm)の値から取り出した場合が一致する。

F値

# 元の値からF値を求める
ve<-se/48
sr/ve

## [1] 89.56711

# 先ほどのsummary(cars.lm)の値から取り出した場合
summary(cars.lm)$fstatistic

##    value    numdf    dendf 
## 89.56711  1.00000 48.00000