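Here we use "diamonds.csv" as the example file.
## Simple Linear Regression
The "diamonds.csv" data set contains columns such as "carat", "depth", and "price". Before running any regression, we first look at an overview of the data and at scatter plots between these variables: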
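A summary like the one that follows could be produced along these lines. This is only a sketch: it assumes "diamonds.csv" sits in the working directory, and the helper name `load_diamonds` is hypothetical rather than part of the original analysis.

```r
# Hypothetical helper: read the CSV, turning text columns into factors
# so that summary() reports level counts for cut, color, and clarity.
load_diamonds <- function(path = "diamonds.csv") {
  read.csv(path, stringsAsFactors = TRUE)
}

# Only run when the file is actually present in the working directory.
if (file.exists("diamonds.csv")) {
  diamonds <- load_diamonds()
  print(summary(diamonds))
}
```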
##      carat               cut        color        clarity
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066
##  Max.   :5.0100                     I: 5422   VVS1   : 3655
##                                     J: 2808   (Other): 2531
##      depth           table           price             x
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740
##
##        y                z
##  Min.   : 0.000   Min.   : 0.000
##  1st Qu.: 4.720   1st Qu.: 2.910
##  Median : 5.710   Median : 3.530
##  Mean   : 5.735   Mean   : 3.539
##  3rd Qu.: 6.540   3rd Qu.: 4.040
##  Max.   :58.900   Max.   :31.800
##
Next, the scatter plots between the variables:
We then use R Commander to fit the linear regressions:
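One way such scatter plots could be drawn is with base R's `pairs()`, which plots every pair of the chosen columns against each other. The sketch assumes `diamonds` is the data frame used by the `lm()` calls in this section; the helper name `scatter_matrix` and the choice of the three columns are illustrative assumptions.

```r
# Hypothetical helper: scatter-plot matrix for a few numeric columns.
scatter_matrix <- function(df, vars = c("carat", "depth", "price")) {
  stopifnot(all(vars %in% names(df)))
  # pch = "." keeps tens of thousands of points readable.
  pairs(df[, vars], pch = ".")
  invisible(vars)
}

# Draw the plots only if a data frame called diamonds is already loaded.
if (exists("diamonds") && is.data.frame(diamonds)) {
  scatter_matrix(diamonds)
}
```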
RegModel.7 <- lm(carat~price, data=diamonds)
summary(RegModel.7)
##
## Call:
## lm(formula = carat ~ price, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.35765 -0.11329 -0.02442 0.10344 2.66973
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.673e-01 1.112e-03 330.2 <2e-16 ***
## price 1.095e-04 1.986e-07 551.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.184 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
RegModel.8 <- lm(carat~depth, data=diamonds)
summary(RegModel.8)
##
## Call:
## lm(formula = carat ~ depth, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6134 -0.4028 -0.0937 0.2488 4.1770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.221288 0.087960 2.516 0.0119 *
## depth 0.009339 0.001424 6.558 5.52e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4738 on 53938 degrees of freedom
## Multiple R-squared: 0.0007966, Adjusted R-squared: 0.0007781
## F-statistic: 43 on 1 and 53938 DF, p-value: 5.518e-11
RegModel.9 <- lm(depth~price, data=diamonds)
summary(RegModel.9)
##
## Call:
## lm(formula = depth ~ price, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.7505 -0.7217 0.0784 0.7569 17.2454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.176e+01 8.661e-03 7130.977 <2e-16 ***
## price -3.824e-06 1.546e-06 -2.473 0.0134 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.433 on 53938 degrees of freedom
## Multiple R-squared: 0.0001134, Adjusted R-squared: 9.483e-05
## F-statistic: 6.115 on 1 and 53938 DF, p-value: 0.0134
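R² always lies between 0 and 1. In a simple linear regression it is the share of the response's variance explained by the predictor, so the closer R² is to 1, the stronger the linear relationship between the two variables.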
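For a simple linear regression with an intercept, R² equals the squared Pearson correlation between the two variables, so it does not change when the response and the predictor are swapped. The sketch below checks this on synthetic data; the data and all variable names are made up for illustration and do not come from "diamonds.csv".

```r
set.seed(1)
x <- rnorm(200)
y <- 2 * x + rnorm(200)   # synthetic data, for illustration only

fit_yx <- lm(y ~ x)       # y as response
fit_xy <- lm(x ~ y)       # roles swapped

r2_yx <- summary(fit_yx)$r.squared
r2_xy <- summary(fit_xy)$r.squared
r     <- cor(x, y)

# All three values agree up to floating-point error.
print(c(r2_yx = r2_yx, r2_xy = r2_xy, cor_squared = r^2))
```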
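Comparing the three fits above, carat and price (Figure 1) show by far the strongest relationship, with R² = 0.8493. The other two pairs have R² values of about 0.0008 and 0.0001, so depth explains almost none of the variation in carat, and price almost none of the variation in depth. Their slopes are still statistically significant (p = 5.52e-11 and p = 0.0134), which is unsurprising with 53,940 observations, but the linear relationship is practically negligible.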
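To put the three R² values side by side, they can be extracted from the fitted models rather than read off the printed summaries. A sketch, assuming `diamonds` is the data frame used by the models above; `r2_table` is a hypothetical helper, not something R Commander generates.

```r
# Hypothetical helper: fit each formula with lm() and collect its R-squared.
r2_table <- function(df, formulas) {
  sapply(formulas, function(f) summary(lm(f, data = df))$r.squared)
}

# The same three models as above, named for readability.
formulas <- list(
  carat_price = carat ~ price,
  carat_depth = carat ~ depth,
  depth_price = depth ~ price
)

# Print the comparison only if a data frame called diamonds is already loaded.
if (exists("diamonds") && is.data.frame(diamonds)) {
  print(sort(r2_table(diamonds, formulas), decreasing = TRUE))
}
```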