Regression Analysis

Here we use “diamonds.csv” as the example file.
## Simple Linear Regression

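The “diamonds.csv” data set contains columns such as “carat”, “depth”, and “price”. Before running any regression, we first look at an overall summary of the data and at scatter plots of these variables: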

##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 
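
The code that produced this summary is not shown. A minimal sketch of how it could be obtained, assuming the file sits in the working directory and is read with base R's `read.csv` (the helper name `load_diamonds` is hypothetical, not part of the original analysis):

```r
# Hypothetical helper: read the CSV and convert text columns
# (cut, color, clarity) to factors so summary() reports level counts.
load_diamonds <- function(path = "diamonds.csv") {
  read.csv(path, stringsAsFactors = TRUE)
}

# Only runs when the file is actually present in the working directory.
if (file.exists("diamonds.csv")) {
  diamonds <- load_diamonds()
  print(summary(diamonds))
}
```

Converting strings to factors is what makes `summary()` print counts per category instead of just the column length and class.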


Next, we look at the scatter plots between the different variables:
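
The plotting call is not shown in the source. As a rough sketch, a scatter plot matrix of the three variables used later could be drawn with base R's `pairs()`. The helper name `plot_pairs` and the choice of columns are assumptions, and the original figure may have been produced differently (for example through an R Commander menu).

```r
# Hypothetical helper: scatter plot matrix for the chosen numeric columns.
plot_pairs <- function(data, vars = c("carat", "depth", "price")) {
  # pch = "." keeps tens of thousands of points readable.
  pairs(data[, vars], pch = ".", main = "Scatter plot matrix")
  invisible(vars)
}

# Assumes diamonds.csv is in the working directory; skipped otherwise.
if (file.exists("diamonds.csv")) {
  plot_pairs(read.csv("diamonds.csv", stringsAsFactors = TRUE))
}
```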


Next, we fit the linear regressions using R Commander:

RegModel.7 <- lm(carat~price, data=diamonds)
summary(RegModel.7)
## 
## Call:
## lm(formula = carat ~ price, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.35765 -0.11329 -0.02442  0.10344  2.66973 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.673e-01  1.112e-03   330.2   <2e-16 ***
## price       1.095e-04  1.986e-07   551.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.184 on 53938 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8493 
## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
RegModel.8 <- lm(carat~depth, data=diamonds)
summary(RegModel.8)
## 
## Call:
## lm(formula = carat ~ depth, data = diamonds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6134 -0.4028 -0.0937  0.2488  4.1770 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.221288   0.087960   2.516   0.0119 *  
## depth       0.009339   0.001424   6.558 5.52e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4738 on 53938 degrees of freedom
## Multiple R-squared:  0.0007966,  Adjusted R-squared:  0.0007781 
## F-statistic:    43 on 1 and 53938 DF,  p-value: 5.518e-11
RegModel.9 <- lm(depth~price, data=diamonds)
summary(RegModel.9)
## 
## Call:
## lm(formula = depth ~ price, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.7505  -0.7217   0.0784   0.7569  17.2454 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  6.176e+01  8.661e-03 7130.977   <2e-16 ***
## price       -3.824e-06  1.546e-06   -2.473   0.0134 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.433 on 53938 degrees of freedom
## Multiple R-squared:  0.0001134,  Adjusted R-squared:  9.483e-05 
## F-statistic: 6.115 on 1 and 53938 DF,  p-value: 0.0134
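
To compare the three fits side by side without reading through each full summary, the R-squared value can be pulled out of every model. The sketch below is illustrative only: `r_squared` and `model_formulas` are hypothetical names, and it assumes a data frame called `diamonds` is already loaded.

```r
# Hypothetical helper: fit a simple linear regression and return its R-squared.
r_squared <- function(formula, data) {
  summary(lm(formula, data = data))$r.squared
}

# The three response ~ predictor pairs fitted above.
model_formulas <- list(
  carat_price = carat ~ price,
  carat_depth = carat ~ depth,
  depth_price = depth ~ price
)

# Assumes the data frame `diamonds` is loaded; skipped otherwise.
if (exists("diamonds") && is.data.frame(diamonds)) {
  print(sapply(model_formulas, r_squared, data = diamonds))
}
```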


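We know that R² lies between 0 and 1. Its size indicates how strongly the two variables are linearly related: it is the proportion of the variance in the response that the fitted line explains, so the closer R² is to 1, the stronger the relationship.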
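From the three results above, carat and price (Figure 1) are the most strongly related pair, with R² = 0.8493. The other two pairs have far smaller R² values (0.0007966 for carat and depth, 0.0001134 for depth and price), so they can be regarded as practically uncorrelated, even though their slopes are statistically significant in a sample of this size.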
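
In a regression with a single predictor, R² equals the square of the Pearson correlation coefficient, so the reported values can be converted back to correlations as a quick sanity check. The sketch below only redoes that arithmetic from the numbers printed above. The variable names are illustrative, and the signs are taken from the fitted slopes.

```r
# R-squared values copied from the three summaries above.
r2 <- c(carat_price = 0.8493, carat_depth = 0.0007966, depth_price = 0.0001134)

# Signs of the fitted slopes: positive, positive, negative.
slope_sign <- c(1, 1, -1)

# Correlation = sign of the slope times the square root of R-squared.
r <- slope_sign * sqrt(r2)
print(round(r, 3))
```

The resulting correlations are roughly 0.92 for carat and price, and about 0.03 and -0.01 for the other two pairs, which matches the reading of the R² values.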