System environment: R 3.4.4 (x64) on Microsoft Windows 10 Ver. 1709

Setup

This exercise uses the Lahman, dplyr, and magrittr packages, so load them first.

library(Lahman)
library(dplyr)
library(magrittr)

7-1 Please predict both salary and HR by using H and then calculate the corresponding residuals e_salary and e_HR.

In the slides, tbl_baseball is built by joining the 2015 salary data with the 2014 batting data. To match the question more closely, only the salary, HR, and H columns are kept.

## Get the 2015 salary data (salary expressed in units of 10,000)
tbl_s <- as_tibble(Salaries) %>%
  filter(yearID == 2015) %>%
  select(-yearID) %>%
  mutate(salary = salary / 10000)

## Get the 2014 batting data, summed per player across stints
tbl_b <- as_tibble(Batting) %>%
  filter(yearID == 2014) %>%
  select(-yearID, -stint, -teamID, -lgID) %>%
  group_by(playerID) %>%
  summarise_all(sum)

## Join, keeping only the columns the question needs
tbl_baseball <- tbl_s %>%
  inner_join(tbl_b, by = "playerID") %>%
  select(playerID, salary, H, HR)
tbl_baseball
  # A tibble: 792 x 4
     playerID  salary     H    HR
     <chr>      <dbl> <int> <int>
   1 ahmedni01   50.8    14     1
   2 anderch01   51.2     1     0
   3 chafian01   50.8     1     0
   4 collmjo01  140.      6     0
   5 delarru01   51.6     0     0
   6 delgara01   52.6     1     0
   7 goldspa01  310.    122    19
   8 gosewtu01   51.4    29     1
   9 hellije01  428.      0     0
  10 hillaa01  1200.    122    10
  # ... with 782 more rows

Use H to predict salary

lm(salary ~ H,data = tbl_baseball)
  
  Call:
  lm(formula = salary ~ H, data = tbl_baseball)
  
  Coefficients:
  (Intercept)            H  
      295.003        3.051

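From the output above, the slope for H is 3.051 and the intercept is 295.003. The code below estimates each player's salary residual by subtracting H*3.051 from the original salary and adds it to tbl_baseball as e_salary. Note that this subtracts only the slope term and omits the intercept, so e_salary is the true residual shifted by the constant 295.003. A constant shift does not change the slope estimated in 7-2, only its intercept.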

tbl_baseball <- tbl_baseball %>% mutate(e_salary = salary - H*3.051)
tbl_baseball
  # A tibble: 792 x 5
     playerID  salary     H    HR e_salary
     <chr>      <dbl> <int> <int>    <dbl>
   1 ahmedni01   50.8    14     1     8.14
   2 anderch01   51.2     1     0    48.2 
   3 chafian01   50.8     1     0    47.7 
   4 collmjo01  140.      6     0   122.  
   5 delarru01   51.6     0     0    51.6 
   6 delgara01   52.6     1     0    49.5 
   7 goldspa01  310.    122    19   -62.2 
   8 gosewtu01   51.4    29     1   -37.0 
   9 hellije01  428.      0     0   428.  
  10 hillaa01  1200.    122    10   828.  
  # ... with 782 more rows
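
As a cross-check on the manual subtraction above, the residuals can also be read directly from the fitted model with resid(). The following is a minimal sketch on synthetic data: the data frame toy and its columns are made up for illustration and are not the Lahman data. It shows that subtracting only the slope term leaves every value off from the true residual by exactly the intercept.

```r
# Synthetic stand-in for tbl_baseball (assumption: NOT the Lahman data)
set.seed(1)
toy <- data.frame(H = rpois(50, 60))
toy$salary <- 300 + 3 * toy$H + rnorm(50, sd = 100)

fit <- lm(salary ~ H, data = toy)

# True residuals: observed minus fitted values (intercept included)
toy$e_salary <- unname(resid(fit))

# Slope-only subtraction, mirroring salary - H * 3.051 in the text
toy$e_manual <- toy$salary - coef(fit)[["H"]] * toy$H

# Every row differs by the same constant: the fitted intercept
shift <- toy$e_manual - toy$e_salary
all.equal(shift, rep(coef(fit)[["(Intercept)"]], nrow(toy)))
```

On the real data the same idea would be resid(lm(salary ~ H, data = tbl_baseball)); the values would differ from the e_salary column shown here by roughly the intercept.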

Use H to predict HR

lm(HR ~ H,data = tbl_baseball)
  
  Call:
  lm(formula = HR ~ H, data = tbl_baseball)
  
  Coefficients:
  (Intercept)            H  
      -0.1011       0.1042

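From the output above, the slope for H is 0.1042 and the intercept is -0.1011. The code below estimates each player's HR residual by subtracting H*0.1042 from the original HR and adds it to tbl_baseball as e_hr. As with e_salary, the intercept is omitted, so e_hr is the true residual shifted by the constant -0.1011.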

tbl_baseball <- tbl_baseball %>% mutate(e_hr = HR - H*0.1042)
tbl_baseball
  # A tibble: 792 x 6
     playerID  salary     H    HR e_salary   e_hr
     <chr>      <dbl> <int> <int>    <dbl>  <dbl>
   1 ahmedni01   50.8    14     1     8.14 -0.459
   2 anderch01   51.2     1     0    48.2  -0.104
   3 chafian01   50.8     1     0    47.7  -0.104
   4 collmjo01  140.      6     0   122.   -0.625
   5 delarru01   51.6     0     0    51.6   0.   
   6 delgara01   52.6     1     0    49.5  -0.104
   7 goldspa01  310.    122    19   -62.2   6.29 
   8 gosewtu01   51.4    29     1   -37.0  -2.02 
   9 hellije01  428.      0     0   428.    0.   
  10 hillaa01  1200.    122    10   828.   -2.71 
  # ... with 782 more rows

7-2 Use e_HR to predict e_salary and report the slope estimate.

Here we use e_HR (the e_hr column) to predict e_salary

lm(e_salary ~ e_hr,data = tbl_baseball)
  
  Call:
  lm(formula = e_salary ~ e_hr, data = tbl_baseball)
  
  Coefficients:
  (Intercept)         e_hr  
       296.39        13.89

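From the output above, the slope is 13.89. The intercept is nonzero (296.39) only because e_salary and e_hr were computed without subtracting the intercepts of their respective fits.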


7-3 Now, predict salary by using both H and HR. Report the slope of HR and compare it with the slope obtained in 2.

Use both H and HR to predict salary

lm(salary ~ H+HR,data = tbl_baseball)
  
  Call:
  lm(formula = salary ~ H + HR, data = tbl_baseball)
  
  Coefficients:
  (Intercept)            H           HR  
      296.407        1.603       13.891

From the output above, the slope of HR is 13.891, which matches the slope from the previous question (13.89) up to rounding.
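
The agreement can be checked by hand from the coefficients printed above. As a sketch, write \(r_{\text{salary}}\) and \(r_{\text{HR}}\) for the residuals of the two simple fits on H (this notation is introduced here for illustration), and substitute the second fit into the residual regression with slope 13.89:

```latex
\begin{align*}
\text{salary} &= 295.003 + 3.051\,H + r_{\text{salary}}, \\
\text{HR} &= -0.1011 + 0.1042\,H + r_{\text{HR}}, \\
r_{\text{salary}} &= 13.89\,r_{\text{HR}} + \varepsilon \\
\Rightarrow\ \text{salary}
  &= (295.003 + 13.89 \times 0.1011)
   + (3.051 - 13.89 \times 0.1042)\,H
   + 13.89\,\text{HR} + \varepsilon \\
  &\approx 296.407 + 1.604\,H + 13.89\,\text{HR} + \varepsilon .
\end{align*}
```

The result reproduces the intercept (296.407) and the H slope (1.603) of the two-predictor fit, up to rounding of the printed coefficients.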


7-4 Could you give an implication about what you observe in 1-3?

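The slopes obtained in 7-2 and 7-3 are the same. This implies that when a regression contains two predictors \(x_1\) and \(x_2\), the coefficient of the second predictor \(x_2\) equals the slope obtained by regressing the residual of the response on \(x_1\) alone against the residual of \(x_2\) on \(x_1\) alone. In other words, the coefficient of HR measures the relationship between HR and salary after the part explained by H has been removed from both, a result known as the Frisch-Waugh-Lovell theorem.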
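
The relationship between the 7-2 and 7-3 slopes is not specific to the baseball data. The following is a minimal sketch on synthetic data: x1, x2, y, and the coefficients used to generate them are arbitrary choices for illustration. It compares the slope from the residual-on-residual regression with the x2 coefficient from the two-predictor regression.

```r
# Synthetic data (assumption: arbitrary illustrative values, not Lahman)
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)            # x2 is correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

# Step 1: remove the part of y and of x2 that x1 explains
e_y  <- resid(lm(y ~ x1))
e_x2 <- resid(lm(x2 ~ x1))

# Step 2: regress residual on residual
fit_partial   <- lm(e_y ~ e_x2)
slope_partial <- coef(fit_partial)[["e_x2"]]

# Step 3: coefficient of x2 in the regression on both predictors
slope_full <- coef(lm(y ~ x1 + x2))[["x2"]]

# The two slopes agree up to floating-point error
all.equal(slope_partial, slope_full)
```

Because resid() includes the intercept of each fit, the intercept of fit_partial is numerically zero here, unlike the 296.39 seen in 7-2.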