系統環境:R 3.4.4 (x64) on Microsoft Windows 10 Ver. 1709

前置作業

本題會使用的 package 有 Lanman / dplyr / magrittr,先將這些 package include 進來備用

library(Lahman)
library(dplyr)
library(magrittr)

8-1 Please find a regression model with at most 10 covariates that well predicts salary of 2015.

選擇被三振次數(SO),球員的年資(finalGame-debut),出場數(G),安打數(H),打點數(RBI),接殺數(PO),盜壘數(SB)

## 2015 薪水資料
tbl_salary <- tbl_df(Salaries) %>% filter(yearID == 2015)  %>% select(playerID,salary) %>% 
  mutate(salary = salary/10000)
tbl_salary
  # A tibble: 817 x 2
     playerID  salary
     <chr>      <dbl>
   1 ahmedni01   50.8
   2 anderch01   51.2
   3 chafian01   50.8
   4 collmjo01  140. 
   5 corbipa01   52.4
   6 delarru01   51.6
   7 delgara01   52.6
   8 goldspa01  310. 
   9 gosewtu01   51.4
  10 hellije01  428. 
  # ... with 807 more rows
## 三振次數(SO),出場數(G),安打數(H),打點數(RBI),盜壘數(SB)
tbl_batting <- tbl_df(Batting) %>% select(playerID,SO,G,H,RBI,SB) %>% group_by(playerID) %>% summarise_all(.funs = funs(sum))
tbl_batting
  # A tibble: 18,915 x 6
     playerID     SO     G     H   RBI    SB
     <chr>     <int> <int> <int> <int> <int>
   1 aardsda01     2   331     0     0     0
   2 aaronha01  1383  3298  3771  2297   240
   3 aaronto01   145   437   216    94     9
   4 aasedo01      3   448     0     0     0
   5 abadan01      5    15     2     0     0
   6 abadfe01      5   315     1     0     0
   7 abadijo01     3    12    11     5     1
   8 abbated01    NA   855   772   324   142
   9 abbeybe01    54    79    38    17     3
  10 abbeych01    NA   451   492   280    93
  # ... with 18,905 more rows
## 接殺數(PO)
tbl_fielding <- tbl_df(Fielding) %>% select(playerID,PO) %>% group_by(playerID) %>% summarise_all(.funs = funs(sum))
tbl_fielding
  # A tibble: 18,714 x 2
     playerID     PO
     <chr>     <int>
   1 aardsda01    11
   2 aaronha01  7436
   3 aaronto01  1317
   4 aasedo01     67
   5 abadan01     37
   6 abadfe01      7
   7 abadijo01   129
   8 abbated01  1872
   9 abbeybe01    17
  10 abbeych01   917
  # ... with 18,704 more rows
## 球員的年資(finalGame-debut)
tbl_master <- tbl_df(Master) %>% select(playerID,debut,finalGame) %>% 
  mutate(seniority = as.double( (as.Date(finalGame)-as.Date(debut)))/365) %>% select(playerID,seniority)
tbl_master
  # A tibble: 19,105 x 2
     playerID  seniority
     <chr>         <dbl>
   1 aardsda01    11.4  
   2 aaronha01    22.5  
   3 aaronto01     9.47 
   4 aasedo01     13.2  
   5 abadan01      4.59 
   6 abadfe01      6.17 
   7 abadijo01     0.123
   8 abbated01    13.0  
   9 abbeybe01     4.28 
  10 abbeych01     4.01 
  # ... with 19,095 more rows
## 全部 join 到 tbl_salary
tbl_salary <- tbl_salary %>% inner_join(tbl_batting, by = "playerID") %>% inner_join(tbl_fielding, by = "playerID") %>% inner_join(tbl_master, by = "playerID")
tbl_salary
  # A tibble: 817 x 9
     playerID  salary    SO     G     H   RBI    SB    PO seniority
     <chr>      <dbl> <int> <int> <int> <int> <int> <int>     <dbl>
   1 ahmedni01   50.8   149   249   171    58     9   328      2.07
   2 anderch01   51.2    64    81    10     4     0    25      2.39
   3 chafian01   50.8     2   101     1     1     0     5      1.89
   4 collmjo01  140.     56   203    19     6     0    48      5.46
   5 corbipa01   52.4    65   109    30    14     0    18      4.42
   6 delarru01   51.6    29    89    10     3     0    50      5.32
   7 delgara01   52.6    33   236    14     2     0    26      5.30
   8 goldspa01  310.    739   779   844   507    99  6769      5.18
   9 gosewtu01   51.4    77   126    78    30     2   825      3.17
  10 hellije01  428.     41   176    19    11     0    64      6.16
  # ... with 807 more rows

再來做預測

lm_fit <- lm(salary ~ SO+G+H+RBI+SB+PO+seniority, data = tbl_salary) 
summary(lm_fit)
  
  Call:
  lm(formula = salary ~ SO + G + H + RBI + SB + PO + seniority, 
      data = tbl_salary)
  
  Residuals:
       Min       1Q   Median       3Q      Max 
  -1232.97  -188.90   -64.50    72.13  2769.83 
  
  Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
  (Intercept) -79.42321   30.61144  -2.595  0.00964 ** 
  SO           -0.04146    0.11099  -0.374  0.70885    
  G            -0.70347    0.14942  -4.708 2.94e-06 ***
  H            -0.04074    0.17138  -0.238  0.81215    
  RBI           1.51656    0.25889   5.858 6.81e-09 ***
  SB            1.26193    0.39719   3.177  0.00154 ** 
  PO            0.00970    0.01063   0.913  0.36161    
  seniority    86.66915    6.64989  13.033  < 2e-16 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 403.9 on 809 degrees of freedom
  Multiple R-squared:  0.4667,  Adjusted R-squared:  0.462 
  F-statistic: 101.1 on 7 and 809 DF,  p-value: < 2.2e-16

8-2 Which varialbe is the most significant one in your model? Why? How to interpretate the value of its slope?

依照以上資料所述,球員的年資(finalGame-debut)對預測薪水回歸函式影響最大,代表著隨著球員年資多 1 年,薪水會多出 86 萬


8-3 Please use the variance influence factor and the leverage statistic to assess potential problems in your model. Any suggestion for the identified problem?

lm_fit %>% car::vif()
         SO         G         H       RBI        SB        PO seniority 
   8.734464 24.645852 42.735451 26.946870  2.929629  2.296168  3.078570

實務上可以發現三振次數(SO),出場數(G),安打數(H),打點數(RBI)的 VIF 值偏高,而嘗試只留打點數(RBI)後發現

lm_fit_fix <- lm(salary ~ RBI+SB+PO+seniority, data = tbl_salary)
summary(lm_fit_fix)
  
  Call:
  lm(formula = salary ~ RBI + SB + PO + seniority, data = tbl_salary)
  
  Residuals:
       Min       1Q   Median       3Q      Max 
  -1405.65  -194.80   -69.33    75.18  2828.59 
  
  Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
  (Intercept) -1.101e+02  3.033e+01  -3.628 0.000303 ***
  RBI          6.304e-01  9.118e-02   6.914 9.54e-12 ***
  SB           2.669e-01  2.903e-01   0.919 0.358190    
  PO           1.211e-03  1.071e-02   0.113 0.909999    
  seniority    6.235e+01  4.455e+00  13.997  < 2e-16 ***
  ---
  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  
  Residual standard error: 412.5 on 812 degrees of freedom
  Multiple R-squared:  0.4416,  Adjusted R-squared:  0.4389 
  F-statistic: 160.6 on 4 and 812 DF,  p-value: < 2.2e-16

其實預測結果 \(R^2\) 並沒有下降很多

lm_fit_fix %>% car::vif()
        RBI        SB        PO seniority 
   3.204785  1.500886  2.234828  1.324624

也確認 VIF 值沒有特別高的部分,接下來針對離群值來分析

influence(lm_fit)$hat %>% plot(type = "h", ylab = "Value of Leverage Statistic")

influence(lm_fit_fix)$hat %>% plot(type = "h", ylab = "Value of Leverage Statistic")

第一張圖是原始的模型,第二張圖只留下打點數(RBI),盜壘數(SB),接殺數(PO)與球員的年資(finalGame-debut),可以看出離群值的結果稍微好一點點