系統環境:R 3.4.4 (x64) on Microsoft Windows 10 Ver. 1709
本題會使用的 package 有 Lanman / dplyr / magrittr
,先將這些 package include 進來備用
library(Lahman)
library(dplyr)
library(magrittr)
選擇被三振次數(SO),球員的年資(finalGame-debut),出場數(G),安打數(H),打點數(RBI),接殺數(PO),盜壘數(SB)
## 2015 薪水資料
tbl_salary <- tbl_df(Salaries) %>% filter(yearID == 2015) %>% select(playerID,salary) %>%
mutate(salary = salary/10000)
tbl_salary
# A tibble: 817 x 2
playerID salary
<chr> <dbl>
1 ahmedni01 50.8
2 anderch01 51.2
3 chafian01 50.8
4 collmjo01 140.
5 corbipa01 52.4
6 delarru01 51.6
7 delgara01 52.6
8 goldspa01 310.
9 gosewtu01 51.4
10 hellije01 428.
# ... with 807 more rows
## 三振次數(SO),出場數(G),安打數(H),打點數(RBI),盜壘數(SB)
tbl_batting <- tbl_df(Batting) %>% select(playerID,SO,G,H,RBI,SB) %>% group_by(playerID) %>% summarise_all(.funs = funs(sum))
tbl_batting
# A tibble: 18,915 x 6
playerID SO G H RBI SB
<chr> <int> <int> <int> <int> <int>
1 aardsda01 2 331 0 0 0
2 aaronha01 1383 3298 3771 2297 240
3 aaronto01 145 437 216 94 9
4 aasedo01 3 448 0 0 0
5 abadan01 5 15 2 0 0
6 abadfe01 5 315 1 0 0
7 abadijo01 3 12 11 5 1
8 abbated01 NA 855 772 324 142
9 abbeybe01 54 79 38 17 3
10 abbeych01 NA 451 492 280 93
# ... with 18,905 more rows
## 接殺數(PO)
tbl_fielding <- tbl_df(Fielding) %>% select(playerID,PO) %>% group_by(playerID) %>% summarise_all(.funs = funs(sum))
tbl_fielding
# A tibble: 18,714 x 2
playerID PO
<chr> <int>
1 aardsda01 11
2 aaronha01 7436
3 aaronto01 1317
4 aasedo01 67
5 abadan01 37
6 abadfe01 7
7 abadijo01 129
8 abbated01 1872
9 abbeybe01 17
10 abbeych01 917
# ... with 18,704 more rows
## 球員的年資(finalGame-debut)
tbl_master <- tbl_df(Master) %>% select(playerID,debut,finalGame) %>%
mutate(seniority = as.double( (as.Date(finalGame)-as.Date(debut)))/365) %>% select(playerID,seniority)
tbl_master
# A tibble: 19,105 x 2
playerID seniority
<chr> <dbl>
1 aardsda01 11.4
2 aaronha01 22.5
3 aaronto01 9.47
4 aasedo01 13.2
5 abadan01 4.59
6 abadfe01 6.17
7 abadijo01 0.123
8 abbated01 13.0
9 abbeybe01 4.28
10 abbeych01 4.01
# ... with 19,095 more rows
## 全部 join 到 tbl_salary
tbl_salary <- tbl_salary %>% inner_join(tbl_batting, by = "playerID") %>% inner_join(tbl_fielding, by = "playerID") %>% inner_join(tbl_master, by = "playerID")
tbl_salary
# A tibble: 817 x 9
playerID salary SO G H RBI SB PO seniority
<chr> <dbl> <int> <int> <int> <int> <int> <int> <dbl>
1 ahmedni01 50.8 149 249 171 58 9 328 2.07
2 anderch01 51.2 64 81 10 4 0 25 2.39
3 chafian01 50.8 2 101 1 1 0 5 1.89
4 collmjo01 140. 56 203 19 6 0 48 5.46
5 corbipa01 52.4 65 109 30 14 0 18 4.42
6 delarru01 51.6 29 89 10 3 0 50 5.32
7 delgara01 52.6 33 236 14 2 0 26 5.30
8 goldspa01 310. 739 779 844 507 99 6769 5.18
9 gosewtu01 51.4 77 126 78 30 2 825 3.17
10 hellije01 428. 41 176 19 11 0 64 6.16
# ... with 807 more rows
再來做預測
lm_fit <- lm(salary ~ SO+G+H+RBI+SB+PO+seniority, data = tbl_salary)
summary(lm_fit)
Call:
lm(formula = salary ~ SO + G + H + RBI + SB + PO + seniority,
data = tbl_salary)
Residuals:
Min 1Q Median 3Q Max
-1232.97 -188.90 -64.50 72.13 2769.83
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -79.42321 30.61144 -2.595 0.00964 **
SO -0.04146 0.11099 -0.374 0.70885
G -0.70347 0.14942 -4.708 2.94e-06 ***
H -0.04074 0.17138 -0.238 0.81215
RBI 1.51656 0.25889 5.858 6.81e-09 ***
SB 1.26193 0.39719 3.177 0.00154 **
PO 0.00970 0.01063 0.913 0.36161
seniority 86.66915 6.64989 13.033 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 403.9 on 809 degrees of freedom
Multiple R-squared: 0.4667, Adjusted R-squared: 0.462
F-statistic: 101.1 on 7 and 809 DF, p-value: < 2.2e-16
依照以上資料所述,球員的年資(finalGame-debut)對預測薪水回歸函式影響最大,代表著隨著球員年資多 1 年,薪水會多出 86 萬
lm_fit %>% car::vif()
SO G H RBI SB PO seniority
8.734464 24.645852 42.735451 26.946870 2.929629 2.296168 3.078570
實務上可以發現三振次數(SO),出場數(G),安打數(H),打點數(RBI)的 VIF 值偏高,而嘗試只留打點數(RBI)後發現
lm_fit_fix <- lm(salary ~ RBI+SB+PO+seniority, data = tbl_salary)
summary(lm_fit_fix)
Call:
lm(formula = salary ~ RBI + SB + PO + seniority, data = tbl_salary)
Residuals:
Min 1Q Median 3Q Max
-1405.65 -194.80 -69.33 75.18 2828.59
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.101e+02 3.033e+01 -3.628 0.000303 ***
RBI 6.304e-01 9.118e-02 6.914 9.54e-12 ***
SB 2.669e-01 2.903e-01 0.919 0.358190
PO 1.211e-03 1.071e-02 0.113 0.909999
seniority 6.235e+01 4.455e+00 13.997 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 412.5 on 812 degrees of freedom
Multiple R-squared: 0.4416, Adjusted R-squared: 0.4389
F-statistic: 160.6 on 4 and 812 DF, p-value: < 2.2e-16
其實預測結果 \(R^2\) 並沒有下降很多
lm_fit_fix %>% car::vif()
RBI SB PO seniority
3.204785 1.500886 2.234828 1.324624
也確認 VIF 值沒有特別高的部分,接下來針對離群值來分析
influence(lm_fit)$hat %>% plot(type = "h", ylab = "Value of Leverage Statistic")
influence(lm_fit_fix)$hat %>% plot(type = "h", ylab = "Value of Leverage Statistic")
第一張圖是原始的模型,第二張圖只留下打點數(RBI),盜壘數(SB),接殺數(PO)與球員的年資(finalGame-debut),可以看出離群值的結果稍微好一點點