The initial regression model is given by:
\[ \widehat{\log(salary)} = 11.19 + 0.0689 \cdot years + 0.0126 \cdot gamesyr + 0.00098 \cdot bavg + 0.0144 \cdot hrunsyr + 0.018 \cdot rbisyr \]
Where:

- years: Number of years the player has been in the league
- gamesyr: Games per year
- bavg: Batting average
- hrunsyr: Home runs per year
- rbisyr: Runs batted in per year

Given:

- \(n = 353\)
- \(SSR = 183.186\)
- \(R^2 = 0.6278\)
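Because the dependent variable is \(\log(salary)\), each slope in the stated equation can be read as an approximate proportional change in salary. As a worked illustration (an added example, not part of the original output), the coefficient on years implies:

\[ \%\Delta \widehat{salary} \approx 100 \cdot 0.0689 = 6.89\% \text{ per additional year, holding the other variables fixed} \]

The approximation is good for small coefficients. The exact figure is slightly larger:

\[ 100 \cdot \left(e^{0.0689} - 1\right) \approx 7.13\% \]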
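library(readxl)  # read_excel()
library(dplyr)   # %>% and filter()
library(broom)   # tidy()
mlb1 <- read_excel("mlb1.xlsx")
mlb1 <- na.omit(mlb1)  # drop rows with a missing value in any column
head(mlb1)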
## # A tibble: 6 × 47
## allstar atbats atbatsyr bavg bb black blackpop blckpb blckph catcher
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 75 6705 559. 289 619 0 1547725 0 0 0
## 2 25 3333 417. 259 137 1 1547725 18.8 10.9 0
## 3 0 2807 561. 299 341 0 1547725 0 0 0
## 4 0 3337 417. 245 306 0 1547725 0 0 0
## 5 0 3603 300. 258 316 1 1547725 18.8 10.9 0
## 6 11.8 7489 441. 286 416 1 1547725 18.8 10.9 0
## # ℹ 37 more variables: doubles <dbl>, fldperc <dbl>, frstbase <dbl>,
## # games <dbl>, gamesyr <dbl>, hispan <dbl>, hisppb <dbl>, hispph <dbl>,
## # hisppop <dbl>, hits <dbl>, hruns <dbl>, hrunsyr <dbl>, lsalary <dbl>,
## # nl <dbl>, outfield <dbl>, pcinc <dbl>, percblck <dbl>, perchisp <dbl>,
## # percwhte <dbl>, rbis <dbl>, rbisyr <dbl>, runs <dbl>, runsyr <dbl>,
## # salary <dbl>, sbases <dbl>, sbasesyr <dbl>, scndbase <dbl>, shrtstop <dbl>,
## # slugavg <dbl>, so <dbl>, teamsal <dbl>, thrdbase <dbl>, triples <dbl>, …
Full model (including rbisyr):

full_model <- lm(lsalary ~ years + gamesyr + bavg + hrunsyr + rbisyr, data = mlb1)
summary(full_model)
##
## Call:
## lm(formula = lsalary ~ years + gamesyr + bavg + hrunsyr + rbisyr,
## data = mlb1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0117 -0.4642 -0.0524 0.4629 2.6709
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.164772 0.352595 31.665 < 2e-16 ***
## years 0.073822 0.012751 5.789 1.67e-08 ***
## gamesyr 0.011410 0.002773 4.115 4.92e-05 ***
## bavg 0.001317 0.001424 0.925 0.356
## hrunsyr 0.010137 0.016643 0.609 0.543
## rbisyr 0.011915 0.007453 1.599 0.111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7319 on 324 degrees of freedom
## Multiple R-squared: 0.6101, Adjusted R-squared: 0.6041
## F-statistic: 101.4 on 5 and 324 DF, p-value: < 2.2e-16
tidy(full_model)
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 11.2 0.353 31.7 3.38e-101
## 2 years 0.0738 0.0128 5.79 1.67e- 8
## 3 gamesyr 0.0114 0.00277 4.11 4.92e- 5
## 4 bavg 0.00132 0.00142 0.925 3.56e- 1
## 5 hrunsyr 0.0101 0.0166 0.609 5.43e- 1
## 6 rbisyr 0.0119 0.00745 1.60 1.11e- 1
Restricted model (dropping rbisyr):

restricted_model <- lm(lsalary ~ years + gamesyr + bavg + hrunsyr, data = mlb1)
summary(restricted_model)
##
## Call:
## lm(formula = lsalary ~ years + gamesyr + bavg + hrunsyr, data = mlb1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.05255 -0.46456 -0.02836 0.47082 2.71721
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.929535 0.321197 34.028 < 2e-16 ***
## years 0.072366 0.012749 5.676 3.05e-08 ***
## gamesyr 0.014909 0.001707 8.732 < 2e-16 ***
## bavg 0.002004 0.001361 1.472 0.142
## hrunsyr 0.033936 0.007463 4.547 7.69e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7336 on 325 degrees of freedom
## Multiple R-squared: 0.607, Adjusted R-squared: 0.6022
## F-statistic: 125.5 on 4 and 325 DF, p-value: < 2.2e-16
tidy(restricted_model)
## # A tibble: 5 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 10.9 0.321 34.0 3.76e-109
## 2 years 0.0724 0.0127 5.68 3.05e- 8
## 3 gamesyr 0.0149 0.00171 8.73 1.34e- 16
## 4 bavg 0.00200 0.00136 1.47 1.42e- 1
## 5 hrunsyr 0.0339 0.00746 4.55 7.69e- 6
Comparison of hrunsyr coefficients and significance:

full_coef <- tidy(full_model) %>% filter(term == "hrunsyr")
rest_coef <- tidy(restricted_model) %>% filter(term == "hrunsyr")
comparison <- data.frame(
Model = c("Full Model", "Restricted Model"),
Coefficient = c(full_coef$estimate, rest_coef$estimate),
P_Value = c(full_coef$p.value, rest_coef$p.value)
)
comparison
## Model Coefficient P_Value
## 1 Full Model 0.01013740 5.428752e-01
## 2 Restricted Model 0.03393551 7.685337e-06
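In the full model, hrunsyr is not statistically significant (p ≈ 0.543). In the restricted model, where rbisyr is dropped, it becomes highly significant (p ≈ 0.0000077), and its coefficient rises from about 0.010 to about 0.034.

This shift suggests multicollinearity between hrunsyr and rbisyr. Including both in the model inflates the standard error of hrunsyr (0.0166 in the full model versus 0.0075 in the restricted model) and may obscure its individual impact. Removing rbisyr gives a more precise estimate, although the larger coefficient then partly reflects the effect of the omitted rbisyr.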
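The pattern described above can be reproduced on simulated data. The sketch below is illustrative only: x1 and x2 are hypothetical stand-ins for two overlapping performance measures, not the mlb1 variables, and the correlation of 0.9 is an assumption chosen for the demonstration. It compares the standard error of one coefficient with and without its correlated partner and reports a variance inflation factor.

```r
# Hypothetical simulated data (not mlb1): x2 is built to be highly
# correlated with x1, loosely mimicking two overlapping measures.
set.seed(42)
n  <- 330
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)
y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n)

both <- lm(y ~ x1 + x2)  # both correlated regressors included
only <- lm(y ~ x1)       # x2 dropped

# Standard error of the x1 coefficient in each fit
se_both <- coef(summary(both))["x1", "Std. Error"]
se_only <- coef(summary(only))["x1", "Std. Error"]

# Variance inflation factor for x1: 1 / (1 - R^2) from regressing x1 on x2
r2_aux <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2_aux)

print(c(cor = cor(x1, x2), se_both = se_both, se_only = se_only, vif = vif_x1))
```

On the real data, the analogous checks would be the correlation between hrunsyr and rbisyr and the R-squared from regressing one on the other regressors.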
Expanded model (adding runsyr, fldperc, and sbasesyr):

expanded_model <- lm(lsalary ~ years + gamesyr + bavg + hrunsyr + rbisyr + runsyr + fldperc + sbasesyr, data = mlb1)
summary(expanded_model)
##
## Call:
## lm(formula = lsalary ~ years + gamesyr + bavg + hrunsyr + rbisyr +
## runsyr + fldperc + sbasesyr, data = mlb1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.13031 -0.45220 -0.07121 0.46968 2.56783
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.8523334 2.2091271 4.912 1.44e-06 ***
## years 0.0759872 0.0127027 5.982 5.88e-09 ***
## gamesyr 0.0066853 0.0032791 2.039 0.04229 *
## bavg 0.0006762 0.0015110 0.447 0.65483
## hrunsyr 0.0127690 0.0166279 0.768 0.44310
## rbisyr 0.0038911 0.0080932 0.481 0.63099
## runsyr 0.0167610 0.0056457 2.969 0.00321 **
## fldperc 0.0005902 0.0021463 0.275 0.78350
## sbasesyr -0.0081117 0.0058711 -1.382 0.16805
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7248 on 321 degrees of freedom
## Multiple R-squared: 0.6211, Adjusted R-squared: 0.6117
## F-statistic: 65.78 on 8 and 321 DF, p-value: < 2.2e-16
tidy(expanded_model)
## # A tibble: 9 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 10.9 2.21 4.91 0.00000144
## 2 years 0.0760 0.0127 5.98 0.00000000588
## 3 gamesyr 0.00669 0.00328 2.04 0.0423
## 4 bavg 0.000676 0.00151 0.447 0.655
## 5 hrunsyr 0.0128 0.0166 0.768 0.443
## 6 rbisyr 0.00389 0.00809 0.481 0.631
## 7 runsyr 0.0168 0.00565 2.97 0.00321
## 8 fldperc 0.000590 0.00215 0.275 0.784
## 9 sbasesyr -0.00811 0.00587 -1.38 0.168
new_vars <- tidy(expanded_model) %>% filter(term %in% c("runsyr", "fldperc", "sbasesyr"))
new_vars
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 runsyr 0.0168 0.00565 2.97 0.00321
## 2 fldperc 0.000590 0.00215 0.275 0.784
## 3 sbasesyr -0.00811 0.00587 -1.38 0.168
This step examines the individual significance of the three added variables: runsyr, fldperc, and sbasesyr. Significance is assessed at the 5% level.

Including additional performance metrics gives a broader view of salary determinants. Of the three new variables, only runsyr (runs per year) is an individually significant contributor to salary (p = 0.00321 < 0.05); fldperc (p = 0.784) and sbasesyr (p = 0.168) are not. The change in the significance of hrunsyr when rbisyr is dropped highlights multicollinearity between those two variables.
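As a check on how the individual p-values arise, the t statistic for runsyr can be recomputed by hand from the reported estimate and standard error. This is a minimal sketch that copies the three numbers from the summary output; the variable names are illustrative.

```r
# Values copied from the expanded-model summary for runsyr
estimate  <- 0.0167610
std_error <- 0.0056457
df_resid  <- 321          # residual degrees of freedom

# t statistic and two-sided p-value under H0: coefficient = 0
t_stat  <- estimate / std_error
p_value <- 2 * pt(abs(t_stat), df = df_resid, lower.tail = FALSE)

print(c(t = t_stat, p = p_value))
```

The result should agree with the t value and Pr(>|t|) columns up to rounding.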
Joint significance test for bavg, fldperc, and sbasesyr:

restricted_joint <- lm(lsalary ~ years + gamesyr + hrunsyr + rbisyr + runsyr, data = mlb1)
anova(restricted_joint, expanded_model)
## Analysis of Variance Table
##
## Model 1: lsalary ~ years + gamesyr + hrunsyr + rbisyr + runsyr
## Model 2: lsalary ~ years + gamesyr + bavg + hrunsyr + rbisyr + runsyr +
## fldperc + sbasesyr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 324 169.75
## 2 321 168.63 3 1.1182 0.7095 0.5469
Null hypothesis: the coefficients on bavg, fldperc, and sbasesyr are all zero. Alternative hypothesis: bavg, fldperc, and sbasesyr are jointly significant.

The test gives F = 0.7095 with a p-value of 0.5469, well above 0.05, so we fail to reject the null hypothesis: bavg, fldperc, and sbasesyr are not jointly significant in explaining salary. In other words, there is insufficient evidence that these three variables together add explanatory power to the model.
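The F statistic in the anova table can be reproduced from the residual sums of squares using F = ((RSS_r - RSS_ur) / q) / (RSS_ur / df_ur), where q is the number of restrictions. The sketch below plugs in the figures printed in the table; the variable names are illustrative.

```r
# Values copied from the anova() table
rss_unrestricted <- 168.63  # RSS of the expanded (unrestricted) model
sum_of_sq        <- 1.1182  # RSS(restricted) - RSS(unrestricted)
q                <- 3       # restrictions: bavg, fldperc, sbasesyr
df_unrestricted  <- 321     # residual df of the unrestricted model

# F statistic and its upper-tail p-value
f_stat  <- (sum_of_sq / q) / (rss_unrestricted / df_unrestricted)
p_value <- pf(f_stat, df1 = q, df2 = df_unrestricted, lower.tail = FALSE)

print(round(c(F = f_stat, p = p_value), 4))
```

Using the printed Sum of Sq rather than the difference of the rounded RSS values keeps the result close to the table's F.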
library(stargazer)  # regression comparison tables
stargazer(full_model, restricted_model, expanded_model, type = "html", title = "Model Comparison")
Dependent variable: lsalary

| | (1) | (2) | (3) |
| --- | --- | --- | --- |
| years | 0.074*** | 0.072*** | 0.076*** |
| | (0.013) | (0.013) | (0.013) |
| gamesyr | 0.011*** | 0.015*** | 0.007** |
| | (0.003) | (0.002) | (0.003) |
| bavg | 0.001 | 0.002 | 0.001 |
| | (0.001) | (0.001) | (0.002) |
| hrunsyr | 0.010 | 0.034*** | 0.013 |
| | (0.017) | (0.007) | (0.017) |
| rbisyr | 0.012 | | 0.004 |
| | (0.007) | | (0.008) |
| runsyr | | | 0.017*** |
| | | | (0.006) |
| fldperc | | | 0.001 |
| | | | (0.002) |
| sbasesyr | | | -0.008 |
| | | | (0.006) |
| Constant | 11.165*** | 10.930*** | 10.852*** |
| | (0.353) | (0.321) | (2.209) |
| Observations | 330 | 330 | 330 |
| \(R^2\) | 0.610 | 0.607 | 0.621 |
| Adjusted \(R^2\) | 0.604 | 0.602 | 0.612 |
| Residual Std. Error | 0.732 (df = 324) | 0.734 (df = 325) | 0.725 (df = 321) |
| F Statistic | 101.384*** (df = 5; 324) | 125.489*** (df = 4; 325) | 65.776*** (df = 8; 321) |

Note: *p<0.1; **p<0.05; ***p<0.01. Standard errors in parentheses.