Model Specification

The initial regression model is given by:

\[ \widehat{\log(salary)} = 11.19 + 0.0689 \cdot years + 0.0126 \cdot gamesyr + 0.00098 \cdot bavg + 0.0144 \cdot hrunsyr + 0.018 \cdot rbisyr \]

Where:

- years: Number of years the player has been in the league
- gamesyr: Games per year
- bavg: Batting average
- hrunsyr: Home runs per year
- rbisyr: Runs batted in per year

Given:

- \(n = 353\)
- \(SSR = 183.186\)
- \(R^2 = 0.6278\)
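
As a worked reading of the equation above: because the dependent variable is in logs, each slope is approximately a proportional effect on salary, holding the other regressors fixed. For years, the usual approximation and the exact conversion (a standard log-level result, not something computed in this document) are:

\[ \%\Delta \widehat{salary} \approx 100 \cdot 0.0689 = 6.89\% \text{ per additional year} \]

\[ 100 \cdot \left( e^{0.0689} - 1 \right) \approx 7.13\% \]

The two figures are close because the coefficient is small; the gap widens for larger coefficients.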

Data Import

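library(readxl)  # read_excel()
library(broom)   # tidy()
library(dplyr)   # %>% and filter()
mlb1 <- read_excel("mlb1.xlsx")
mlb1 <- na.omit(mlb1)  # drops every row that has an NA in any column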
head(mlb1)
## # A tibble: 6 × 47
##   allstar atbats atbatsyr  bavg    bb black blackpop blckpb blckph catcher
##     <dbl>  <dbl>    <dbl> <dbl> <dbl> <dbl>    <dbl>  <dbl>  <dbl>   <dbl>
## 1    75     6705     559.   289   619     0  1547725    0      0         0
## 2    25     3333     417.   259   137     1  1547725   18.8   10.9       0
## 3     0     2807     561.   299   341     0  1547725    0      0         0
## 4     0     3337     417.   245   306     0  1547725    0      0         0
## 5     0     3603     300.   258   316     1  1547725   18.8   10.9       0
## 6    11.8   7489     441.   286   416     1  1547725   18.8   10.9       0
## # ℹ 37 more variables: doubles <dbl>, fldperc <dbl>, frstbase <dbl>,
## #   games <dbl>, gamesyr <dbl>, hispan <dbl>, hisppb <dbl>, hispph <dbl>,
## #   hisppop <dbl>, hits <dbl>, hruns <dbl>, hrunsyr <dbl>, lsalary <dbl>,
## #   nl <dbl>, outfield <dbl>, pcinc <dbl>, percblck <dbl>, perchisp <dbl>,
## #   percwhte <dbl>, rbis <dbl>, rbisyr <dbl>, runs <dbl>, runsyr <dbl>,
## #   salary <dbl>, sbases <dbl>, sbasesyr <dbl>, scndbase <dbl>, shrtstop <dbl>,
## #   slugavg <dbl>, so <dbl>, teamsal <dbl>, thrdbase <dbl>, triples <dbl>, …

Full Model (With rbisyr)

full_model <- lm(lsalary ~ years + gamesyr + bavg + hrunsyr + rbisyr, data = mlb1)
summary(full_model)
## 
## Call:
## lm(formula = lsalary ~ years + gamesyr + bavg + hrunsyr + rbisyr, 
##     data = mlb1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0117 -0.4642 -0.0524  0.4629  2.6709 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.164772   0.352595  31.665  < 2e-16 ***
## years        0.073822   0.012751   5.789 1.67e-08 ***
## gamesyr      0.011410   0.002773   4.115 4.92e-05 ***
## bavg         0.001317   0.001424   0.925    0.356    
## hrunsyr      0.010137   0.016643   0.609    0.543    
## rbisyr       0.011915   0.007453   1.599    0.111    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7319 on 324 degrees of freedom
## Multiple R-squared:  0.6101, Adjusted R-squared:  0.6041 
## F-statistic: 101.4 on 5 and 324 DF,  p-value: < 2.2e-16
tidy(full_model)
## # A tibble: 6 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) 11.2       0.353      31.7   3.38e-101
## 2 years        0.0738    0.0128      5.79  1.67e-  8
## 3 gamesyr      0.0114    0.00277     4.11  4.92e-  5
## 4 bavg         0.00132   0.00142     0.925 3.56e-  1
## 5 hrunsyr      0.0101    0.0166      0.609 5.43e-  1
## 6 rbisyr       0.0119    0.00745     1.60  1.11e-  1

Restricted Model (Without rbisyr)

restricted_model <- lm(lsalary ~ years + gamesyr + bavg + hrunsyr, data = mlb1)
summary(restricted_model)
## 
## Call:
## lm(formula = lsalary ~ years + gamesyr + bavg + hrunsyr, data = mlb1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.05255 -0.46456 -0.02836  0.47082  2.71721 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.929535   0.321197  34.028  < 2e-16 ***
## years        0.072366   0.012749   5.676 3.05e-08 ***
## gamesyr      0.014909   0.001707   8.732  < 2e-16 ***
## bavg         0.002004   0.001361   1.472    0.142    
## hrunsyr      0.033936   0.007463   4.547 7.69e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7336 on 325 degrees of freedom
## Multiple R-squared:  0.607,  Adjusted R-squared:  0.6022 
## F-statistic: 125.5 on 4 and 325 DF,  p-value: < 2.2e-16
tidy(restricted_model)
## # A tibble: 5 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) 10.9       0.321       34.0  3.76e-109
## 2 years        0.0724    0.0127       5.68 3.05e-  8
## 3 gamesyr      0.0149    0.00171      8.73 1.34e- 16
## 4 bavg         0.00200   0.00136      1.47 1.42e-  1
## 5 hrunsyr      0.0339    0.00746      4.55 7.69e-  6

Comparison of hrunsyr Coefficients and Significance

full_coef <- tidy(full_model) %>% filter(term == "hrunsyr")
rest_coef <- tidy(restricted_model) %>% filter(term == "hrunsyr")

comparison <- data.frame(
  Model = c("Full Model", "Restricted Model"),
  Coefficient = c(full_coef$estimate, rest_coef$estimate),
  P_Value = c(full_coef$p.value, rest_coef$p.value)
)
comparison
##              Model Coefficient      P_Value
## 1       Full Model  0.01013740 5.428752e-01
## 2 Restricted Model  0.03393551 7.685337e-06

Interpretation

Conclusion

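Dropping rbisyr raises the hrunsyr coefficient from 0.0101 (p = 0.543) to 0.0339 (p < 0.001) and cuts its standard error from 0.0166 to 0.0075. This shift suggests multicollinearity between hrunsyr and rbisyr. With both in the model, their shared variation makes the separate effect of hrunsyr hard to estimate precisely. Once rbisyr is removed, hrunsyr is estimated more precisely, but it also picks up the part of the salary effect that the two variables share.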
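
One way to check the multicollinearity claim directly is a variance inflation factor (VIF), computed from an auxiliary regression of one regressor on the others. The sketch below is a minimal version under stated assumptions: `vif_one` is a hypothetical helper name introduced here, not a function from any package, and it was not part of the original analysis.

```r
# Variance inflation factor for a single regressor.
# `vif_one` is a hypothetical helper (an assumption, not a package function).
#   data   : data frame holding the regressors
#   target : name of the regressor whose VIF is wanted
#   others : names of the remaining regressors in the model
vif_one <- function(data, target, others) {
  # Auxiliary regression: target ~ all other regressors
  aux <- lm(reformulate(others, response = target), data = data)
  # VIF = 1 / (1 - R^2 of the auxiliary regression)
  1 / (1 - summary(aux)$r.squared)
}
```

Assuming mlb1 is loaded as above, `vif_one(mlb1, "hrunsyr", c("years", "gamesyr", "bavg", "rbisyr"))` gives the VIF for hrunsyr in the full model, and `cor(mlb1$hrunsyr, mlb1$rbisyr)` gives the pairwise correlation. A common rule of thumb treats a VIF above about 10 as a sign of problematic collinearity, though that cutoff is a convention rather than a formal test.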

Expanded Model (Including runsyr, fldperc, sbasesyr)

expanded_model <- lm(lsalary ~ years + gamesyr + bavg + hrunsyr + rbisyr + runsyr + fldperc + sbasesyr, data = mlb1)
summary(expanded_model)
## 
## Call:
## lm(formula = lsalary ~ years + gamesyr + bavg + hrunsyr + rbisyr + 
##     runsyr + fldperc + sbasesyr, data = mlb1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.13031 -0.45220 -0.07121  0.46968  2.56783 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.8523334  2.2091271   4.912 1.44e-06 ***
## years        0.0759872  0.0127027   5.982 5.88e-09 ***
## gamesyr      0.0066853  0.0032791   2.039  0.04229 *  
## bavg         0.0006762  0.0015110   0.447  0.65483    
## hrunsyr      0.0127690  0.0166279   0.768  0.44310    
## rbisyr       0.0038911  0.0080932   0.481  0.63099    
## runsyr       0.0167610  0.0056457   2.969  0.00321 ** 
## fldperc      0.0005902  0.0021463   0.275  0.78350    
## sbasesyr    -0.0081117  0.0058711  -1.382  0.16805    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7248 on 321 degrees of freedom
## Multiple R-squared:  0.6211, Adjusted R-squared:  0.6117 
## F-statistic: 65.78 on 8 and 321 DF,  p-value: < 2.2e-16
tidy(expanded_model)
## # A tibble: 9 × 5
##   term         estimate std.error statistic       p.value
##   <chr>           <dbl>     <dbl>     <dbl>         <dbl>
## 1 (Intercept) 10.9        2.21        4.91  0.00000144   
## 2 years        0.0760     0.0127      5.98  0.00000000588
## 3 gamesyr      0.00669    0.00328     2.04  0.0423       
## 4 bavg         0.000676   0.00151     0.447 0.655        
## 5 hrunsyr      0.0128     0.0166      0.768 0.443        
## 6 rbisyr       0.00389    0.00809     0.481 0.631        
## 7 runsyr       0.0168     0.00565     2.97  0.00321      
## 8 fldperc      0.000590   0.00215     0.275 0.784        
## 9 sbasesyr    -0.00811    0.00587    -1.38  0.168

Significance of New Variables

new_vars <- tidy(expanded_model) %>% filter(term %in% c("runsyr", "fldperc", "sbasesyr"))
new_vars
## # A tibble: 3 × 5
##   term      estimate std.error statistic p.value
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>
## 1 runsyr    0.0168     0.00565     2.97  0.00321
## 2 fldperc   0.000590   0.00215     0.275 0.784  
## 3 sbasesyr -0.00811    0.00587    -1.38  0.168

Interpretation

Conclusion

The inclusion of additional performance metrics gives a broader view of salary determinants. The change in significance of hrunsyr when rbisyr is dropped highlights multicollinearity. Of the three new variables in the expanded model, only runsyr (runs per year) is individually significant at the 5% level (p = 0.00321); fldperc (p = 0.784) and sbasesyr (p = 0.168) are not.
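
The individual significance of runsyr can be reproduced by hand from the expanded model's coefficient table. The sketch below copies the estimate, standard error, and residual degrees of freedom from that output; the variable names are illustrative only.

```r
# Values copied from the expanded_model summary above
est <- 0.0167610   # coefficient on runsyr
se  <- 0.0056457   # its standard error
df  <- 321         # residual degrees of freedom

t_stat  <- est / se                       # t statistic for H0: coefficient = 0
p_value <- 2 * pt(-abs(t_stat), df = df)  # two-sided p-value from the t distribution

round(c(t = t_stat, p = p_value), 5)
```

Up to rounding, the result should agree with the t value and p-value that summary() reports for runsyr.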

Joint Significance Test of bavg, fldperc, and sbasesyr

restricted_joint <- lm(lsalary ~ years + gamesyr + hrunsyr + rbisyr + runsyr, data = mlb1)
anova(restricted_joint, expanded_model)
## Analysis of Variance Table
## 
## Model 1: lsalary ~ years + gamesyr + hrunsyr + rbisyr + runsyr
## Model 2: lsalary ~ years + gamesyr + bavg + hrunsyr + rbisyr + runsyr + 
##     fldperc + sbasesyr
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    324 169.75                           
## 2    321 168.63  3    1.1182 0.7095 0.5469

Interpretation

Conclusion

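The joint test gives F = 0.7095 with a p-value of 0.5469, so we fail to reject the null hypothesis that the coefficients on bavg, fldperc, and sbasesyr are all zero. Taken together, these three variables do not add significant explanatory power for log salary once the other regressors are included.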
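
The F statistic in the ANOVA table can be recovered from its own entries. In the notation assumed here, \(SSR_r\) and \(SSR_{ur}\) are the residual sums of squares of the restricted and unrestricted models, \(q = 3\) is the number of restrictions, and \(n - k - 1 = 321\) is the residual degrees of freedom of the unrestricted model:

\[ F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)} = \frac{1.1182/3}{168.63/321} \approx \frac{0.3727}{0.5253} \approx 0.7095 \]

This matches the reported value, which is well below any conventional critical value for an F distribution with 3 and 321 degrees of freedom.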

Appendix: Model Output Tables

library(stargazer)  # stargazer()
stargazer(full_model, restricted_model, expanded_model, type = "html", title = "Model Comparison")
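Model Comparison
=====================================================================================================
                                          Dependent variable: lsalary
                     --------------------------------------------------------------------------------
                     (1)                        (2)                        (3)
-----------------------------------------------------------------------------------------------------
years                0.074***                   0.072***                   0.076***
                     (0.013)                    (0.013)                    (0.013)
gamesyr              0.011***                   0.015***                   0.007**
                     (0.003)                    (0.002)                    (0.003)
bavg                 0.001                      0.002                      0.001
                     (0.001)                    (0.001)                    (0.002)
hrunsyr              0.010                      0.034***                   0.013
                     (0.017)                    (0.007)                    (0.017)
rbisyr               0.012                                                 0.004
                     (0.007)                                               (0.008)
runsyr                                                                     0.017***
                                                                           (0.006)
fldperc                                                                    0.001
                                                                           (0.002)
sbasesyr                                                                   -0.008
                                                                           (0.006)
Constant             11.165***                  10.930***                  10.852***
                     (0.353)                    (0.321)                    (2.209)
-----------------------------------------------------------------------------------------------------
Observations         330                        330                        330
R2                   0.610                      0.607                      0.621
Adjusted R2          0.604                      0.602                      0.612
Residual Std. Error  0.732 (df = 324)           0.734 (df = 325)           0.725 (df = 321)
F Statistic          101.384*** (df = 5; 324)   125.489*** (df = 4; 325)   65.776*** (df = 8; 321)
=====================================================================================================
Note: *p<0.1; **p<0.05; ***p<0.01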