# STATEMENT OF AI USE:
# - Used AI to help interpret some regression and F-statistic results.
# - Used AI to help check for collinearity.
# - Used AI to check for any errors or inefficiencies.

# Step 1: Aggregate Payroll Data by Team and Year
# Calculate total payroll for each team and season
team_payroll <- Salaries %>%
  group_by(teamID, yearID) %>%
  summarize(payroll = sum(salary, na.rm = TRUE), .groups = "drop")
# Step 2: Merge Payroll Data with Teams Data
# Join team payroll with Teams data on teamID and yearID
Teams <- Teams %>%
  left_join(team_payroll, by = c("teamID", "yearID"))

# Step 3: Remove Rows with Missing Payroll Data
# Filter out teams with NA payroll values to clean the dataset
Teams <- Teams %>%
  filter(!is.na(payroll))

# Step 4: Compute Team Batting Average (hits per at-bat)
Teams <- Teams %>%
  mutate(TeamBA = H / AB)
# Moved #4 up in the order because other questions needed models 1 and 2 defined.
# 4. Multiple Regressions

# Model 1
model1 <- lm(W ~ payroll + ERA, data = Teams)


# Model 2
model2 <- lm(W ~ payroll + TeamBA + HR, data = Teams)


#Display results
stargazer(model1, model2, type = "text", title = "Multiple Regression Results", digits = 2)
## 
## Multiple Regression Results
## ==================================================================
##                                  Dependent variable:              
##                     ----------------------------------------------
##                                           W                       
##                               (1)                    (2)          
## ------------------------------------------------------------------
## payroll                    0.0000***              0.0000***       
##                             (0.00)                  (0.00)        
##                                                                   
## ERA                        -11.70***                              
##                             (0.56)                                
##                                                                   
## TeamBA                                            197.55***       
##                                                    (31.71)        
##                                                                   
## HR                                                 0.08***        
##                                                     (0.01)        
##                                                                   
## Constant                   125.60***                13.32*        
##                             (2.45)                  (7.89)        
##                                                                   
## ------------------------------------------------------------------
## Observations                  918                    918          
## R2                           0.37                    0.19         
## Adjusted R2                  0.36                    0.19         
## Residual Std. Error     9.44 (df = 915)        10.68 (df = 914)   
## F Statistic         263.37*** (df = 2; 915) 70.98*** (df = 3; 914)
## ==================================================================
## Note:                                  *p<0.1; **p<0.05; ***p<0.01
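
The payroll coefficient prints as 0.0000 in the table because payroll is measured in dollars, so a one-dollar change moves wins by a very small amount. The sketch below illustrates, under stated assumptions, how rescaling a predictor changes the reported coefficient without changing the fit. The data frame `toy` and all of its values are made up for illustration; they are not the Lahman data.

```r
# Hypothetical toy data (NOT the Lahman data): payroll in dollars and wins
set.seed(1)
toy <- data.frame(payroll = runif(50, 2e7, 2e8))
toy$W <- 70 + 6e-8 * toy$payroll + rnorm(50, sd = 5)

# Fit the same regression with payroll in dollars and in millions of dollars
fit_dollars <- lm(W ~ payroll, data = toy)
toy$payroll_m <- toy$payroll / 1e6
fit_millions <- lm(W ~ payroll_m, data = toy)

# The slope is multiplied by 1e6, while R-squared stays the same
slope_dollars <- unname(coef(fit_dollars)["payroll"])
slope_millions <- unname(coef(fit_millions)["payroll_m"])
r2_dollars <- summary(fit_dollars)$r.squared
r2_millions <- summary(fit_millions)$r.squared
print(c(slope_dollars = slope_dollars, slope_millions = slope_millions))
print(c(r2_dollars = r2_dollars, r2_millions = r2_millions))
```

Applied to this project, one option would be to create a payroll column in millions of dollars before fitting, so that the regression table reports wins per additional million dollars instead of a value that rounds to zero.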

1. Summarize Prior Work

#In the previous project prompts, I analyzed the relationship between team payroll and win percentage in MLB. My key steps were:
# - Aggregating payroll data for teams across seasons
# - Merging payroll data with team performance metrics (wins)
# - Conducting a simple regression analysis to assess whether higher payrolls correlated with more wins

#My key findings were:
# - A positive correlation between payroll and wins was observed, but the effect size was minimal.
# - Omitted variable bias, reverse causality, and other potential confounders were seen as limitations.

#Improvements for inference:
# - Incorporate multiple regression analysis to control for additional variables, such as market size and player/coach quality.
# - Evaluate the assumptions for Ordinary Least Squares regression.
# - Use hypothesis testing for robust conclusions.

2. Proposed Multiple Regression Models

Model 1:

\(W_i = \beta_0 + \beta_1 \cdot Payroll_i + \beta_2 \cdot TeamERA_i + \upsilon_i\)
# Rationale: Adding a team’s earned run average (ERA), which is how many earned runs a team gives up per 9 innings, accounts for pitching quality, which is a critical determinant of a team’s success.

Model 2:

\(W_i = \beta_0 + \beta_1 \cdot Payroll_i + \beta_2 \cdot TeamBA_i + \beta_3 \cdot TeamHR_i + \upsilon_i\)
# Rationale: Including batting average (BA) and home runs (HR) captures offensive contributions, which complement payroll’s impact on wins, further demonstrating player quality.

# 3. Checking Classical Assumptions

#1. Linearity: The relationship between the dependent variable W and each independent variable (payroll, ERA, TeamBA, and HR) must be linear.

# Scatterplot to check for linearity
ggplot(Teams, aes(x = payroll, y = W)) + geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(Teams, aes(x = ERA, y = W)) + geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(Teams, aes(x = TeamBA, y = W)) + geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(Teams, aes(x = HR, y = W)) + geom_point() + geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

#Check: All four scatterplots show roughly linear trends, so the linearity assumption appears to be met.

#2. No perfect collinearity: No independent variable should be an exact linear combination of the others, meaning there should be no exact linear relationship among the predictors.
# Calculate correlation matrix for independent variables
cor_matrix <- Teams %>%
  select(payroll, ERA, TeamBA, HR) %>%
  cor()

# Calculate VIF to check for collinearity
# install.packages("car")  # run once in the console if car is not installed
library(car)
## Loading required package: carData
## 
## Attaching package: 'carData'
## 
## The following object is masked _by_ '.GlobalEnv':
## 
##     Salaries
## 
## The following object is masked from 'package:Lahman':
## 
##     Salaries
## 
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
vif_model <- lm(W ~ payroll + ERA + TeamBA + HR, data = Teams)
vif(vif_model)
##  payroll      ERA   TeamBA       HR 
## 1.160638 1.198991 1.218125 1.350869
#Check: All variables have variance inflation factors (VIFs) between 1 and 2. A VIF of 1 means no collinearity, so there is some collinearity among the predictors, but the values are far below the common threshold of 10, indicating multicollinearity is not a problem here.
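
To make the VIF numbers easier to read, here is a minimal sketch of what a variance inflation factor measures: for predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The predictors `x1`, `x2`, and `x3` are made-up toy variables, not the Lahman columns, and this is an illustration of the formula rather than the internals of `car::vif()`.

```r
# Hypothetical toy predictors (NOT the Lahman data)
set.seed(2)
n <- 200
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)  # moderately correlated with x1
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing
# predictor j on all of the other predictors
vif_by_hand <- sapply(names(toy), function(v) {
  aux <- lm(reformulate(setdiff(names(toy), v), response = v), data = toy)
  1 / (1 - summary(aux)$r.squared)
})
print(vif_by_hand)

# Equivalent shortcut: the diagonal of the inverse correlation matrix
vif_from_cor <- diag(solve(cor(toy)))
print(vif_from_cor)
```

A VIF can never fall below 1, and it grows as a predictor becomes easier to reproduce from the others.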

#3. Random Sampling: The data should be a random sample from the population, ensuring the results are generalizable.
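#Check: The dataset includes MLB teams across many seasons. It is closer to the full set of team-seasons with salary data than to a random draw, and the same teams recur across years, so we assume rather than verify that it approximates random sampling.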

#4. Exogeneity: The predictors (independent variables) should not be correlated with the error term (unobserved factors). This means no omitted variable bias or reverse causality.
#Check: Exogeneity is unlikely to hold fully. There is likely some reverse causality: we model wins (y) as a function of payroll, TeamBA, ERA, and HR, but payroll could itself be a result of wins, since winning attracts more fans and revenue, which allows a larger payroll. Omitted factors that affect both payroll and wins are also plausible because so many baseball statistics are correlated. We do not correct for this here; it is something to keep in mind when interpreting the coefficients.

#5. Homoskedasticity: The variance of the error terms should be constant across all levels of the independent variables. In other words, the spread of residuals should not depend on the value of the predictors.

# Residual plot to check for homoskedasticity
residuals <- resid(model1)
fitted_values <- fitted(model1)
ggplot(data.frame(fitted_values, residuals), aes(x = fitted_values, y = residuals)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  ggtitle("Residuals vs Fitted Values (Model 1)")
## `geom_smooth()` using formula = 'y ~ x'

residuals <- resid(model2)
fitted_values <- fitted(model2)
ggplot(data.frame(fitted_values, residuals), aes(x = fitted_values, y = residuals)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  ggtitle("Residuals vs Fitted Values (Model 2)")
## `geom_smooth()` using formula = 'y ~ x'

#Check:
#For both models the residuals are randomly scattered around zero with no discernible pattern, so the assumption of homoskedasticity appears to hold.
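
A visual check can be complemented by a formal test. The sketch below computes a studentized (Koenker) Breusch-Pagan statistic by hand in base R: regress the squared residuals on the same predictors and compare n times the auxiliary R-squared with a chi-squared distribution. The data frame `toy` is simulated for illustration and is not the Lahman data; for a real analysis, `bptest()` from the `lmtest` package offers a packaged version of this test.

```r
# Hypothetical toy data (NOT the Lahman data)
set.seed(3)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = toy)

# Auxiliary regression: squared residuals on the same predictors
toy$e2 <- resid(fit)^2
aux <- lm(e2 ~ x1 + x2, data = toy)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary regression,
# compared with a chi-squared distribution with k = 2 degrees of freedom
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 2, lower.tail = FALSE)
print(c(statistic = bp_stat, p_value = bp_pval))
```

A small p-value would be evidence against constant error variance; a large one is consistent with homoskedasticity but does not prove it.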

#5. Interpret Results:

#Model 1:
# Key findings:
# Payroll: The coefficient is statistically significant (p<0.01) but very close to 0 because payroll is measured in dollars, suggesting that while payroll has a positive effect on wins, the effect size is small after controlling for ERA.
# ERA: The coefficient (-11.70) is highly significant (p<0.01) and negative, indicating that teams with a lower ERA, which means better pitching, tend to win more games. For every one-unit decrease in ERA, a team is expected to win approximately 11.7 additional games, holding payroll constant.
# Constant: A team with zero payroll and an ERA of zero would theoretically win 125.6 games, which is an unrealistic scenario, but the intercept is necessary for the model.
# Model fit:
# \(R^2=0.37\): About 37% of the variance in team wins is explained by payroll and ERA.
# Residual standard error: 9.44 wins, suggesting moderate unexplained variation.
# F-statistic: 263.37 (p<0.01), indicating that payroll and ERA are jointly significant.

Model 2:

Key findings:

Payroll: Similar to Model 1, payroll is statistically significant but its effect size is extremely small, suggesting minimal practical impact.

TeamBA: The coefficient (197.55) is highly significant (p<0.01), indicating that batting average has a strong positive relationship with wins. A 0.01 increase in team batting average (e.g., 0.270 to 0.280) is associated with nearly 2 more wins.

HR: The coefficient (0.08) is highly significant (p<0.01), suggesting that each additional home run is associated with a small but measurable increase of 0.08 wins.

Constant: The intercept (13.32) indicates the baseline number of wins when payroll, TeamBA, and HR are all zero, even though this scenario will never happen.

Model fit:

\(R^2 = 0.19\): About 19% of the variance in wins is explained by payroll, batting average, and home runs. This is lower than Model 1, which suggests that ERA explains more of the variation in wins than batting average and home runs do.

Residual standard error: 10.68 wins, which is larger than in Model 1 (9.44).

F-statistic: 70.98 (p<0.01), indicating that payroll, batting average, and home runs are jointly significant.

# 6. Confidence Intervals

  confint(model1, level = 0.95)
##                     2.5 %        97.5 %
## (Intercept)  1.207836e+02  1.304073e+02
## payroll      4.462666e-08  7.295973e-08
## ERA         -1.279862e+01 -1.059362e+01
  confint(model2, level = 0.95)
##                     2.5 %       97.5 %
## (Intercept) -2.155539e+00 2.879801e+01
## payroll      2.722480e-08 6.114901e-08
## TeamBA       1.353236e+02 2.597860e+02
## HR           5.689590e-02 9.895758e-02
#Interpretation of results:
#Model1:
  #Intercept - CI: [120.78,130.41]
    #Interpretation: The expected number of wins for a team with zero payroll and an ERA of zero (hypothetically) lies between 120.78 and 130.41. Although this scenario is unrealistic, the intercept is needed to anchor the regression line.
  #Payroll - CI: [4.46 * 10^-8, 7.30 * 10^-8]
    #Interpretation: Payroll's effect on wins is small but consistently positive. A one-unit increase in payroll ($1) is associated with between 4.46 * 10^-8 and 7.30 * 10^-8 additional wins, which is roughly 0.45 to 0.73 wins per additional $10 million. This highlights payroll's limited direct impact after controlling for ERA.
  #ERA - CI: [-12.80, -10.59]
    #Interpretation: ERA has a strong negative effect on wins. For every one-unit decrease in ERA, a team is expected to win between 10.59 and 12.80 additional games. This narrow interval, which lies far from zero, underscores ERA's critical role in determining team success.
  
#Model 2:
  #Intercept - CI: [-2.16, 28.80]
    #Interpretation: The baseline number of wins when payroll, TeamBA, and HR are all zero lies between -2.16 and 28.80. The inclusion of zero means the intercept is not statistically significant at the 5% level, which is unsurprising given the unrealistic scenario of all predictors being zero.
  #Payroll - CI: [2.72 * 10^-8, 6.11 * 10^-8]
    #Interpretation: As in Model 1, payroll's impact on wins is small but positive, reinforcing its limited practical significance.
  #TeamBA - CI: [135.32,259.79]
    #Interpretation: Team batting average has a large and statistically significant positive effect on wins. For a 0.01 increase in batting average, wins increase by 1.35 to 2.60 games. This demonstrates the importance of offensive consistency.
  #Home Runs - CI: [0.057,0.099]
    #Interpretation: Home runs contribute positively to wins. Each additional home run is associated with an increase in wins between 0.057 and 0.099. While the effect size is small, it is statistically significant.
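
The intervals reported by `confint()` follow the usual formula: estimate plus or minus a t critical value times the standard error. As a rough check with the rounded table values for ERA, -11.70 ± 1.96 × 0.56 gives about [-12.80, -10.60], which agrees with the reported interval up to rounding. The sketch below reproduces `confint()` by hand on a simulated data frame `toy`, which is made up for illustration and is not the Lahman data.

```r
# Hypothetical toy data (NOT the Lahman data)
set.seed(4)
toy <- data.frame(x = rnorm(100))
toy$y <- 3 - 2 * toy$x + rnorm(100)
fit <- lm(y ~ x, data = toy)

# Estimates and standard errors from the coefficient table
est <- coef(summary(fit))[, "Estimate"]
se <- coef(summary(fit))[, "Std. Error"]

# 95% interval: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df.residual(fit))
ci_by_hand <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(ci_by_hand)

# Built-in equivalent for comparison
ci_builtin <- confint(fit, level = 0.95)
print(ci_builtin)
```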
# 7. F-test for joint significance

anova(model1)
## Analysis of Variance Table
## 
## Response: W
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## payroll     1   8315    8315   93.25 < 2.2e-16 ***
## ERA         1  38652   38652  433.48 < 2.2e-16 ***
## Residuals 915  81588      89                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model2)
## Analysis of Variance Table
## 
## Response: W
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## payroll     1   8315  8314.8  72.890 < 2.2e-16 ***
## TeamBA      1   9944  9944.4  87.175 < 2.2e-16 ***
## HR          1   6032  6032.4  52.882 7.609e-13 ***
## Residuals 914 104263   114.1                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Results:
  #The null hypothesis that the slope coefficients are jointly zero is strongly rejected for both models. Note that anova() on a single model reports sequential F-tests, one per term as it is added (for example, 433.48 for ERA after payroll in Model 1 and 87.18 for TeamBA after payroll in Model 2), and each of these is significant at p<0.01. The joint test is the overall F-statistic in the regression table: 263.37 (df = 2; 915) for Model 1 and 70.98 (df = 3; 914) for Model 2, both with p<0.01. These results show that the independent variables in both models collectively make a meaningful contribution to predicting team wins.
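
Calling `anova()` on a single model gives sequential F-tests, one per term. A minimal sketch of a joint test, assuming the goal is to test all slope coefficients at once, is to compare the full model against an intercept-only model. The data frame `toy` is simulated for illustration and is not the Lahman data.

```r
# Hypothetical toy data (NOT the Lahman data)
set.seed(5)
n <- 150
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 0.5 * toy$x1 + 0.3 * toy$x2 + rnorm(n)

full <- lm(y ~ x1 + x2, data = toy)
null <- lm(y ~ 1, data = toy)  # intercept-only model

# Joint test of H0: all slope coefficients are zero
joint <- anova(null, full)
f_joint <- joint$F[2]

# The same statistic appears at the bottom of summary(full)
f_summary <- unname(summary(full)$fstatistic["value"])
print(c(f_joint = f_joint, f_summary = f_summary))
```

In this project the equivalent calls would compare `model1` and `model2` against an intercept-only model of `W`, which should reproduce the F statistics shown in the regression table.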
# 8. Analysis of Residuals

#Residuals for Model 1
residuals1 <- resid(model1)
ggplot(data.frame(residuals1), aes(x = residuals1)) +
  geom_density(fill = "lightblue", alpha = 0.5) +
  ggtitle("Density of Residuals (Model 1)") +
  xlab("Residuals")

# Residuals for Model 2
residuals2 <- resid(model2)
ggplot(data.frame(residuals2), aes(x = residuals2)) +
  geom_density(fill = "lightgreen", alpha = 0.5) +
  ggtitle("Density of Residuals (Model 2)") +
  xlab("Residuals")

#Analysis:
  #Model 1:
    # - The residuals exhibit a bell-shaped distribution, suggesting approximate normality.
    # - The peak is centered around 0, which is consistent with a well-fitted model.
  #Model 2:
    # - The residuals in Model 2 also show a bell-shaped curve and are centered around 0.
    # - The spread (variance) appears slightly wider compared to Model 1, which is consistent with Model 2's larger residual standard error (10.68 vs. 9.44) and means Model 2 leaves more of the variation in wins unexplained.

#Conclusion: The residual density plots for both multiple regression models suggest that the error terms are approximately normally distributed. Both distributions exhibit a symmetric, bell-shaped curve centered around zero, which aligns with the normality assumption in regression analysis. Comparing the two models reveals a difference in the spread of the residuals: Model 1 appears to have a slightly narrower distribution, indicating less variability in its residuals. This is consistent with Model 1 capturing more of the variability in wins (\(R^2\) of 0.37 versus 0.19, and a residual standard error of 9.44 versus 10.68). Overall, both models appear to meet the assumption of normality for the residuals, but Model 1 leaves less unexplained variation.
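
Density plots can hide departures in the tails, so a Q-Q plot and a formal test are common companions. The sketch below shows both in base R on a simulated data frame `toy`, which is made up for illustration and is not the Lahman data. With roughly 900 observations, as in this project, the Shapiro-Wilk test can flag even minor departures, so its result is best read alongside the plots.

```r
# Hypothetical toy data (NOT the Lahman data)
set.seed(6)
toy <- data.frame(x = rnorm(120))
toy$y <- 2 + toy$x + rnorm(120)
fit <- lm(y ~ x, data = toy)
res <- resid(fit)

# Q-Q plot: points close to the line suggest approximately normal residuals
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)
print(sw$p.value)
```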

9. Multiple Regression Analysis Inferences:

Model 1: Payroll and ERA

Payroll: Although statistically significant, the effect of payroll on wins is minimal, suggesting that while higher payrolls may lead to more wins, their direct impact is limited after accounting for pitching quality.

ERA: The coefficient for ERA is strongly negative and highly significant, indicating that teams with better pitching (lower ERA) tend to win more games. This highlights the critical role of pitching in team success.

Model Fit: About 37% of the variance in wins is explained by payroll and ERA, showing a moderately strong explanatory power.

Model 2: Payroll, TeamBA, and HR

Payroll: Similarly, payroll is statistically significant but its practical impact remains negligible.

TeamBA: Batting average has a strong positive effect on wins, reflecting the importance of consistent offensive performance.

HR: Home runs also contribute positively to wins. Their per-unit coefficient is much smaller than that of batting average, although the two variables are measured on very different scales.

Model Fit: Only 19% of the variance in wins is explained, which is lower than Model 1. This suggests that ERA (captured in Model 1) plays a more dominant role in predicting wins than batting metrics like TeamBA and HR.

Comparison to simple regression analysis:

In earlier simple regression analysis (using only payroll as a predictor), payroll showed a positive but weak association with wins. However:

Control for Confounding Variables: In the multiple regression models, additional predictors (ERA, TeamBA, HR) reveal stronger and more meaningful relationships with wins than payroll alone.

Improved Explanatory Power: The inclusion of ERA in Model 1 substantially increases the explanatory power compared to the simple regression, highlighting the importance of pitching quality in baseball.

Different Model Fits: The better fit of Model 1 compared to Model 2 indicates that defensive metrics (like ERA) are more critical for predicting wins than offensive metrics (like batting average and home runs).

Sources of differences:

Omitted Variable Bias: The simple regression analysis suffered from omitted variable bias, as it did not account for key factors like ERA, batting average, or home runs, leading to an overestimation of payroll’s effect on wins.

Additional Predictors: Multiple regression controls for these confounding variables, isolating the unique contribution of each predictor.

Variance Explained: The inclusion of predictors like ERA (Model 1) explains more variance in wins compared to offensive metrics (Model 2), indicating that different aspects of team performance have varying levels of importance.

10. Where does the project go from here?

I think that I’ve laid a solid foundation for analyzing payroll. By including offensive and defensive metrics, I’m able to get a better understanding of which component of the game contributes more to team success. The analysis can be made more specific, though, and I think that’s where this project can go next. There are many different positions in baseball, which means that breaking performance down into two categories (offense and defense) is not specific enough. If I could analyze results by position (starting pitcher, relief pitcher, outfield, shortstop, etc.), then I could estimate which positions teams should be giving bigger contracts to. That way teams can buy the right players based on which positions matter most for success, allowing them to spend their payroll as efficiently as possible.