HW1

library(Lahman)
Warning: package 'Lahman' was built under R version 4.4.3
url <- "https://www.rossmanchance.com/chance/stat470S25/hw/fakebb.txt"
df <- read.table(url, header = TRUE)
head(df)
   AB RBI League
1 533  42     AL
2 505  61     AL
3 540  53     AL
4 530  58     AL
5 481  40     AL
6 464  48     AL

Simple Regression

model_ab <- lm(RBI ~ AB, data = df)
summary(model_ab)

Call:
lm(formula = RBI ~ AB, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-52.775 -16.115   1.697  15.551  60.155 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 314.93204    6.40239   49.19   <2e-16 ***
AB           -0.51855    0.01538  -33.72   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.78 on 198 degrees of freedom
Multiple R-squared:  0.8517,    Adjusted R-squared:  0.8509 
F-statistic:  1137 on 1 and 198 DF,  p-value: < 2.2e-16

At-Bats coefficient (-0.519): for each additional at-bat, we expect RBI to decrease by about 0.52 on average, indicating a negative linear relationship between RBI and AB in this simple model.
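As a quick check of what that slope implies, here is a small sketch using the model already fit above; the 500 and 550 at-bat values are hypothetical inputs chosen for illustration.

# Sketch: predicted RBI from the simple model at two illustrative AB values.
predict(model_ab, newdata = data.frame(AB = c(500, 550)))
# The second prediction is about 50 * 0.519 ≈ 26 RBI lower, matching the slope.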

model_ab_league <- lm(RBI ~ AB + League, data = df)
summary(model_ab_league)

Call:
lm(formula = RBI ~ AB + League, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-28.1804  -6.3759  -0.2709   6.0080  23.2048 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -49.87887   12.39089  -4.025 8.11e-05 ***
AB            0.19975    0.02468   8.094 5.88e-14 ***
LeagueNL    149.13172    4.94290  30.171  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.21 on 197 degrees of freedom
Multiple R-squared:  0.9736,    Adjusted R-squared:  0.9733 
F-statistic:  3634 on 2 and 197 DF,  p-value: < 2.2e-16

AB coefficient (0.199): for each additional at-bat, we expect RBI to increase by about 0.20 on average, once the player's league is taken into account.

LeagueNL (149.13): since League is a categorical variable, this coefficient is the expected difference in RBI between a National League (NL) player and an American League (AL) player with the same number of At-Bats; NL players are predicted to have about 149 more RBI, holding AB constant.
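A small sketch makes that interpretation concrete by predicting RBI for an AL and an NL player at the same number of at-bats; the 500 AB value is hypothetical.

# Sketch: same AB, different league; the gap between the two predictions
# should equal the LeagueNL coefficient (~149 RBI).
predict(model_ab_league,
        newdata = data.frame(AB = c(500, 500), League = c("AL", "NL")))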

The At-Bats coefficient changes because adding League isolates the effect of at-bats on RBI within each league, rather than letting AB absorb the league difference on its own. What is interesting is the sign flip from -0.52 (each at-bat associated with fewer RBIs, which is suspicious) to +0.20. In the simple model, At-Bats appeared negatively related to RBIs because of a hidden variable, League: in the pooled data AB and RBI look negatively related, but that is an artifact of league differences, where one league (AL) has more at-bats but fewer RBIs.
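A quick group summary (a sketch using the same df) shows the league-level pattern that drives the reversal:

# Sketch: average AB and RBI by league; if AL has higher AB but lower RBI,
# pooling the leagues can flip the apparent AB-RBI relationship.
library(dplyr)
df %>%
  group_by(League) %>%
  summarise(mean_AB = mean(AB), mean_RBI = mean(RBI), n = n())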

library(ggplot2)

ggplot(df, aes(x = AB, y = RBI)) +
  # SLR: One line across all data (gray)
  geom_smooth(method = "lm", se = FALSE, color = "gray40", linetype = "dashed") +

  # MLR: Separate lines for each league
  geom_smooth(aes(color = League), method = "lm", se = FALSE, linewidth = 1.2) +

  # Points
  geom_point(aes(color = League), alpha = 0.7) +

  labs(
    title = "Comparing Simple vs Multiple Regression Lines",
    subtitle = "RBI vs At-Bats (AB), with and without League accounted for",
    x = "At-Bats (AB)",
    y = "Runs Batted In (RBI)",
    color = "League"
  ) +
  theme_minimal(base_size = 14)

The gray dashed line shows the SLR trend, explaining RBI with AB alone. The colored lines are fit separately within each league, which is approximately what the multiple regression captures: because the additive model has a single AB slope, its fitted lines are parallel, with National League players expected to have about 149 more RBIs than American League players at the same number of at-bats.
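To draw the additive model's own fitted lines (rather than separate per-league fits), a sketch like the following overlays the predictions from model_ab_league:

# Sketch: overlay the parallel fitted lines implied by the additive model.
library(ggplot2)
df$fitted_mlr <- predict(model_ab_league)   # fitted values from model_ab_league

ggplot(df, aes(x = AB, y = RBI, color = League)) +
  geom_point(alpha = 0.7) +
  geom_line(aes(y = fitted_mlr), linewidth = 1.2) +  # same slope, NL shifted up ~149
  theme_minimal(base_size = 14)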

Derek Jeter Batting Averages: 1995 (.250) 1996 (.319) Combined 95-96 (.311)

David Justice Batting Averages: 1995 (.253) 1996 (.321) Combined (.270)

Justice has the higher average in each individual season, yet Jeter's combined average over the two seasons is higher, the same kind of aggregation reversal seen above with AB and League.
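As a rough check, the same comparison can be rebuilt from Lahman's Batting table; the playerIDs below are assumptions (jeterde01 for Jeter, justida01 for Justice), and the averages may differ slightly from the figures quoted above depending on the source.

# Sketch: yearly and combined batting averages from Lahman.
# playerIDs are assumed: "jeterde01" (Jeter) and "justida01" (Justice).
library(Lahman)
library(dplyr)

jj <- Batting %>%
  filter(playerID %in% c("jeterde01", "justida01"), yearID %in% c(1995, 1996))

jj %>%
  group_by(playerID, yearID) %>%
  summarise(AVG = sum(H) / sum(AB), .groups = "drop")   # season averages

jj %>%
  group_by(playerID) %>%
  summarise(combined_AVG = sum(H) / sum(AB))            # 1995-96 combined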

1) Runs Created Comparison

library(dplyr)

trout_2016 <- Batting %>%
  filter(playerID == "troutmi01", yearID == 2016) %>%
  mutate(
    HBP = ifelse(is.na(HBP), 0, as.numeric(HBP)),
    X1B = H - X2B - X3B - HR,
    TB = X1B + 2 * X2B + 3 * X3B + 4 * HR,
    RC = (H + BB + HBP) * TB / (AB + BB + HBP),
    RC_per_game = RC / G
  ) %>%
  select(yearID, AB, H, BB, HBP, TB, RC, G, RC_per_game)

trout_2016
  yearID  AB   H  BB HBP  TB       RC   G RC_per_game
1   2016 549 173 116  11 302 134.0237 159   0.8429162
Bryant_2016 <- Batting %>%
  filter(playerID == "bryankr01", yearID == 2016) %>%
  mutate(
    HBP = ifelse(is.na(HBP), 0, as.numeric(HBP)),
    X1B = H - X2B - X3B - HR,
    TB = X1B + 2 * X2B + 3 * X3B + 4 * HR,
    RC = (H + BB + HBP) * TB / (AB + BB + HBP),
    RC_per_game = RC / G
  ) %>%
  select(yearID, AB, H, BB, HBP, TB, RC, G, RC_per_game)

Bryant_2016
  yearID  AB   H BB HBP  TB       RC   G RC_per_game
1   2016 603 176 75  18 334 129.0891 155   0.8328328
Cabrera_2013 <- Batting %>%
  filter(playerID == "cabremi01", yearID == 2013) %>%
  mutate(
    HBP = ifelse(is.na(HBP), 0, as.numeric(HBP)),
    X1B = H - X2B - X3B - HR,
    TB = X1B + 2 * X2B + 3 * X3B + 4 * HR,
    RC = (H + BB + HBP) * TB / (AB + BB + HBP),
    RC_per_game = RC / G
  ) %>%
  select(yearID, AB, H, BB, HBP, TB, RC, G, RC_per_game)

Cabrera_2013
  yearID  AB   H BB HBP  TB       RC   G RC_per_game
1   2013 555 193 90   5 353 156.4062 148    1.056798

Mike Trout and Kris Bryant were very close in the 2016 MVP race, each winning the award in his respective league. Their Runs Created per game were almost identical: Trout at 0.843 per game and Bryant at 0.833 per game. Bryant hit more home runs (39) and had more extra-base hits, and his team, the Cubs, was the most dominant team in the league and won the World Series. He had a historic season, but Cabrera's 2013 season displayed peak power production: he hit 44 home runs, drove in 137 RBIs, and created 156 runs in 148 games.

Let's compute Runs Created per Game (RC/Game) for all four players, then compare each to Nori Aoki in 2016 (as our “average MLB player”).

# Function to compute RC/Game for any player and year
get_rc_per_game <- function(player_id, year) {
  Batting %>%
    filter(playerID == player_id, yearID == year) %>%
    mutate(
      HBP = ifelse(is.na(HBP), 0, as.numeric(HBP)),
      X1B = H - X2B - X3B - HR,
      TB = X1B + 2 * X2B + 3 * X3B + 4 * HR,
      RC = (H + BB + HBP) * TB / (AB + BB + HBP),
      RC_per_game = RC / G
    ) %>%
    select(playerID, yearID, G, AB, H, BB, HBP, TB, RC, RC_per_game)
}

trout_2016 <- get_rc_per_game("troutmi01", 2016)
bryant_2016 <- get_rc_per_game("bryankr01", 2016)
cabrera_2013 <- get_rc_per_game("cabremi01", 2013)
aoki_2016 <- get_rc_per_game("aokino01", 2016)

# Combine for comparison
rc_comparison <- bind_rows(trout_2016, bryant_2016, cabrera_2013, aoki_2016)
rc_comparison
   playerID yearID   G  AB   H  BB HBP  TB       RC RC_per_game
1 troutmi01   2016 159 549 173 116  11 302 134.0237   0.8429162
2 bryankr01   2016 155 603 176  75  18 334 129.0891   0.8328328
3 cabremi01   2013 148 555 193  90   5 353 156.4062   1.0567983
4  aokino01   2016 118 417 118  34   9 162  56.7000   0.4805085

The elite hitters produced roughly 1.7 to 2.2 times as many Runs Created per game as the average MLB player (Aoki).
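Those ratios can be computed directly from the comparison table above; a small sketch using rc_comparison:

# Sketch: each player's RC per game relative to Aoki's 2016 rate.
rc_comparison %>%
  mutate(ratio_vs_aoki = RC_per_game / RC_per_game[playerID == "aokino01"])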

2) Team OBP and SLG (2010-2016)

library(tidyverse)

# Create a team-level dataset for 2010–2016
team_data <- Teams |>
  filter(yearID >= 2010, yearID <= 2016) |>
  mutate(
    # Calculate 1B (singles)
    X1B = H - X2B - X3B - HR,
    
    # Total Bases for SLG
    TB = X1B + 2 * X2B + 3 * X3B + 4 * HR,
    
    # Slugging Percentage
    SLG = TB / AB,
    
    # On-Base Percentage
    OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
  ) |>
  select(yearID, teamID, OBP, SLG, R)  # Include Runs for later regression
head(team_data)
  yearID teamID       OBP       SLG   R
1   2010    BAL 0.3156163 0.3862081 613
2   2010    BOS 0.3390768 0.4509387 818
3   2010    CHA 0.3317403 0.4199489 752
4   2010    CLE 0.3215859 0.3783488 646
5   2010    DET 0.3351937 0.4152047 751
6   2010    KCA 0.3310076 0.3993576 676
filter(team_data, teamID == "NYA")
  yearID teamID       OBP       SLG   R
1   2010    NYA 0.3498267 0.4359619 859
2   2011    NYA 0.3433812 0.4441827 867
3   2012    NYA 0.3369355 0.4531137 804
4   2013    NYA 0.3069241 0.3758488 650
5   2014    NYA 0.3068407 0.3800255 633
6   2015    NYA 0.3228599 0.4208730 764
7   2016    NYA 0.3145750 0.4052767 680
model_ops <- lm(R ~ OBP + SLG, data = team_data)
summary(model_ops)

Call:
lm(formula = R ~ OBP + SLG, data = team_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-61.367 -15.509   0.656  14.185  74.171 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -739.44      43.87  -16.86   <2e-16 ***
OBP          2345.91     192.42   12.19   <2e-16 ***
SLG          1703.24      93.15   18.29   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.52 on 207 degrees of freedom
Multiple R-squared:  0.8852,    Adjusted R-squared:  0.8841 
F-statistic: 797.8 on 2 and 207 DF,  p-value: < 2.2e-16
plot(model_ops, which = 1)  # Residuals vs Fitted

The residuals appear randomly scattered around zero; there is a slight pattern at the extremes of the fitted values, but it is not too worrying. An R^2 of 0.885 means that 88.5% of the variation in team runs (R) from 2010 to 2016 is explained by OBP and SLG. With p-values below 0.0001, both predictors are significant in the model. OBP has the larger coefficient, suggesting a given change in OBP is worth more runs than the same change in SLG, but it also has the larger standard error (and hence the smaller t-value), so its estimate is less precise than SLG's. Since both variables easily pass significance, that extra uncertainty is not a practical concern here.
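Because OBP and SLG are on different scales, comparing their raw coefficients can be misleading. One common check (a sketch, not part of the assignment) is to refit with standardized predictors so each coefficient reflects the effect of a one-standard-deviation change:

# Sketch: standardize OBP and SLG so their coefficients are directly comparable.
model_ops_std <- lm(R ~ scale(OBP) + scale(SLG), data = team_data)
summary(model_ops_std)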

3) Pythagorean Expectation

When comparing the Mariners' and Tigers' season records and run differentials, I like to use the simple rule of thumb that +10 runs of run differential is worth about 1 extra win. The Mariners' win percentage of .716 is .216 above .500, or about 35 wins above .500 over a 162-game season, which would suggest a run differential of roughly +350 (actual: +300). The Tigers' win percentage of .265 is .235 below .500, which translates to about 38 games below .500 and would suggest a run differential of roughly -380 (actual: -337). The rule of thumb lands in the right neighborhood for both teams, so the model produces sensible winning percentages.
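Since this section is about the Pythagorean expectation, here is a short sketch that computes it directly from Lahman's Teams table; it assumes the seasons referenced above are the 2001 Mariners (SEA) and 2003 Tigers (DET), which match the .716 and .265 records.

# Sketch: Pythagorean win% = R^2 / (R^2 + RA^2) vs. actual win% for the two seasons.
Teams %>%
  filter((yearID == 2001 & teamID == "SEA") | (yearID == 2003 & teamID == "DET")) %>%
  mutate(
    RD = R - RA,
    actual_wpct = W / (W + L),
    pythag_wpct = R^2 / (R^2 + RA^2)
  ) %>%
  select(yearID, teamID, W, L, RD, actual_wpct, pythag_wpct)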

Step 1: Playoff Series Data

library(tidyverse)

playoff_series <- tibble(
  yearID = c(2005, 2005, 2006, 2006, 2007, 2007),
  # Team codes follow Lahman teamID conventions (e.g., CHA for the White Sox,
  # NYN for the Mets) so the joins against Teams below find a match.
  teamIDwinner = c("CHA", "STL", "STL", "DET", "BOS", "COL"),
  teamIDloser  = c("HOU", "DET", "NYN", "OAK", "CLE", "ARI")
)
team_stats <- Teams %>%
  filter(yearID %in% playoff_series$yearID) %>%
  select(yearID, teamID, W, R, RA) %>%
  mutate(RD = R - RA)

Step 2: Join with Playoffs

# Join stats for winning teams
joined_data <- playoff_series %>%
  left_join(team_stats, by = c("yearID", "teamIDwinner" = "teamID")) %>%
  rename(W_winner = W, R_winner = R, RA_winner = RA, RD_winner = RD)

# Then join stats for losing teams
joined_data <- joined_data %>%
  left_join(team_stats, by = c("yearID", "teamIDloser" = "teamID")) %>%
  rename(W_loser = W, R_loser = R, RA_loser = RA, RD_loser = RD)
comparison <- joined_data %>%
  mutate(
    team_more_wins = ifelse(W_winner > W_loser, teamIDwinner,
                      ifelse(W_loser > W_winner, teamIDloser, NA)),
    team_more_rd = ifelse(RD_winner > RD_loser, teamIDwinner,
                    ifelse(RD_loser > RD_winner, teamIDloser, NA))
  )

Step 3: Accuracy and Visualization

comparison <- comparison %>%
  mutate(
    win_predicted_correct = teamIDwinner == team_more_wins,
    rd_predicted_correct = teamIDwinner == team_more_rd
  )
accuracy_percent <- comparison %>%
  summarise(
    Wins = sum(win_predicted_correct, na.rm = TRUE),
    Run_Diff = sum(rd_predicted_correct, na.rm = TRUE),
    total_series = n()
  ) %>%
  pivot_longer(cols = c(Wins, Run_Diff), names_to = "Method", values_to = "Correct") %>%
  mutate(Percentage = round(100 * Correct / total_series, 1))

ggplot(accuracy_percent, aes(x = Method, y = Percentage, fill = Method)) +
  geom_col(width = 0.6, color = "black") +
  geom_text(aes(label = paste0(Percentage, "%")), vjust = -0.5, size = 5) +
  labs(
    title = "Prediction Accuracy (% of Correct Series Picks)",
    x = "Prediction Method",
    y = "Accuracy (%)"
  ) +
  ylim(0, 100) +
  theme_minimal() +
  theme(legend.position = "none")

Clearly, run differential predicts playoff series winners more accurately than regular-season wins do. Still, 50% accuracy is not great, and this is likely because playoff series are small samples (best of 5 or 7), so random variation, hot streaks, and slumps have a stronger impact. Even teams with superior season-long stats can be upset by a handful of events, which diminishes our confidence in the predictor. Playoff baseball is also different: managers use tighter bullpens, shorter rotations, and different strategies in October, and individual stars or clutch performers can take over a series more easily, so team star power explains more of the variability in the postseason.

If a team trades Player A (150 runs created) for Player B (170 runs created), the expectation is an increase of about 20 runs, which translates to roughly 2 extra wins (about a 1.2% increase in win percentage). Before the trade the team would be expected to win 86 games (.531); after the trade, about 88 games (.543).
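The arithmetic above, written out as a small sketch (the 86-win baseline and the 10-runs-per-win rule are taken from the text; the trade is hypothetical):

# Sketch: runs-to-wins back-of-the-envelope for the hypothetical trade above.
runs_gained <- 170 - 150            # Player B creates 20 more runs than Player A
wins_gained <- runs_gained / 10     # ~10 runs per extra win
wins_before <- 86
wins_after  <- wins_before + wins_gained
c(wins_after  = wins_after,
  wpct_before = round(wins_before / 162, 3),
  wpct_after  = round(wins_after / 162, 3))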