Assignment 7 - Lucas Quintos

MLB Team Offense Analysis

I am looking at whether MLB teams that hit more home runs also score more runs. This is interesting because modern baseball often focuses on power hitting, but teams can also score by getting on base, hitting doubles, drawing walks, and being efficient on offense overall. As a Cubs fan, I have the impression that the team doesn’t rely on the home run ball, but is that true?

Data Collection

library(tidyverse)
library(httr)
library(rvest)

library(lubridate)
library(magrittr)

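set_config(
  user_agent(
    # Build the string with paste() so it contains no embedded line breaks
    paste(
      "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
      "AppleWebKit/537.36 (KHTML, like Gecko)",
      "Chrome/124.0 Safari/537.36"
    )
  )
)


mlb_scrape <- function(year_input) {

  baseball_url <-
    paste0(
      "https://www.baseball-reference.com/leagues/majors/",
      year_input,
      "-standard-batting.shtml"
    )

  
  # Fetch with httr::GET() so the user agent set above is sent with the
  # request; read_html() on a bare URL does not use httr's configuration
  baseball_html <-
    GET(baseball_url) %>%
    read_html()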

  Sys.sleep(3)

  
  batting_table <-
    baseball_html %>%
    html_element("#teams_standard_batting") %>%
    html_table()

  
  batting_table <-
    batting_table %>%
    filter(Tm != "LgAvg") %>%
    mutate(year = year_input)

  
  return(batting_table)

}


years <- c(2020, 2021, 2022, 2023, 2024)


mlb_data <- data.frame()

for(i in seq_along(years)) {

  mlb_data <-
    mlb_scrape(years[i]) %>%
    bind_rows(mlb_data)

  print(
    paste(
      "Collecting season",
      years[i]
    )
  )

  Sys.sleep(3)

}
[1] "Collecting season 2020"
[1] "Collecting season 2021"
[1] "Collecting season 2022"
[1] "Collecting season 2023"
[1] "Collecting season 2024"
mlb_data_clean <-
  mlb_data %>%
  select(
    year,
    Tm,
    G,
    R,
    H,
    HR,
    RBI,
    BB,
    SO,
    BA,
    OBP,
    SLG,
    OPS
  ) %>%
  mutate(
    
    G = as.numeric(G),
    R = as.numeric(R),
    H = as.numeric(H),
    HR = as.numeric(HR),
    RBI = as.numeric(RBI),
    BB = as.numeric(BB),
    SO = as.numeric(SO),
    BA = as.numeric(BA),
    OBP = as.numeric(OBP),
    SLG = as.numeric(SLG),
    OPS = as.numeric(OPS),

    runs_per_game = R / G,
    hr_per_game = HR / G
    
  )
Warning: There were 11 warnings in `mutate()`.
The first warning was:
ℹ In argument: `G = as.numeric(G)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 10 remaining warnings.
write_csv(
  mlb_data_clean,
  "mlb_batting_2020_2024.csv"
)


head(mlb_data_clean)
# A tibble: 6 × 15
   year Tm         G     R     H    HR   RBI    BB    SO    BA   OBP   SLG   OPS
  <dbl> <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  2024 Arizo…   162   886  1452   211   845   569  1265 0.263 0.337 0.44  0.777
2  2024 Atlan…   162   704  1333   213   674   485  1461 0.243 0.309 0.415 0.724
3  2024 Balti…   162   786  1391   235   759   489  1359 0.25  0.315 0.435 0.751
4  2024 Bosto…   162   751  1404   194   724   493  1570 0.252 0.319 0.423 0.741
5  2024 Chica…   162   736  1318   170   696   546  1362 0.242 0.317 0.393 0.71 
6  2024 Chica…   162   507  1187   133   485   395  1403 0.221 0.278 0.34  0.618
# ℹ 2 more variables: runs_per_game <dbl>, hr_per_game <dbl>
summary(mlb_data_clean)
      year           Tm                  G                R          
 Min.   :2020   Length:165         Min.   :  58.0   Min.   :  219.0  
 1st Qu.:2021   Class :character   1st Qu.: 162.0   1st Qu.:  619.5  
 Median :2022   Mode  :character   Median : 162.0   Median :  698.5  
 Mean   :2022                      Mean   : 269.8   Mean   : 1206.6  
 3rd Qu.:2023                      3rd Qu.: 162.0   3rd Qu.:  753.5  
 Max.   :2024                      Max.   :4860.0   Max.   :22432.0  
                                   NA's   :5        NA's   :5        
       H               HR              RBI                BB         
 Min.   :  390   Min.   :  51.0   Min.   :  204.0   Min.   :  147.0  
 1st Qu.: 1232   1st Qu.: 138.8   1st Qu.:  590.2   1st Qu.:  429.8  
 Median : 1306   Median : 176.5   Median :  667.0   Median :  492.5  
 Mean   : 2215   Mean   : 315.0   Mean   : 1153.3   Mean   :  857.6  
 3rd Qu.: 1374   3rd Qu.: 205.0   3rd Qu.:  724.5   3rd Qu.:  548.2  
 Max.   :40839   Max.   :5944.0   Max.   :21512.0   Max.   :15819.0  
 NA's   :5       NA's   :5        NA's   :5         NA's   :5        
       SO              BA              OBP              SLG        
 Min.   :  440   Min.   :0.2120   Min.   :0.2780   Min.   :0.3400  
 1st Qu.: 1218   1st Qu.:0.2377   1st Qu.:0.3090   1st Qu.:0.3877  
 Median : 1370   Median :0.2440   Median :0.3165   Median :0.4055  
 Mean   : 2308   Mean   :0.2444   Mean   :0.3163   Mean   :0.4070  
 3rd Qu.: 1453   3rd Qu.:0.2540   3rd Qu.:0.3250   3rd Qu.:0.4243  
 Max.   :42145   Max.   :0.2760   Max.   :0.3490   Max.   :0.5010  
 NA's   :5       NA's   :5        NA's   :5        NA's   :5       
      OPS         runs_per_game    hr_per_game   
 Min.   :0.6180   Min.   :3.130   Min.   :0.679  
 1st Qu.:0.6987   1st Qu.:4.173   1st Qu.:1.023  
 Median :0.7195   Median :4.472   Median :1.173  
 Mean   :0.7234   Mean   :4.494   Mean   :1.182  
 3rd Qu.:0.7500   3rd Qu.:4.787   3rd Qu.:1.323  
 Max.   :0.8450   Max.   :5.846   Max.   :1.967  
 NA's   :5        NA's   :5       NA's   :5      
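
The summary reports 165 rows rather than the 150 expected for 30 teams over five seasons, and maxima such as G = 4860 point to league-level rows that the `Tm != "LgAvg"` filter did not remove. The sketch below shows one way to keep only team rows. It rests on assumptions: the labels "League Average", "Tm", and the empty string are guesses at how the extra rows are named, `drop_non_team_rows()` is a hypothetical helper, and `toy` is a made-up stand-in for `mlb_data`. Running `count(mlb_data, Tm)` on the real data would show the actual labels.

```r
library(dplyr)

# Hypothetical labels for non-team rows; check them against the real table.
non_team_labels <- c("League Average", "LgAvg", "Tm", "")

# Made-up rows standing in for mlb_data (columns are character, as scraped).
toy <- tibble(
  Tm = c("Team A", "Team B", "Tm", "League Average", ""),
  G  = c("162", "162", "G", "162", "324"),
  R  = c("700", "650", "R", "675", "1350")
)

# Keep only rows whose Tm is present and is not one of the listed labels.
drop_non_team_rows <- function(df) {
  df %>%
    filter(!is.na(Tm), !Tm %in% non_team_labels)
}

team_rows <- drop_non_team_rows(toy)
team_rows
```

Applied before the `as.numeric()` conversions, a filter like this would also avoid the coercion warnings if the non-numeric rows are among those dropped.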

Analysis

mlb <-
  read_csv("mlb_batting_2020_2024.csv")
Rows: 165 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Tm
dbl (14): year, G, R, H, HR, RBI, BB, SO, BA, OBP, SLG, OPS, runs_per_game, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Analysis and Results

## Summary Statistics by Year

year_summary <-
  mlb %>%
  group_by(year) %>%
  summarize(
    average_runs = mean(R),
    average_hr = mean(HR),
    average_obp = mean(OBP),
    average_slg = mean(SLG),
    average_ops = mean(OPS)
  )

year_summary
# A tibble: 5 × 6
   year average_runs average_hr average_obp average_slg average_ops
  <dbl>        <dbl>      <dbl>       <dbl>       <dbl>       <dbl>
1  2020           NA         NA          NA          NA          NA
2  2021           NA         NA          NA          NA          NA
3  2022           NA         NA          NA          NA          NA
4  2023           NA         NA          NA          NA          NA
5  2024           NA         NA          NA          NA          NA
This table is meant to show the average offensive production for MLB teams by season, so that run scoring and home run production can be compared across the five-year period. Every average is NA because each season contains a row with missing values, and mean() returns NA unless na.rm = TRUE is set.
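
One way to get usable yearly averages is to skip missing values inside `mean()`. The following is a minimal sketch, using a small made-up tibble (`toy`) in place of `mlb`; the numbers are illustrative only. It handles missing values but does not remove any league-level rows, which would still need to be filtered out separately.

```r
library(dplyr)

# Made-up values standing in for mlb; one row per year has missing stats.
toy <- tibble(
  year = c(2023, 2023, 2023, 2024, 2024, 2024),
  R    = c(700, 650, NA, 720, 680, NA),
  HR   = c(200, 180, NA, 210, 190, NA)
)

year_summary_toy <-
  toy %>%
  group_by(year) %>%
  summarize(
    # na.rm = TRUE skips missing values instead of returning NA
    across(c(R, HR), \(x) mean(x, na.rm = TRUE), .names = "average_{.col}")
  )

year_summary_toy
```

The same `na.rm = TRUE` argument could be added to each `mean()` call in the original `summarize()` without switching to `across()`.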
## Average Runs by Year

mlb %>%
  group_by(year) %>%
  summarize(average_runs = mean(R)) %>%
  ggplot(aes(x = factor(year), y = average_runs)) +
  geom_col() +
  labs(
    title = "Average Runs Scored by MLB Teams by Year",
    x = "Year",
    y = "Average Runs"
  )
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_col()`).

This bar chart is meant to show how average team scoring changed from 2020 to 2024. Because every yearly average is NA, geom_col() removes all five bars, as the warning above reports.
## Average Home Runs by Year

mlb %>%
  group_by(year) %>%
  summarize(average_hr = mean(HR)) %>%
  ggplot(aes(x = factor(year), y = average_hr)) +
  geom_col() +
  labs(
    title = "Average Home Runs by MLB Teams by Year",
    x = "Year",
    y = "Average Home Runs"
  )
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_col()`).

This graph is meant to show whether average home run totals rose or fell across seasons. It has the same problem as the runs chart: all five yearly averages are NA, so no bars are drawn.
## Home Runs vs Runs Scored

ggplot(mlb, aes(x = HR, y = R)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Relationship Between Home Runs and Runs Scored",
    x = "Home Runs",
    y = "Runs Scored"
  )
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 5 rows containing missing values or values outside the scale range
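(`geom_point()`).

This scatterplot compares home runs and runs scored, with a fitted linear trend line. The five rows with missing values are dropped, while rows that appear to hold league-wide totals (HR as high as 5,944 in the summary) are still plotted and stretch both axes.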
## Correlation Between Home Runs and Runs


cor(mlb$HR, mlb$R)
[1] NA
The correlation value measures the strength of the linear relationship between home runs and runs scored. Here cor() returns NA because both columns contain missing values and the default setting, use = "everything", does not drop them.
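
A correlation can still be computed when some rows are missing by telling `cor()` to use only complete pairs. The sketch below uses made-up vectors (`hr`, `runs`) in place of `mlb$HR` and `mlb$R`, so the resulting value says nothing about the real data.

```r
# Made-up vectors standing in for mlb$HR and mlb$R; one pair is missing.
hr   <- c(150, 180, 200, NA, 220)
runs <- c(600, 670, 690, NA, 750)

# With the default use = "everything", any missing value makes the result NA.
r_default <- cor(hr, runs)

# use = "complete.obs" drops pairs where either value is missing.
r_complete <- cor(hr, runs, use = "complete.obs")

r_default
r_complete
```

On the real data the call would be `cor(mlb$HR, mlb$R, use = "complete.obs")`, ideally after any league-level rows have been removed.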
## OPS vs Runs Scored


ggplot(mlb, aes(x = OPS, y = R)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Relationship Between OPS and Runs Scored",
    x = "OPS",
    y = "Runs Scored"
  )
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 5 rows containing non-finite outside the scale range (`stat_smooth()`).
Removed 5 rows containing missing values or values outside the scale range
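(`geom_point()`).

This graph compares OPS and runs scored, with a fitted linear trend line. The five rows with missing values are dropped, and rows that appear to hold league-wide totals (R as high as 22,432 in the summary) remain in the plot.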
## Regression Model


model <-
  lm(R ~ HR + OBP + SLG, data = mlb)

summary(model)

Call:
lm(formula = R ~ HR + OBP + SLG, data = mlb)

Residuals:
    Min      1Q  Median      3Q     Max 
-789.40  -42.89    4.35   41.96  786.49 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.759e+01  2.481e+02  -0.272  0.78563    
HR           3.838e+00  1.088e-02 352.817  < 2e-16 ***
OBP          4.215e+03  1.338e+03   3.150  0.00196 ** 
SLG         -3.115e+03  6.354e+02  -4.902 2.36e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 119.5 on 156 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.9987,    Adjusted R-squared:  0.9987 
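F-statistic: 4.149e+04 on 3 and 156 DF,  p-value: < 2.2e-16
The regression model examines whether home runs, on-base percentage, and slugging percentage help explain scoring. Home runs and OBP have positive, statistically significant coefficients, while SLG has a negative one. The R-squared of 0.9987 should be read with caution: the data still appear to contain league-total rows (HR as high as 5,944), and residuals of nearly 800 runs suggest that a few extreme rows dominate the fit.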
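
To illustrate why an aggregate row matters for a regression like this one, the sketch below fits a simple model with and without a league-total row. Everything here is made up for illustration: `teams` holds invented team values, and the total row is just the column sums. It uses a single predictor rather than the three in the model above.

```r
# Made-up team rows (HR, R) plus a league-total row equal to the column sums.
teams <- data.frame(
  HR = c(150, 170, 185, 200, 215, 230),
  R  = c(640, 700, 660, 730, 710, 780)
)
with_total <- rbind(teams, data.frame(HR = sum(teams$HR), R = sum(teams$R)))

fit_all   <- lm(R ~ HR, data = with_total)
fit_teams <- lm(R ~ HR, data = teams)

# Leverage: the aggregate row sits far from the team rows on the HR axis.
leverage <- hatvalues(fit_all)
most_influential <- unname(which.max(leverage))
most_influential

# Compare R-squared with and without the aggregate row.
r2 <- c(
  all_rows   = summary(fit_all)$r.squared,
  teams_only = summary(fit_teams)$r.squared
)
r2
```

In this toy case the total row has the highest leverage and pushes R-squared up. Refitting the real model on team rows only would show whether the same thing is happening with the 0.9987 reported above.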

Conclusion

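The regression is consistent with the idea that MLB teams that hit more home runs usually score more runs: the home run coefficient is positive and statistically significant. However, home runs are not the only important factor. On-base percentage and slugging percentage also have significant coefficients in the model, and OPS was plotted against runs as a combined measure of the two. These findings are tentative, because the yearly averages and the correlation returned NA and the data still appear to include league-level rows alongside the 30 teams.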

Overall, the analysis suggests that home runs are important, but they work best as part of a balanced offense.

References

Baseball-Reference
https://www.baseball-reference.com