I am looking at whether MLB teams that hit more home runs also score more runs. This is interesting because modern baseball often focuses on power hitting, but teams can also score by getting on base, hitting doubles, drawing walks, and being efficient on offense overall. As a Cubs fan, I have the impression that the Cubs don't rely on the home run ball, but is that true?
Data Collection
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.1 ✔ readr 2.2.0
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(httr)
library(rvest)
Attaching package: 'rvest'
The following object is masked from 'package:readr':
guess_encoding
library(lubridate)
library(magrittr)
Attaching package: 'magrittr'
The following object is masked from 'package:purrr':
set_names
The following object is masked from 'package:tidyr':
extract
Warning: There were 11 warnings in `mutate()`.
The first warning was:
ℹ In argument: `G = as.numeric(G)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 10 remaining warnings.
      year           Tm                  G                R
 Min.   :2020   Length:165         Min.   :  58.0   Min.   :  219.0
 1st Qu.:2021   Class :character   1st Qu.: 162.0   1st Qu.:  619.5
 Median :2022   Mode  :character   Median : 162.0   Median :  698.5
 Mean   :2022                      Mean   : 269.8   Mean   : 1206.6
 3rd Qu.:2023                      3rd Qu.: 162.0   3rd Qu.:  753.5
 Max.   :2024                      Max.   :4860.0   Max.   :22432.0
                                   NA's   :5        NA's   :5

       H               HR              RBI                BB
 Min.   :  390   Min.   :  51.0   Min.   :  204.0   Min.   :  147.0
 1st Qu.: 1232   1st Qu.: 138.8   1st Qu.:  590.2   1st Qu.:  429.8
 Median : 1306   Median : 176.5   Median :  667.0   Median :  492.5
 Mean   : 2215   Mean   : 315.0   Mean   : 1153.3   Mean   :  857.6
 3rd Qu.: 1374   3rd Qu.: 205.0   3rd Qu.:  724.5   3rd Qu.:  548.2
 Max.   :40839   Max.   :5944.0   Max.   :21512.0   Max.   :15819.0
 NA's   :5       NA's   :5        NA's   :5         NA's   :5

       SO              BA              OBP              SLG
 Min.   :  440   Min.   :0.2120   Min.   :0.2780   Min.   :0.3400
 1st Qu.: 1218   1st Qu.:0.2377   1st Qu.:0.3090   1st Qu.:0.3877
 Median : 1370   Median :0.2440   Median :0.3165   Median :0.4055
 Mean   : 2308   Mean   :0.2444   Mean   :0.3163   Mean   :0.4070
 3rd Qu.: 1453   3rd Qu.:0.2540   3rd Qu.:0.3250   3rd Qu.:0.4243
 Max.   :42145   Max.   :0.2760   Max.   :0.3490   Max.   :0.5010
 NA's   :5       NA's   :5        NA's   :5        NA's   :5

      OPS          runs_per_game    hr_per_game
 Min.   :0.6180   Min.   :3.130   Min.   :0.679
 1st Qu.:0.6987   1st Qu.:4.173   1st Qu.:1.023
 Median :0.7195   Median :4.472   Median :1.173
 Mean   :0.7234   Mean   :4.494   Mean   :1.182
 3rd Qu.:0.7500   3rd Qu.:4.787   3rd Qu.:1.323
 Max.   :0.8450   Max.   :5.846   Max.   :1.967
 NA's   :5        NA's   :5       NA's   :5
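Two things in this summary are worth a closer look. The maximum of G is 4860, which equals 30 × 162, so the data probably contains rows that are not individual teams (for example a league-total row). The 5 NAs in every numeric column, together with the coercion warnings, hint at non-numeric rows such as repeated header lines. The following is a minimal sketch of one way to drop such rows before converting to numbers. It assumes the scrape yields an all-character table; the toy `raw` values and the labels in `non_team_labels` are assumptions for illustration, not taken from the actual scrape.

```r
library(dplyr)

# Toy stand-in for one scraped season; all values are made up for illustration.
raw <- tibble(
  year = 2024,
  Tm   = c("Team A", "Team B", "Tm", "League Average", ""),
  G    = c("162", "162", "G", "162", "324"),
  R    = c("700", "650", "R", "675", "1350"),
  HR   = c("200", "150", "HR", "175", "350")
)

# Assumed labels for rows that are not individual teams
# (repeated header, league average, unlabeled total).
non_team_labels <- c("Tm", "League Average", "")

teams <- raw %>%
  filter(!Tm %in% non_team_labels) %>%          # drop header and aggregate rows first
  mutate(across(c(G, R, HR), as.numeric)) %>%   # coercion now produces no NAs
  mutate(
    runs_per_game = R / G,
    hr_per_game   = HR / G
  )

teams
```

Filtering before `as.numeric()` means the coercion warnings disappear for the right reason: the rows that could not be converted are gone, rather than silently turned into NA.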
Analysis
mlb <- read_csv("mlb_batting_2020_2024.csv")
Rows: 165 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): Tm
dbl (14): year, G, R, H, HR, RBI, BB, SO, BA, OBP, SLG, OPS, runs_per_game, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Analysis and Results
## Summary Statistics by Year
year_summary <- mlb %>%
  group_by(year) %>%
  summarize(
    average_runs = mean(R),
    average_hr   = mean(HR),
    average_obp  = mean(OBP),
    average_slg  = mean(SLG),
    average_ops  = mean(OPS)
  )
year_summary
# A tibble: 5 × 6
   year average_runs average_hr average_obp average_slg average_ops
  <dbl>        <dbl>      <dbl>       <dbl>       <dbl>       <dbl>
1  2020           NA         NA          NA          NA          NA
2  2021           NA         NA          NA          NA          NA
3  2022           NA         NA          NA          NA          NA
4  2023           NA         NA          NA          NA          NA
5  2024           NA         NA          NA          NA          NA
This table is meant to show the average offensive production for MLB teams by season, so that run scoring and home run production can be compared across the five-year period. Every average is NA here because each season contains a row with missing values, and mean() returns NA unless na.rm = TRUE is set.
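The all-NA table comes from how `mean()` treats missing values. Here is a small sketch of the same summary with `na.rm = TRUE`, run on made-up numbers rather than the real file; the `toy` data frame and the `n_missing` column are illustrative additions, not part of the original analysis.

```r
library(dplyr)

# Made-up rows: each year has one row with missing values.
toy <- tibble(
  year = c(2023, 2023, 2023, 2024, 2024, 2024),
  R    = c(700, 650, NA, 720, 680, NA),
  HR   = c(200, 150, NA, 210, 170, NA)
)

year_summary_toy <- toy %>%
  group_by(year) %>%
  summarize(
    average_runs = mean(R, na.rm = TRUE),    # ignore missing values
    average_hr   = mean(HR, na.rm = TRUE),
    n_missing    = sum(is.na(R))             # record how many rows were skipped
  )

year_summary_toy
```

Counting the skipped rows alongside the averages makes it easier to notice when missing values point to a data problem rather than a few harmless gaps.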
## Average Runs by Year
mlb %>%
  group_by(year) %>%
  summarize(average_runs = mean(R)) %>%
  ggplot(aes(x = factor(year), y = average_runs)) +
  geom_col() +
  labs(
    title = "Average Runs Scored by MLB Teams by Year",
    x = "Year",
    y = "Average Runs"
  )
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_col()`).
This bar chart is meant to show how average team scoring changed from 2020 to 2024. Because every yearly average is NA, geom_col() drops all five bars, which is what the warning reports.
## Average Home Runs by Year
mlb %>%
  group_by(year) %>%
  summarize(average_hr = mean(HR)) %>%
  ggplot(aes(x = factor(year), y = average_hr)) +
  geom_col() +
  labs(
    title = "Average Home Runs by MLB Teams by Year",
    x = "Year",
    y = "Average Home Runs"
  )
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_col()`).
This graph is meant to show whether home run totals increased or decreased across seasons. As with the runs chart, all five yearly averages are NA, so no bars are drawn.
## Home Runs vs Runs Scored
ggplot(mlb, aes(x = HR, y = R)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Relationship Between Home Runs and Runs Scored",
    x = "Home Runs",
    y = "Runs Scored"
  )
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).
This scatterplot compares home runs and runs scored for each row of the data, with a fitted linear trend line. The 5 rows with missing values are dropped from both layers.
## Correlation Between Home Runs and Runs
cor(mlb$HR, mlb$R)
[1] NA
The correlation measures the strength of the linear relationship between home runs and runs scored. It is NA here because both columns contain missing values, and cor() returns NA by default in that case.
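The `use` argument of `cor()` controls how missing values are handled. A short sketch on made-up vectors (the names `hr` and `runs` and their values are illustrative only):

```r
# Made-up values for illustration only.
hr   <- c(150, 180, 200, NA, 220)
runs <- c(650, 700, 740, NA, 790)

# Default behaviour: any missing value makes the result NA.
cor(hr, runs)

# Use only the rows where both values are present.
r <- cor(hr, runs, use = "complete.obs")
r
```

Applied to the report's data, the equivalent call would be `cor(mlb$HR, mlb$R, use = "complete.obs")`, ideally after confirming that the remaining rows are all individual teams.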
## OPS vs Runs Scored
ggplot(mlb, aes(x = OPS, y = R)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    title = "Relationship Between OPS and Runs Scored",
    x = "OPS",
    y = "Runs Scored"
  )
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).
Call:
lm(formula = R ~ HR + OBP + SLG, data = mlb)

Residuals:
    Min      1Q  Median      3Q     Max
-789.40  -42.89    4.35   41.96  786.49

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.759e+01  2.481e+02  -0.272  0.78563
HR           3.838e+00  1.088e-02 352.817  < 2e-16 ***
OBP          4.215e+03  1.338e+03   3.150  0.00196 **
SLG         -3.115e+03  6.354e+02  -4.902 2.36e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 119.5 on 156 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.9987,  Adjusted R-squared:  0.9987
F-statistic: 4.149e+04 on 3 and 156 DF,  p-value: < 2.2e-16
The regression model examines whether home runs, on-base percentage, and slugging percentage help explain runs scored. HR and OBP have positive, statistically significant coefficients, while the SLG coefficient is negative once HR is in the model.
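The residuals above range from about -789 to 786 while the middle half sit within roughly ±43, which suggests a few extreme rows have a large influence on the fit. Season length also varies (the minimum of G is 58), so raw totals are not directly comparable across years. One possible variation, sketched below on synthetic data, is to keep only rows with at most 162 games and model per-game rates instead. The `toy` data frame, its coefficients, and the `G <= 162` filter are assumptions for illustration; this is not the model fitted in the report.

```r
set.seed(1)

# Synthetic data for illustration only; not the MLB data from this report.
n <- 60
toy <- data.frame(
  G           = rep(162, n),
  hr_per_game = runif(n, 0.7, 1.9),
  OBP         = runif(n, 0.28, 0.35)
)
toy$runs_per_game <- 1.5 * toy$hr_per_game + 9 * toy$OBP + rnorm(n, sd = 0.2)

# Append one aggregate-style row with far more than 162 games.
toy <- rbind(
  toy,
  data.frame(G = 4860, hr_per_game = 1.2, OBP = 0.31, runs_per_game = 4.5)
)

# Keep only rows that can be a single team season (assumed rule: G <= 162).
team_rows <- subset(toy, G <= 162)

# Model per-game scoring from per-game home runs and on-base percentage.
fit <- lm(runs_per_game ~ hr_per_game + OBP, data = team_rows)
summary(fit)
```

Per-game rates put the shortened 2020 season on the same scale as the full seasons, and `runs_per_game` and `hr_per_game` already exist as columns in the report's data.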
Conclusion
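The regression supports the idea that MLB teams that hit more home runs usually score more runs: in the fitted model the HR coefficient is positive (about 3.8) and highly significant. However, home runs are not the only important factor. On-base percentage also has a positive, significant coefficient, while slugging percentage has a negative coefficient once home runs are accounted for. The yearly averages and the correlation came back as NA because of missing values, so they do not add evidence either way.
Overall, the analysis suggests that home runs are important, but they work best as part of a balanced offense.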