Intro to Data Science Final Project

Author

Andrew Weiser

Start by loading the necessary packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gapminder)
library(readr)
library(ggplot2)

Read the datasets

gameinfo <- read_csv("2025gameinfo.csv")
Rows: 2478 Columns: 43
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (27): gid, visteam, hometeam, site, daynight, fieldcond, precip, sky, w...
dbl  (12): date, number, innings, tiebreaker, timeofgame, attendance, temp, ...
lgl   (3): usedh, htbf, forfeit
time  (1): starttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
batting <- read_csv("2025batting.csv")
Rows: 73092 Columns: 39
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): gid, id, team, stattype, site, vishome, opp, gametype, box, pbp
dbl (29): b_lp, b_seq, b_pa, b_ab, b_r, b_h, b_d, b_t, b_hr, b_rbi, b_sh, b_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pitching <- read_csv("2025pitching.csv")
Rows: 21368 Columns: 42
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): gid, id, team, stattype, site, vishome, opp, gametype, box, pbp
dbl (31): p_seq, p_ipouts, p_noout, p_bfp, p_h, p_d, p_t, p_hr, p_r, p_er, p...
lgl  (1): p_cg

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
teamstats <- read_csv("2025teamstats.csv")
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
Rows: 4956 Columns: 111
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (29): gid, team, mgr, stattype, start_l1, start_l2, start_l3, start_l4, ...
dbl (66): inn1, inn2, inn3, inn4, inn5, inn6, inn7, inn8, inn9, inn10, inn11...
lgl (16): inn13, inn14, inn15, inn16, inn17, inn18, inn19, inn20, inn21, inn...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
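
The parsing warning on teamstats is worth a look before relying on that table. The column summary lists inn13 and the later innings as logical, which suggests (an assumption, since the `problems()` output is not shown here) that readr guessed the type from blank early rows and then ran into numbers in extra-inning games. Below is a minimal sketch on made-up data that declares the type up front and then checks `problems()`. The CSV text and the object name `toy` are hypothetical.

```r
library(readr)

# Made-up stand-in for the extra-inning columns: inn13 is blank until the last row.
csv_text <- "gid,inn9,inn13\nA,1,\nB,0,\nC,2,3\n"

toy <- read_csv(
  I(csv_text),
  col_types = cols(inn13 = col_double()),  # declare the type instead of letting readr guess
  show_col_types = FALSE
)

# An empty problems() table means every value parsed as declared.
problems(toy)
toy
```

The same `col_types` idea could be applied to the real file by naming whichever columns `problems(teamstats)` actually reports.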

Introduction

This analysis will look into how time of day and the weather affected player performance throughout the 2025 MLB season.

A hobby of mine is sports betting, specifically baseball. I always knew that the weather played a large role in player performance and my hope is that I will be able to see how different players perform in different conditions. By analyzing how players perform in different conditions, I can potentially make more accurate bets on player props.

Data Analysis

Power Hitters

Let’s start with the 50 baseball games with the most homeruns from the 2025 season. We will also look at the temperature of each game to see if there is a relationship between homeruns and temperature.

top50 <- teamstats |>
  group_by(gid, date) |>
  summarise(total_hr = sum(b_hr, na.rm = TRUE), .groups = "drop") |>
  inner_join(gameinfo |> select(gid, temp), by = "gid") |>
  select(date, gid, temp, total_hr) |>
  arrange(desc(total_hr)) |>
  slice_head(n = 50)

Now that we have our top 50 games listed, let’s make a scatter plot to determine if there is a correlation between the temperature and homeruns hit.

ggplot(top50, aes(x = temp, y = total_hr)) +
  geom_point(size = 2, alpha = 0.7) +
  coord_cartesian(xlim = c(30, max(top50$temp, na.rm = TRUE))) +
  labs(title = "Top 50 Home Run Games by Temperature",
       x = "Temperature (°F)",
       y = "Total Home Runs per Game") +
  theme_minimal()

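Wow, that sure is interesting, right? Actually, not really. This plot suggests a link between temperature and homeruns: 49 of the 50 games with the most homeruns were played at over 60 degrees. Because much of the season is played in warm months, the top 50 alone is suggestive rather than conclusive. Still, why might warm games produce more homeruns?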

Simple explanation: cold air is denser than warm air, so a baseball traveling through cold air meets more drag and does not carry as far. This is likely the main reason we see more homeruns in warm air.
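
To check whether the pattern holds beyond the 50 highest-homerun games, one option is to compute the correlation between temperature and homeruns across every game. Here is a minimal sketch on a made-up per-game table. The `games` data below is hypothetical and only stands in for the teamstats and gameinfo join used above.

```r
library(dplyr)

# Hypothetical per-game table: one row per game with temperature and total homeruns.
games <- tibble(
  temp     = c(48, 55, 61, 68, 72, 79, 84, 90, NA),
  total_hr = c(1, 0, 2, 2, 3, 2, 4, 5, 3)
)

hr_temp <- games |>
  summarise(
    n_games = sum(!is.na(temp)),                    # games with a recorded temperature
    r = cor(temp, total_hr, use = "complete.obs")   # Pearson correlation, skipping missing temps
  )

hr_temp
```

A value of `r` near 0 on the real data would mean the top-50 plot overstates the relationship, while a clearly positive value would support it.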

Since this shows that more homeruns are hit in warmer weather, let’s turn towards player performance.

Let’s take a look at the 250 hottest games and see who hit the most homeruns.

hottest250 <- gameinfo |>
  arrange(desc(temp)) |>
  slice_head(n = 250) |>
  select(gid, date, temp)
hottest250
# A tibble: 250 × 3
   gid              date  temp
   <chr>           <dbl> <dbl>
 1 BAL202506230 20250623   100
 2 BAL202506240 20250624   100
 3 ATH202507100 20250710    99
 4 COL202506200 20250620    98
 5 COL202506210 20250621    98
 6 NYN202506240 20250624    97
 7 BAL202507280 20250728    97
 8 BAL202507291 20250729    97
 9 KCA202507290 20250729    97
10 BAL202507300 20250730    97
# ℹ 240 more rows

This shows that the 250 hottest games were all played at 86 degrees or higher.
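
One detail of this selection is that `arrange()` followed by `slice_head(n = 250)` cuts off at exactly 250 rows, so games tied at the cutoff temperature are included or excluded by row order alone. Here is a sketch of an alternative using dplyr's `slice_max()`, which keeps ties by default. The `toy_games` table is made up for illustration.

```r
library(dplyr)

# Hypothetical game temperatures; two games are tied at the cutoff.
toy_games <- tibble(
  gid  = c("G1", "G2", "G3", "G4", "G5"),
  temp = c(100, 97, 95, 95, 90)
)

# slice_max() keeps ties by default, so asking for 3 games returns 4 here.
hottest3 <- toy_games |>
  slice_max(temp, n = 3, with_ties = TRUE)

# The lowest temperature kept is the cutoff for the "hottest" group.
min(hottest3$temp)
```

Whether ties should be kept is a design choice: keeping them makes the cutoff temperature well defined, at the cost of a group that may be slightly larger than 250.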

batting |>
  inner_join(hottest250 |> select(gid), by = "gid") |>
  group_by(id) |>
  summarise(total_hr = sum(b_hr, na.rm = TRUE), .groups = "drop") |>
  arrange(desc(total_hr))
# A tibble: 1,103 × 2
   id       total_hr
   <chr>       <dbl>
 1 camij001       15
 2 diazy001       11
 3 judga001       10
 4 loweb001       10
 5 pasqv001        9
 6 jansd001        7
 7 monim001        7
 8 adelj001        6
 9 grist001        6
10 kurtn001        6
# ℹ 1,093 more rows

This analysis shows that in those 250 hottest games, 4 players hit 10 or more homeruns. What is important to note is that 3 of these players, camij001 (Junior Caminero), diazy001 (Yandy Diaz), and loweb001 (Brandon Lowe), all played for the Tampa Bay Rays. In a normal year, the Rays play in a dome; however, in 2024, Hurricane Milton tore the roof off Tropicana Field, home of the Tampa Bay Rays. This forced the Rays to play their 2025 home games outdoors at Steinbrenner Field in Tampa, Florida. Florida is notoriously hot, especially in the summer, so these players played in high temperatures more often than the average baseball player. jansd001, who hit 7 homeruns, played for both the Tampa Bay Rays and Milwaukee Brewers in 2025. We will not analyze any Tampa Bay Rays players for this analysis.

The other top homerun hitter in the hottest 250 games was Aaron Judge, one of the best power hitters in baseball. Because he is one of the best power hitters, his odds of hitting a homerun on any given day are significantly higher than the average MLB player's.

Instead, let’s focus on players who hit 6 or more homeruns in those 250 hottest games and fewer than 45 homeruns overall in 2025: pasqv001 (Vinnie Pasquantino, Kansas City Royals), monim001 (Mickey Moniak, Rockies), adelj001 (Jo Adell, Angels), grist001 (Trent Grisham, Yankees), kurtn001 (Nick Kurtz, Athletics), laurr001 (Ramon Laureano, Orioles/Padres), martk001 (Ketel Marte, Diamondbacks), peres002 (Salvador Perez, Kansas City Royals), suare001 (Eugenio Suarez, Mariners), wittb002 (Bobby Witt Jr, Royals).
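
Raw homerun totals mix two things: how often a player came to bat in hot games and how good he is at hitting homeruns in general. One way to separate them is to compare each player's homerun rate per plate appearance in hot games against his rate in all other games. The sketch below uses made-up rows and a hypothetical `hot` flag. On the real data, the flag could be built by checking whether a game's gid appears in hottest250.

```r
library(dplyr)

# Hypothetical batting lines: one row per player-game, flagged hot or not.
toy_batting <- tibble(
  id   = c("aaaa001", "aaaa001", "aaaa001", "bbbb001", "bbbb001", "bbbb001"),
  hot  = c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE),
  b_pa = c(4, 5, 4, 4, 4, 5),
  b_hr = c(1, 0, 1, 0, 1, 0)
)

# Homeruns per plate appearance, split by player and by hot vs. other games.
hr_rates <- toy_batting |>
  group_by(id, hot) |>
  summarise(pa = sum(b_pa), hr = sum(b_hr), .groups = "drop") |>
  mutate(hr_per_pa = hr / pa)

hr_rates
```

A player whose hot-game rate is well above his own rate in other games is a more interesting betting candidate than one who simply played many hot games.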

Let’s make some more plots. All players listed above are known homerun hitters. Let’s make a box and whisker plot to get a better idea of the temperatures at which these players hit most of their homeruns.

# IDs must match the recode() keys below (peres002 and wittb002, not 001)
players <- c(
  "pasqv001","monim001","adelj001","grist001","kurtn001",
  "laurr001","martk001","peres002","suare001","wittb002"
)
batting |>
  filter(id %in% players, b_hr > 0) |>
  inner_join(gameinfo |> select(gid, temp), by = "gid") |>
  mutate(
    name = recode(id,
      "pasqv001" = "Vinnie Pasquantino",
      "monim001" = "Mickey Moniak",
      "adelj001" = "Jo Adell",
      "grist001" = "Trent Grisham",
      "kurtn001" = "Nick Kurtz",
      "laurr001" = "Ramon Laureano",
      "martk001" = "Ketel Marte",
      "peres002" = "Salvador Perez",
      "suare001" = "Eugenio Suarez",
      "wittb002" = "Bobby Witt Jr"
    )
  ) |>
  ggplot(aes(x = name, y = temp)) +
  geom_boxplot() +
  labs(
    title = "Game Temperatures When Players Hit Home Runs (2025)",
    x = "Player",
    y = "Temperature (°F)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
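
Before reading the plot, it can be worth confirming that every requested player ID actually exists in the data, because `filter(id %in% players)` silently drops an ID that matches nothing. Below is a minimal sketch with base R's `setdiff()`. The vectors are hypothetical, and on the real data `known_ids` would be `unique(batting$id)`.

```r
# Hypothetical request list; the last ID is deliberately fake.
players_check <- c("pasqv001", "peres002", "zzzzz999")

# Stand-in for unique(batting$id).
known_ids <- c("pasqv001", "peres002", "wittb002")

# IDs that were requested but are not present in the data.
missing_ids <- setdiff(players_check, known_ids)
missing_ids
```

An empty result means every player in the list will contribute rows to the plot.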

The first things that stand out with this plot are the smaller boxes on Ketel Marte and Eugenio Suarez. Both players play their home games in a ballpark with a retractable roof, so many games are played indoors.

The other big thing to notice is that most of these players’ median temperature for homeruns is around 80 degrees, with the exception of Trent Grisham, whose median is much lower. The code below gives us his median temperature for homeruns.

batting |>
  filter(id == "grist001", b_hr > 0) |>
  inner_join(gameinfo |> select(gid, temp), by = "gid") |>
  summarise(median_temp = median(temp, na.rm = TRUE))
# A tibble: 1 × 1
  median_temp
        <dbl>
1          74

With the bottom of Trent Grisham’s box at around 70 degrees and his median at 74 degrees, about 25% of his homeruns came between 70 and 74 degrees. The plot also suggests that all of his homeruns came at or below about 91 degrees, with the top 25% falling in the tight range of 83 to 91 degrees. The same can be said of Jo Adell: about 25% of his homeruns came between about 83 and 89 degrees, and none came above about 89 degrees.
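
The box plot reading above relies on quartiles: the bottom of the box is the 25th percentile, the line inside is the median, and the top is the 75th percentile, so roughly a quarter of the games fall between each neighboring pair. Rather than estimating these by eye, they can be computed with `quantile()`. The temperatures below are made up for illustration and are not Grisham's actual games.

```r
# Hypothetical temperatures of games in which a player homered.
hr_temps <- c(60, 70, 74, 83, 91)

# 25th percentile, median, and 75th percentile: the three lines of the box.
box_stats <- quantile(hr_temps, probs = c(0.25, 0.5, 0.75))
box_stats
```

On the real data, the same call could be run on the `temp` column after filtering batting to one player's homerun games.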

An intriguing player on this plot is Mickey Moniak. He hit 24 homeruns in 2025, which is not many compared to the other players on this list. The plot shows that he has the highest variability in temperatures, which makes sense, as Colorado temperatures fluctuate widely throughout the year. From this plot, I would never bet on Mickey Moniak to hit a homerun on temperature alone.

Baseball is not made up of just power hitters. Let’s take a look at temperature effects on contact hitters.

Contact Hitters

Similar to homeruns, we have hits. Many players in baseball are not known as homerun hitters, but rather for hitting for a high batting average.

Let’s start by making a plot to determine if there is a correlation between temperature and hits.

library(dplyr)
library(ggplot2)

batting |>
  inner_join(gameinfo |> select(gid, temp), by = "gid") |>
  mutate(temp_bin = cut(temp, breaks = seq(30, 110, by = 5))) |>
  group_by(temp_bin) |>
  summarise(avg_hits = mean(b_h, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = temp_bin, y = avg_hits, group = 1)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Average Hits by Temperature Range",
    x = "Temperature (°F bins)",
    y = "Average Hits (b_h)"
  )

This plot shows a clear increase in player hits per game when temperatures reach 85 degrees.
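
A spike in one temperature bin is easier to trust when that bin holds plenty of observations, and the hottest bins usually hold the fewest. Here is a sketch of adding a row count next to each bin's average, using the same `cut()` breaks as the plot. The `toy_lines` data is made up.

```r
library(dplyr)

# Hypothetical batting lines joined to game temperature.
toy_lines <- tibble(
  temp = c(62, 64, 88, 89, 90, 101),
  b_h  = c(1, 0, 2, 1, 1, 3)
)

# Same 5-degree bins as the plot, with the number of rows behind each average.
bin_summary <- toy_lines |>
  mutate(temp_bin = cut(temp, breaks = seq(30, 110, by = 5))) |>
  group_by(temp_bin) |>
  summarise(n_lines = n(), avg_hits = mean(b_h), .groups = "drop")

bin_summary
```

Bins with only a handful of rows could then be dropped or flagged before drawing conclusions from the line plot.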

Let’s now take another look at those top 250 hottest games and see which players had the most hits during those games.

batting |>
  inner_join(hottest250 |> select(gid), by = "gid") |>
  group_by(id) |>
  summarise(total_h = sum(b_h, na.rm = TRUE), .groups = "drop") |>
  arrange(desc(total_h))
# A tibble: 1,103 × 2
   id       total_h
   <chr>      <dbl>
 1 diazy001      61
 2 camij001      60
 3 loweb001      43
 4 aranj001      40
 5 simpc001      37
 6 lowej002      36
 7 winnm001      34
 8 burla001      33
 9 mangj001      32
10 olsom001      31
# ℹ 1,093 more rows

As with homeruns, the top 5 players are all Tampa Bay Rays, so we will ignore them and all other Rays players. We will analyze the following 8 players: winnm001 (Masyn Winn, St. Louis Cardinals), burla001 (Alec Burleson, St. Louis Cardinals), olsom001 (Matt Olson, Atlanta Braves), beckj003 (Jordan Beck, Colorado Rockies), bellc002 (Cody Bellinger, New York Yankees), delae003 (Elly De La Cruz, Cincinnati Reds), wittb002 (Bobby Witt Jr, Kansas City Royals), hendg002 (Gunnar Henderson, Baltimore Orioles).

We will see how these players performed in the top 250 hottest games this year.

players <- c(
  "winnm001","burla001","olsom001","beckj003","bellc002",
  "delae003","wittb002","hendg002"
)
hot250 <- gameinfo |>
  arrange(desc(temp)) |>
  slice_head(n = 250) |>
  select(gid)

batting |>
  filter(id %in% players) |>
  inner_join(hot250, by = "gid") |>
  mutate(
    name = recode(id,
      "winnm001" = "Masyn Winn",
      "burla001" = "Alec Burleson",
      "olsom001" = "Matt Olson",
      "beckj003" = "Jordan Beck",
      "bellc002" = "Cody Bellinger",
      "delae003" = "Elly De La Cruz",
      "wittb002" = "Bobby Witt Jr",
      "hendg002" = "Gunnar Henderson"
    )
  ) |>
  group_by(name) |>
  summarise(avg_hits = mean(b_h, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = reorder(name, avg_hits), y = avg_hits)) +
  geom_segment(aes(xend = name, yend = 0), color = "gray70") +
  geom_point(size = 3, color = "steelblue") +
  coord_flip() +
  labs(
    title = "Average Hits in 250 Hottest Games (2025)",
    x = "Player",
    y = "Average Hits per Game (b_h)"
  ) +
  theme_minimal()

This plot shows that these players average around 1 or more hits per game in the hottest 250 games of the year. It may be worth betting on these players to record hits during hot games.

Let’s take a deeper dive into both Matt Olson and Cody Bellinger. Let’s take a look at their batting average in games over 90 degrees.

players <- c("olsom001", "bellc002")

gameinfo90 <- gameinfo |>
  filter(temp > 90) |>
  select(gid, temp)

batting |>
  filter(id %in% players) |>
  inner_join(gameinfo90, by = "gid") |>
  group_by(id) |>
  summarise(
    total_hits = sum(b_h, na.rm = TRUE),
    total_ab = sum(b_ab, na.rm = TRUE),
    batting_avg = total_hits / total_ab
  )
# A tibble: 2 × 4
  id       total_hits total_ab batting_avg
  <chr>         <dbl>    <dbl>       <dbl>
1 bellc002          8       26       0.308
2 olsom001         13       38       0.342

This shows that both Matt Olson and Cody Bellinger hit over .300 in these games, which is very good in Major League Baseball. Matt Olson’s .342 average is phenomenal, though it comes from only 38 at bats. He may be a good bet to record a hit on days over 90 degrees, which happen quite often in Atlanta, Georgia.
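
With so few at bats, it helps to see how much uncertainty surrounds an average like this. A rough sketch uses base R's `binom.test()` to put a 95% interval around Olson's 13 hits in 38 at bats. Treating each at bat as an independent trial with the same chance of a hit is a simplifying assumption, not something the data guarantees.

```r
# Olson's line from the table above: 13 hits in 38 at bats.
hits <- 13
at_bats <- 38

# Exact binomial 95% interval; independence of at bats is assumed for simplicity.
ci <- binom.test(hits, at_bats)$conf.int

round(c(estimate = hits / at_bats, lower = ci[1], upper = ci[2]), 3)
```

A wide interval is a reminder that a hot-weather average built on a few dozen at bats can move a lot with just a handful of extra hits or outs.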

Let’s now take a look at all games over 90 degrees, limited to players with a minimum of 20 at bats in those games. This can help us see who performs best in hot weather.

gameinfo90 <- gameinfo |>
  filter(temp > 90) |>
  select(gid)

batting |>
  inner_join(gameinfo90, by = "gid") |>
  filter(team != "TB") |>   # matches no rows here (diazy001 of the Rays still appears below); Retrosheet's code is likely "TBA"
  group_by(id) |>
  summarise(
    total_hits = sum(b_h, na.rm = TRUE),
    total_ab = sum(b_ab, na.rm = TRUE),
    batting_avg = total_hits / total_ab,
    .groups = "drop"
  ) |>
  filter(total_ab >= 20) |>
  arrange(desc(batting_avg))
# A tibble: 92 × 4
   id       total_hits total_ab batting_avg
   <chr>         <dbl>    <dbl>       <dbl>
 1 martk001         15       26       0.577
 2 lopej008         13       25       0.52 
 3 domij003         10       20       0.5  
 4 freef001         13       27       0.481
 5 guerv002         18       40       0.45 
 6 peres002         10       24       0.417
 7 diazy001         18       44       0.409
 8 hoern001         10       25       0.4  
 9 perdg001         10       25       0.4  
10 westj002         16       40       0.4  
# ℹ 82 more rows

Wow! Those are some high averages. Yandy Diaz was a member of the Rays, so we will exclude him by ID moving forward. Let’s narrow this down to the top 9.

gameinfo90 <- gameinfo |>
  filter(temp > 90) |>
  select(gid)

top9 <- batting |>
  inner_join(gameinfo90, by = "gid") |>
  filter(team != "TBA") |>   # Retrosheet-style team code for the Rays (assumed from the gid prefixes)
  filter(id != "diazy001") |>
  group_by(id) |>
  summarise(
    total_hits = sum(b_h, na.rm = TRUE),
    total_ab = sum(b_ab, na.rm = TRUE),
    batting_avg = total_hits / total_ab,
    .groups = "drop"
  ) |>
  filter(total_ab >= 20) |>
  arrange(desc(batting_avg)) |>
  slice_head(n = 9)

top9
# A tibble: 9 × 4
  id       total_hits total_ab batting_avg
  <chr>         <dbl>    <dbl>       <dbl>
1 martk001         15       26       0.577
2 lopej008         13       25       0.52 
3 domij003         10       20       0.5  
4 freef001         13       27       0.481
5 guerv002         18       40       0.45 
6 peres002         10       24       0.417
7 hoern001         10       25       0.4  
8 perdg001         10       25       0.4  
9 westj002         16       40       0.4  

Now let’s plot this for better visualization.

top9 <- top9 |>
  mutate(
    name = recode(id,
      "martk001" = "Ketel Marte",
      "lopej008" = "Joey Loperfido",
      "domij003" = "Jasson Domínguez",
      "freef001" = "Freddie Freeman",
      "guerv002" = "Vladimir Guerrero Jr.",
      "peres002" = "Salvador Perez",
      "hoern001" = "Nico Hoerner",
      "perdg001" = "Geraldo Perdomo",
      "westj002" = "Jordan Westburg"
    )
  )

ggplot(top9, aes(x = reorder(name, batting_avg), y = batting_avg)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  scale_y_continuous(labels = scales::label_number(accuracy = 0.001)) +
  labs(
    title = "Top Batting Averages in 90°F+ Games (Min 20 AB)",
    x = "Player",
    y = "Batting Average"
  ) +
  theme_minimal()
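
Hand-typed ID-to-name mappings are easy to get wrong, and `recode()` will not complain if a name is attached to the wrong ID. An alternative sketch keeps the names in a small lookup table and attaches them with `left_join()`, falling back to the raw ID when no name is known. Both tables below are hypothetical, and in practice the lookup could come from a player registry file, which is an assumption rather than something used in this project.

```r
library(dplyr)

# Hypothetical lookup table of player IDs and display names.
player_names <- tibble(
  id   = c("martk001", "peres002"),
  name = c("Ketel Marte", "Salvador Perez")
)

# Hypothetical leaderboard; the last ID has no entry in the lookup.
toy_top <- tibble(
  id          = c("martk001", "peres002", "hoern001"),
  batting_avg = c(0.577, 0.417, 0.400)
)

labelled <- toy_top |>
  left_join(player_names, by = "id") |>
  mutate(name = coalesce(name, id))  # show the raw ID instead of NA when unmatched

labelled
```

Any row that still shows a raw ID after the join is a prompt to look the player up rather than guess.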

This plot illustrates how well certain players perform in hotter weather. Ketel Marte is one of the best second basemen in baseball, so seeing him excel in certain conditions is not a surprise, but a nearly .600 batting average in 90-degree weather is simply insane, even allowing for only 26 at bats. Joey Loperfido is not a big name, so seeing how well he performs in hot weather games is intriguing. He may be a popular player to place bets on this summer. The same can be said about the rest of the players.

Conclusion

After an analysis of 2025 player performance in warm weather, it is clear that some players excel in warmer weather. Warm weather can influence homerun hitting, since warmer air is less dense than colder air. It also appears to influence players’ ability to get hits, as there was a large spike in hits per game once the temperature hit 85 degrees.

Many of the players highlighted here excelled in warmer weather and may continue those performances in the 2026 season. I’m hoping that between the analysis conducted in this project and more analysis down the line, I will be able to place more accurate bets on players to record hits and homeruns. I plan on doing further analysis on pitcher performance in warmer weather on my own time.

Thank you.