The game of baseball has been near and dear to my heart for as long as I can remember. From playing all throughout school to watching daily over the summer, baseball will always be a part of my life. For my final project, I used a dataset from Kaggle with a wide variety of statistics for the games played in the 2016 MLB season. I decided to focus on weather and its impact on the total runs scored in a game. Before starting this analysis, it’s important to note two caveats. Some teams play in dome stadiums, so weather has no impact on those games. Additionally, baseball is played mainly over the summer, when the temperature is fairly consistent. First I will analyze the dataset as a whole to uncover any noticeable trends. Afterwards, I will look at a more specific group of teams to look for additional patterns.

For more information on the data set used in this project, see the Baseball Reference Kaggle Dataset.

Load Required Packages

library(tidyverse)
library(tidytext)
library(readr)
baseball_reference_2016_clean <- read_csv("/Users/michaelfaccibene/Documents/UpperClassman/Junior Year/Spring/MEA 329/baseball_reference_2016_clean.csv")
## Warning: Missing column names filled in: 'X1' [1]
# Flag April games: dates are stored as m/d/yy, so April dates start with "4"
baseball_reference_2016_clean <- baseball_reference_2016_clean %>%
  mutate(april = str_detect(date, "^4"))
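The `"^4"` pattern works because April is the only month whose number starts with 4 in this date format. If a numeric month is wanted for other comparisons, one alternative is to parse the dates first. This is only a sketch: `toy_dates` is a made-up vector written in the same m/d/yy style, not part of the dataset.

```r
# Hypothetical example dates, written in the m/d/yy format of the `date` column
toy_dates <- c("4/9/16", "4/5/16", "10/1/16", "7/4/16")

# Parse the strings into Date objects, then pull out the month number
parsed_dates <- as.Date(toy_dates, format = "%m/%d/%y")
game_month <- as.integer(format(parsed_dates, "%m"))

# Same flag as the str_detect() version, but based on the parsed month
april <- game_month == 4
print(data.frame(date = toy_dates, month = game_month, april = april))
```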

The first part of this analysis looks at the league as a whole. I created a scatter plot of total runs scored vs. wind speed (in mph) for the games within the dataset.

As you can see, there are a lot of data points in this set and they are spread out relatively evenly, with the exception of a few outliers. Since the data points are clustered very closely together, it’s hard to say whether or not there is any correlation between total runs and wind speed within this dataset.
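With this many overlapping points, slightly jittered, semi-transparent markers can make the dense regions easier to read. The following is a rough sketch of how such a wind speed plot could be drawn. The helper `plot_runs_vs_wind()` and the `toy_games` data frame are made up for illustration; only the column names `wind_speed` and `total_runs` come from the dataset.

```r
# Hypothetical helper: scatter plot of total runs against wind speed with
# jittered, semi-transparent points so that overlapping games remain visible.
plot_runs_vs_wind <- function(games) {
  plot(jitter(games$wind_speed), jitter(games$total_runs),
       main = "Total Runs vs. Wind Speed",
       xlab = "Wind Speed (mph)", ylab = "Total Runs",
       pch = 16, col = adjustcolor("black", alpha.f = 0.3))
  # Straight-line fit of total runs on wind speed, drawn in red
  fit <- lm(total_runs ~ wind_speed, data = games)
  abline(fit, col = "red")
  invisible(fit)  # return the fitted model for inspection
}

# Made-up values purely to show the call; not taken from the dataset
toy_games <- data.frame(
  wind_speed = c(0, 5, 5, 8, 10, 12, 15, 18),
  total_runs = c(7, 9, 4, 11, 6, 8, 10, 5)
)
fit <- plot_runs_vs_wind(toy_games)
```

On the real data, the call would be `plot_runs_vs_wind(baseball_reference_2016_clean)`.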

# Scatter plot of total runs against temperature for every game in the dataset
plot(total_runs ~ temperature, data = baseball_reference_2016_clean,
     main = "Total Runs vs. Temperature",
     xlab = "Temperature", ylab = "Total Runs", pch = 10)
# Linear fit (red) and lowess smoother (blue) of total runs on temperature
abline(lm(total_runs ~ temperature, data = baseball_reference_2016_clean), col = "red")
with(baseball_reference_2016_clean, lines(lowess(temperature, total_runs), col = "blue"))

This scatter plot displays total runs vs. temperature during the game. As with the plot of total runs vs. wind speed, there are a lot of data points to analyze. In this plot, the data points cluster toward the high end of the temperature scale because the majority of baseball games take place during the summer. Because of this, neither plot offers much insight. So, I decided to narrow the analysis to five teams in the month of April. This way, we will hopefully get a clearer picture of whether weather actually has an effect on total runs. The code below lists the five coldest games in the month of April, along with the home team for each.

baseball_reference_2016_clean %>% 
  filter(april %in% TRUE) %>% 
  arrange(temperature) %>% 
  head(5) %>%
  knitr::kable()
| X1 | attendance | away_team | away_team_errors | away_team_hits | away_team_runs | date | field_type | game_type | home_team | home_team_errors | home_team_hits | home_team_runs | start_time | venue | day_of_week | temperature | wind_speed | wind_direction | sky | total_runs | game_hours_dec | season | home_team_win | home_team_loss | home_team_outcome | april |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2409 | 32419 | New York Yankees | 1 | 13 | 8 | 4/9/16 | on grass | Day Game | Detroit Tigers | 1 | 7 | 4 | 1:10 p.m. Local | Comerica Park | Saturday | 31 | 18 | from Left to Right | Cloudy | 12 | 3.333333 | regular season | 0 | 1 | Loss | TRUE |
| 2412 | 20192 | Cleveland Indians | 3 | 7 | 3 | 4/9/16 | on grass | Day Game | Chicago White Sox | 1 | 10 | 7 | 1:10 p.m. Local | U.S. Cellular Field | Saturday | 32 | 13 | in from Rightfield | Sunny | 10 | 2.716667 | regular season | 1 | 0 | Win | TRUE |
| 2445 | 34493 | Boston Red Sox | 0 | 11 | 6 | 4/5/16 | on grass | Day Game | Cleveland Indians | 1 | 5 | 2 | 1:11 p.m. Local | Progressive Field | Tuesday | 34 | 8 | from Left to Right | Sunny | 8 | 3.216667 | regular season | 0 | 1 | Loss | TRUE |
| 10 | 47820 | Houston Astros | 0 | 6 | 5 | 4/5/16 | on grass | Day Game | New York Yankees | 1 | 4 | 3 | 1:11 p.m. Local | Yankee Stadium III | Tuesday | 36 | 18 | from Left to Right | Sunny | 8 | 3.283333 | regular season | 0 | 1 | Loss | TRUE |
| 2411 | 22799 | Pittsburgh Pirates | 1 | 10 | 1 | 4/9/16 | on grass | Day Game | Cincinnati Reds | 1 | 8 | 5 | 1:11 p.m. Local | Great American Ball Park | Saturday | 38 | 19 | from Left to Right | Cloudy | 6 | 3.000000 | regular season | 1 | 0 | Win | TRUE |

I decided to focus my case study on these five teams, with the exception of the New York Yankees. Although they hosted the fourth coldest game in April, it didn’t make sense to analyze only one of the two Chicago teams. Because of this, I chose to replace the Yankees with the Chicago Cubs. The code below selects all of the night games (within the dataset) that these five teams played at home in the month of April.

# Keep April night games hosted by the five selected teams
Nighttop5 <- baseball_reference_2016_clean %>%
  filter(
    home_team %in% c("Chicago Cubs", "Chicago White Sox", "Detroit Tigers",
                     "Cleveland Indians", "Cincinnati Reds"),
    april,
    game_type == "Night Game"
  )
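Before plotting, it can help to check how many games survive the filter, since a handful of games per team limits how much any trend can be trusted. A small sketch, using a made-up stand-in for `Nighttop5` (the rows below are placeholders, not real games):

```r
# Placeholder rows standing in for the filtered data frame; not real games
toy_night <- data.frame(
  home_team = c("Chicago Cubs", "Chicago Cubs", "Detroit Tigers",
                "Cincinnati Reds", "Cleveland Indians", "Detroit Tigers"),
  total_runs = c(9, 5, 7, 12, 3, 8)
)

# Number of games per home team in the filtered set
games_per_team <- table(toy_night$home_team)
print(games_per_team)

# Total number of games available for the scatter plots
n_games <- nrow(toy_night)
print(n_games)
```

On the real data, `table(Nighttop5$home_team)` and `nrow(Nighttop5)` would give the same summary.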

Now that I have a clear picture of the games that these teams played in April, I can plot these data points and see if there is any correlation.

# Scatter plot of total runs against temperature for the April night games
plot(total_runs ~ temperature, data = Nighttop5,
     main = "Total Runs vs. Temperature",
     xlab = "Temperature", ylab = "Total Runs", pch = 10)
# Linear fit of total runs on temperature
abline(lm(total_runs ~ temperature, data = Nighttop5), col = "red")

The plot above shows a slight positive correlation between total runs and temperature. There are a few outliers within this filtered dataset: three games where fewer than 10 total runs were scored while the temperature was between 70 and 80 degrees. This could be because both starting pitchers were very good and shut down the opposing offenses. Outside of these three points, you can see that as the temperature gets warmer, slightly more runs are scored.
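A correlation coefficient can put a number on a "slight positive correlation" read off a plot. The sketch below uses base R's `cor.test()`; the helper name `runs_temp_correlation()` and the `toy_night` values are invented for illustration and are not results from the dataset.

```r
# Hypothetical helper: Pearson correlation between temperature and total runs,
# returned with its p-value and the number of complete games used.
runs_temp_correlation <- function(games) {
  complete <- games[complete.cases(games[, c("temperature", "total_runs")]), ]
  test <- cor.test(complete$temperature, complete$total_runs)
  c(r = unname(test$estimate), p_value = test$p.value, n = nrow(complete))
}

# Made-up values purely to show the call; not taken from the dataset
toy_night <- data.frame(
  temperature = c(45, 50, 55, 60, 65, 70, 75, 78),
  total_runs  = c(5, 8, 6, 9, 7, 11, 8, 12)
)
result <- runs_temp_correlation(toy_night)
print(round(result, 3))
```

Calling `runs_temp_correlation(Nighttop5)` would apply the same check to the filtered night games. With a small number of games, the p-value is worth reading alongside the coefficient.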

The code below produces the scatter plot of total runs vs. wind speed for the filtered dataset.

# Scatter plot of total runs against wind speed for the April night games
plot(total_runs ~ wind_speed, data = Nighttop5,
     main = "Total Runs vs. Wind Speed",
     xlab = "Wind Speed", ylab = "Total Runs", pch = 10)
# Linear fit of total runs on wind speed
abline(lm(total_runs ~ wind_speed, data = Nighttop5), col = "red")

The plot above shows a much less clear relationship between the two variables. The data points are scattered all over the plot and no real correlation can be derived. The next part of my analysis looks at the same five teams. However, instead of focusing on the night games, I want to take a look at the day games and see if they support or contradict the patterns from the night games.

The code below selects the April day games that the five teams played at home from the original dataset.

# Keep April day games hosted by the five selected teams
Daytop5 <- baseball_reference_2016_clean %>%
  filter(
    home_team %in% c("Chicago Cubs", "Chicago White Sox", "Detroit Tigers",
                     "Cleveland Indians", "Cincinnati Reds"),
    april,
    game_type == "Day Game"
  )

Now that the data has been filtered, I will create similar scatter plots to the ones that analyze the night games.

# Scatter plot of total runs against temperature for the April day games
plot(total_runs ~ temperature, data = Daytop5,
     main = "Total Runs vs. Temperature",
     xlab = "Temperature", ylab = "Total Runs", pch = 10)
# Linear fit of total runs on temperature
abline(lm(total_runs ~ temperature, data = Daytop5), col = "red")

As with the night games, this plot shows a slight positive correlation between total runs and temperature. Like the other plot, it has some outliers, which is not surprising because baseball is a sport with a lot of unpredictable factors. For the majority of these data points, though, total runs tend to rise with temperature.
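Since a few unusual games can pull a straight-line fit around, a rank-based (Spearman) correlation is one way to check whether an upward trend holds without leaning as heavily on those outliers. A minimal sketch with invented numbers (`toy_day` is not real data):

```r
# Made-up values purely for illustration; the last game is a deliberate outlier
toy_day <- data.frame(
  temperature = c(40, 48, 52, 57, 63, 68, 74),
  total_runs  = c(4, 6, 7, 9, 10, 12, 2)
)

# Pearson measures linear association on the raw values
pearson_r <- cor(toy_day$temperature, toy_day$total_runs, method = "pearson")

# Spearman applies the same idea to the ranks of the values
spearman_rho <- cor(toy_day$temperature, toy_day$total_runs, method = "spearman")

print(c(pearson = pearson_r, spearman = spearman_rho))
```

The same two calls with `Daytop5$temperature` and `Daytop5$total_runs` would show whether the two measures agree on the real day games.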

# Scatter plot of total runs against wind speed for the April day games
plot(total_runs ~ wind_speed, data = Daytop5,
     main = "Total Runs vs. Wind Speed",
     xlab = "Wind Speed", ylab = "Total Runs", pch = 10)
# Linear fit of total runs on wind speed
abline(lm(total_runs ~ wind_speed, data = Daytop5), col = "red")

The lack of correlation between wind speed and total runs in the night games stayed consistent in the day games. Overall, it appears that wind speed does not play a role in influencing the total runs scored in a baseball game. The opposite can be said for temperature. With the exception of a few games where the starting pitchers were exceptional, it appears that the warmer the temperature is, the more runs are likely to be scored.
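One way to weigh temperature and wind speed together, rather than one plot at a time, is a linear model with both as predictors. This is only a sketch of that idea, not the analysis performed above; `toy_games` holds invented values and `fit_weather_model()` is a hypothetical helper.

```r
# Hypothetical helper: regress total runs on temperature and wind speed together
fit_weather_model <- function(games) {
  lm(total_runs ~ temperature + wind_speed, data = games)
}

# Invented values purely to show the call; not taken from the dataset
toy_games <- data.frame(
  temperature = c(38, 45, 51, 58, 62, 67, 71, 76, 80, 84),
  wind_speed  = c(18, 5, 12, 8, 15, 3, 10, 7, 14, 6),
  total_runs  = c(5, 6, 8, 7, 9, 8, 11, 9, 12, 10)
)
model <- fit_weather_model(toy_games)

# Each coefficient estimates the change in total runs per one-unit increase in
# that variable, holding the other fixed
print(coef(summary(model)))
```

On the real data, `fit_weather_model(Nighttop5)` or `fit_weather_model(Daytop5)` would apply the same model to the filtered games, though the estimates may be imprecise if only a small number of games remain after filtering.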

The game of baseball is at a crossroads. As the NFL and NBA continue to gain popularity, Major League Baseball is at risk of losing a lot of viewers. The majority of people who watch baseball hope to see a high-scoring game with lots of hits. The finding that total runs scored and temperature are positively correlated could be useful to MLB’s decision makers. Perhaps, in order to gain viewers, MLB could push back the start of the season or shorten it so that more games are played in warm weather, when more runs tend to be scored.