The game of baseball has been near and dear to my heart for as long as I can remember. From playing all through school to watching daily over the summer, baseball will always be a part of my life. For my final project, I used a dataset from Kaggle that contains a wide variety of statistics for the games played in the 2016 MLB season. I decided to focus on weather and its impact on the total runs scored in a game. Before starting this analysis, it's important to note a few caveats. Some teams play in domed stadiums, so weather has no impact on those games. Additionally, most of the baseball season takes place over the summer, when temperatures are fairly consistent. First I will analyze the dataset as a whole to uncover any noticeable trends. Afterwards, I will look at a smaller group of teams to see whether any additional patterns appear.
For more information on the dataset used in this project, see the Baseball Reference Kaggle Dataset.
library(tidyverse)
library(tidytext)
library(readr)
baseball_reference_2016_clean <- read_csv("/Users/michaelfaccibene/Documents/UpperClassman/Junior Year/Spring/MEA 329/baseball_reference_2016_clean.csv")
## Warning: Missing column names filled in: 'X1' [1]
# Flag April games: dates are stored as "m/d/yy" strings, so April dates start with "4/"
baseball_reference_2016_clean <- baseball_reference_2016_clean %>%
  mutate(april = str_detect(date, "^4/"))
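The April flag above depends on the month appearing first in each date string. As a rough alternative sketch, the month could be taken from a parsed date instead. The helper name `is_april` is hypothetical, and the `"%m/%d/%y"` format is an assumption based on the date values shown in this post (for example `4/9/16`).

```r
# Hypothetical helper (not part of the original analysis):
# returns TRUE when a "m/d/yy" date string falls in April.
is_april <- function(date_chr) {
  # Assumes month/day/two-digit-year strings such as "4/9/16"
  parsed <- as.Date(date_chr, format = "%m/%d/%y")
  format(parsed, "%m") == "04"
}

# Example strings in the same style as the dataset's date column
april_flags <- is_april(c("4/9/16", "10/4/16", "5/1/16"))
print(april_flags)
```

Parsing the date first makes the check independent of how the month happens to be written, at the cost of an extra conversion step.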
The first part of this analysis looks at the league as a whole. I created a scatter plot of total runs scored vs. wind speed (in mph) for the games in the dataset.
As you can see, there are a lot of data points in this set, and apart from a few outliers they form one dense cluster with no obvious direction. Because the points overlap so heavily, it's hard to say whether there is any correlation between total runs and wind speed within this dataset.
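When a scatter plot is too crowded to read, a correlation coefficient gives a single number to go with it. The sketch below shows one way to compute it with base R's `cor.test`. The data frame `toy_games` and its values are made up purely for illustration and are not taken from the Kaggle dataset; only the column names mirror the ones used in this post.

```r
# Toy data for illustration only; these are NOT values from the Kaggle dataset.
toy_games <- data.frame(
  wind_speed = c(3, 5, 8, 10, 12, 15, 18, 20),
  total_runs = c(9, 7, 11, 6, 10, 8, 12, 7)
)

# Pearson correlation between wind speed and total runs, with a significance test
wind_test <- cor.test(toy_games$wind_speed, toy_games$total_runs)

wind_r <- unname(wind_test$estimate)  # correlation coefficient, between -1 and 1
wind_p <- wind_test$p.value           # p-value for the null hypothesis of zero correlation

cat("r =", round(wind_r, 3), " p-value =", round(wind_p, 3), "\n")
```

A coefficient close to 0 together with a large p-value would back up the impression that the plot shows no clear relationship.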
# Total runs (y) against temperature (x), so the axes match the fitted model
plot(total_runs ~ temperature, data = baseball_reference_2016_clean,
     main = "Total Runs vs. Temperature",
     xlab = "Temperature", ylab = "Total Runs", pch = 10)
# Least-squares line (red) and lowess smoother (blue)
abline(lm(total_runs ~ temperature, data = baseball_reference_2016_clean), col = "red")
lines(with(baseball_reference_2016_clean, lowess(temperature, total_runs)), col = "blue")
This is a scatter plot of total runs vs. the temperature during the game. Similarly to the plot on total runs vs. wind speed, there are a lot of data points to analyze. In this plot, most data points sit at the high end of the temperature scale because the majority of baseball games take place during the summer. Because of this, little insight can be drawn from either of these plots. So, I decided to narrow down this analysis to focus on five teams in the month of April. This way, we will hopefully get a clearer picture of whether weather actually has an effect on total runs. The code below lists the five coldest games played in the month of April, along with their home teams.
baseball_reference_2016_clean %>%
  filter(april %in% TRUE) %>%
  arrange(temperature) %>%
  head(5) %>%
  knitr::kable()
| X1 | attendance | away_team | away_team_errors | away_team_hits | away_team_runs | date | field_type | game_type | home_team | home_team_errors | home_team_hits | home_team_runs | start_time | venue | day_of_week | temperature | wind_speed | wind_direction | sky | total_runs | game_hours_dec | season | home_team_win | home_team_loss | home_team_outcome | april |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2409 | 32419 | New York Yankees | 1 | 13 | 8 | 4/9/16 | on grass | Day Game | Detroit Tigers | 1 | 7 | 4 | 1:10 p.m. Local | Comerica Park | Saturday | 31 | 18 | from Left to Right | Cloudy | 12 | 3.333333 | regular season | 0 | 1 | Loss | TRUE |
| 2412 | 20192 | Cleveland Indians | 3 | 7 | 3 | 4/9/16 | on grass | Day Game | Chicago White Sox | 1 | 10 | 7 | 1:10 p.m. Local | U.S. Cellular Field | Saturday | 32 | 13 | in from Rightfield | Sunny | 10 | 2.716667 | regular season | 1 | 0 | Win | TRUE |
| 2445 | 34493 | Boston Red Sox | 0 | 11 | 6 | 4/5/16 | on grass | Day Game | Cleveland Indians | 1 | 5 | 2 | 1:11 p.m. Local | Progressive Field | Tuesday | 34 | 8 | from Left to Right | Sunny | 8 | 3.216667 | regular season | 0 | 1 | Loss | TRUE |
| 10 | 47820 | Houston Astros | 0 | 6 | 5 | 4/5/16 | on grass | Day Game | New York Yankees | 1 | 4 | 3 | 1:11 p.m. Local | Yankee Stadium III | Tuesday | 36 | 18 | from Left to Right | Sunny | 8 | 3.283333 | regular season | 0 | 1 | Loss | TRUE |
| 2411 | 22799 | Pittsburgh Pirates | 1 | 10 | 1 | 4/9/16 | on grass | Day Game | Cincinnati Reds | 1 | 8 | 5 | 1:11 p.m. Local | Great American Ball Park | Saturday | 38 | 19 | from Left to Right | Cloudy | 6 | 3.000000 | regular season | 1 | 0 | Win | TRUE |
I decided to focus my case study on the home teams from these five games, with the exception of the New York Yankees. Although the Yankees hosted the fourth-coldest game in April, it didn't make sense to analyze only one of the two teams in Chicago. Because of this, I chose to replace the Yankees with the Chicago Cubs. The code below selects all of the night games (within the dataset) that these five teams played at home in the month of April.
Nighttop5 <- baseball_reference_2016_clean %>%
  filter(home_team %in% c("Chicago Cubs", "Chicago White Sox", "Detroit Tigers",
                          "Cleveland Indians", "Cincinnati Reds"),
         april %in% TRUE, game_type %in% "Night Game")
Now that I have a clear picture of the games that these teams played in April, I can plot these data points and see if there is any correlation.
# Night games: total runs (y) against temperature (x), with a least-squares line
plot(total_runs ~ temperature, data = Nighttop5,
     main = "Total Runs vs. Temperature",
     xlab = "Temperature", ylab = "Total Runs", pch = 10)
abline(lm(total_runs ~ temperature, data = Nighttop5), col = "red")
The plot above shows a slight positive correlation between total runs and temperature. There are a few outliers within this filtered dataset: in three games, fewer than 10 total runs were scored even though the temperature was between 70 and 80 degrees. This could be because two very good pitchers were throwing and shut down the opposing offenses. Outside of these three points, you can see that as the temperature gets warmer, slightly more runs are scored.
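To put a number on a "slight" trend like this one, the slope and R-squared of the fitted line can be read off the model object. The sketch below uses a made-up data frame, `toy_night`, that is not taken from the Kaggle dataset; the same two calls could be applied to `Nighttop5` instead.

```r
# Toy data for illustration only; these are NOT values from the Kaggle dataset.
toy_night <- data.frame(
  temperature = c(40, 45, 50, 55, 60, 65, 70),
  total_runs  = c(5, 7, 6, 9, 8, 11, 10)
)

# Same model form as the red line in the plots: total runs as a function of temperature
fit <- lm(total_runs ~ temperature, data = toy_night)

slope     <- unname(coef(fit)["temperature"])  # extra runs per additional degree
r_squared <- summary(fit)$r.squared            # share of variance explained by temperature

cat("slope =", round(slope, 3), " R-squared =", round(r_squared, 3), "\n")
```

A small positive slope with a low R-squared would match the description of a slight positive correlation with a lot of scatter around the line.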
The code below produces the scatter plot of total runs vs. wind speed for the filtered dataset.
# Night games: total runs (y) against wind speed (x)
plot(total_runs ~ wind_speed, data = Nighttop5,
     main = "Total Runs vs. Wind Speed",
     xlab = "Wind Speed", ylab = "Total Runs", pch = 10)
# Fit the line on wind_speed so it matches the variable on the x-axis
abline(lm(total_runs ~ wind_speed, data = Nighttop5), col = "red")
The plot above shows a much less clear relationship between the two variables. The data points are scattered all over the plot, and no real correlation can be identified. The next part of my analysis looks at the same five teams. However, instead of focusing on the night games, I want to take a look at the day games and see whether they support or contradict the correlations from the night games.
The code below selects the April day games for the five teams from the original dataset.
Daytop5 <- baseball_reference_2016_clean %>%
  filter(home_team %in% c("Chicago Cubs", "Chicago White Sox", "Detroit Tigers",
                          "Cleveland Indians", "Cincinnati Reds"),
         april %in% TRUE, game_type %in% "Day Game")
Now that the data has been filtered, I will create scatter plots similar to the ones for the night games.
# Day games: total runs (y) against temperature (x), with a least-squares line
plot(total_runs ~ temperature, data = Daytop5,
     main = "Total Runs vs. Temperature",
     xlab = "Temperature", ylab = "Total Runs", pch = 10)
abline(lm(total_runs ~ temperature, data = Daytop5), col = "red")
Similarly to the plot for the night games, this plot shows a slight positive correlation between total runs and temperature. Like the other plot, it has some outliers, which is not surprising because baseball is a sport with a lot of unpredictable factors. For the majority of these data points, though, total runs tend to rise with temperature.
# Day games: total runs (y) against wind speed (x)
plot(total_runs ~ wind_speed, data = Daytop5,
     main = "Total Runs vs. Wind Speed",
     xlab = "Wind Speed", ylab = "Total Runs", pch = 10)
# Fit the line on wind_speed so it matches the variable on the x-axis
abline(lm(total_runs ~ wind_speed, data = Daytop5), col = "red")
The day games show the same lack of correlation between wind speed and total runs as the night games. Overall, it appears as though wind speed does not play a role in influencing the total runs scored in a baseball game. The opposite can be said for temperature. With the exception of a few games where the starting pitchers were exceptional, it appears as though the warmer the temperature is, the higher the likelihood that more runs will be scored.
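One way to compare the day and night results side by side is to compute the temperature correlation separately for each game type. The sketch below does this in base R on a made-up data frame, `toy_games`, which is not taken from the Kaggle dataset; only the column names and the two `game_type` labels mirror the ones in this post.

```r
# Toy data for illustration only; these are NOT values from the Kaggle dataset.
toy_games <- data.frame(
  game_type   = rep(c("Day Game", "Night Game"), each = 5),
  temperature = c(45, 50, 55, 60, 65, 40, 48, 52, 58, 66),
  total_runs  = c(6, 8, 7, 10, 9, 5, 9, 6, 8, 11)
)

# Split the games by type, then compute one Pearson correlation per subset
cor_by_type <- sapply(
  split(toy_games, toy_games$game_type),
  function(d) cor(d$temperature, d$total_runs)
)

print(round(cor_by_type, 2))
```

If both subsets give coefficients with the same sign and similar size, that supports reading the day and night plots as telling the same story.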
The game of baseball is at a crossroads. As the NFL and NBA continue to gain popularity, Major League Baseball is at risk of losing a lot of viewers. The majority of people who watch baseball hope to see a high-scoring game with lots of hits. The positive correlation between total runs scored and temperature found in this study could be useful information for MLB's decision makers. Perhaps, in order to gain viewers, MLB could push back the start of its season or shorten it so that more games are played in warmer weather and more runs are scored.