Introduction

The data I decided to work with in this project is a dataset of all games played in MLB (Major League Baseball) from 1871 to 2016, collected by Retrosheet. The dataset contains 161 columns and about 172,000 recorded games. It holds stats about the teams in each game, split by visiting team and home team. The categorical data includes the league each team belongs to, the player names and positions, the starting pitcher names, and more. The numerical data for each team includes the final score and the number of hits, at bats, home runs, singles, doubles, triples, double plays, sacrifices, and much more. There is also categorical data on when the game took place: the date, the day of the week, whether the game was played during the day or at night, and the names of the umpires who officiated. Quantitative data about the game itself includes the attendance and the length of the game in both minutes and outs.

Background Research on Retrosheet

Retrosheet conducts its research in a distinctive way. To compile information on more than a century of baseball games, it first consults the baseball organizations that took part in those games to obtain whatever records they kept. To fill the gaps the teams themselves could not cover, it also draws on third-party fans who scored the games themselves while watching, a pastime many baseball fans enjoy alongside the game itself. This process has given Retrosheet data on a large number of games from 1871 to the present, and has allowed it to fully record every single game played in MLB from 1971 to the present. Because each team or third party formats its records very differently, Retrosheet then converts the data, which usually arrives on paper, into a uniform digital format and makes it freely available on its website.

This is the link to the website I used for my research: click here.

Loading in the data and packages
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.2.1     ✔ dplyr   1.1.1
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## Warning: package 'tibble' was built under R version 4.2.3
## Warning: package 'dplyr' was built under R version 4.2.3
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.3
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(RColorBrewer)
game_logs <- read_csv("game_logs.csv")
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
## Rows: 171907 Columns: 161
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (75): day_of_week, v_name, v_league, h_name, h_league, day_night, protes...
## dbl (83): date, number_of_game, v_game_number, h_game_number, v_score, h_sco...
## lgl  (3): completion, forefeit, rf_umpire_id
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Here I loaded in the necessary packages and the data set for this assignment.

Cleaning the data
game_logs_data <- game_logs %>%
  select(day_of_week, v_name, v_league, h_name, h_league, v_score, h_score, length_outs, day_night, attendance, length_minutes, v_at_bats, v_hits, v_homeruns, h_at_bats, h_hits, h_homeruns)

Since this data set has a very large number of columns to work with, I selected only the ones I found interesting enough to consider for this assignment.

game_logs_data <- game_logs_data %>%
  # Drop games with the "na" league placeholder or a missing value in any of the listed columns
  filter(v_league != "na", h_league != "na",
         if_all(c(attendance, length_outs, day_night, length_minutes, v_at_bats, v_hits, v_homeruns, h_at_bats, h_hits, h_homeruns), ~ !is.na(.x)))
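
A quick way to check how much data a filter like the one above removes is to count missing values per column and compare row counts before and after. The sketch below is only an illustration on a made-up data frame (`toy_games` and its values are assumptions, not the real game logs), and it uses base R's `complete.cases()` rather than the exact filter above.

```r
# Toy stand-in for game_logs_data; the values are made up for illustration.
toy_games <- data.frame(
  v_league   = c("NL", "na", "AL", "NL", "AL"),
  h_league   = c("NL", "na", "AL", "NL", "AL"),
  attendance = c(12000, NA, 30500, NA, 18250),
  v_hits     = c(8, 5, NA, 10, 7),
  stringsAsFactors = FALSE
)

# Count missing values per column, treating the string "na" as missing too.
missing_per_column <- sapply(
  toy_games,
  function(col) sum(is.na(col) | col == "na")
)

# Keep rows with real league codes and no NA in any column.
keep <- toy_games$v_league != "na" &
  toy_games$h_league != "na" &
  complete.cases(toy_games)

rows_before <- nrow(toy_games)
rows_after  <- sum(keep)

print(missing_per_column)
cat("Rows kept:", rows_after, "of", rows_before, "\n")
```

Running the same counts on the real data before filtering would show which columns are responsible for most of the dropped games.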

The columns I selected had many missing values in the early years, when records were not kept as carefully, so I filtered the data to remove every game with missing information, dropping both NA values and “na” categorical values.

Statistical Analysis

ggplot(game_logs_data, aes(x = day_of_week, y = attendance)) +
  geom_boxplot(fill = "steelblue", color = "white", width = 0.7) +
  labs(title = "Baseball Game Attendance by Day of the Week",
       x = "Day of the Week",
       y = "Attendance") +
  theme_minimal() +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

For this side-by-side box plot I wanted to see whether attendance differs noticeably between the days of the week. The plot suggests that Friday and the weekend have higher typical attendance than the other four weekdays, but I was surprised to find that Thursday seems to be the least popular day to attend a baseball game.
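
A box plot shows the differences visually, but it does not say whether they are statistically significant. One possible way to back up the reading is to compute the median per day and run a Kruskal-Wallis test, which does not assume normally distributed attendance. The sketch below uses a made-up data frame (`toy_attendance`, with only three days and invented values) purely to show the calls; it is not the analysis performed in this report.

```r
# Toy data: three days with made-up attendance figures.
toy_attendance <- data.frame(
  day_of_week = c("Mon", "Mon", "Mon", "Thu", "Thu", "Thu", "Sat", "Sat", "Sat"),
  attendance  = c(15000, 18000, 21000, 12000, 14000, 16000, 30000, 34000, 38000)
)

# Fix the day order so tables and plots do not sort alphabetically.
toy_attendance$day_of_week <- factor(
  toy_attendance$day_of_week,
  levels = c("Mon", "Thu", "Sat")
)

# Median attendance for each day (the line inside each box of a box plot).
median_by_day <- tapply(toy_attendance$attendance,
                        toy_attendance$day_of_week,
                        median)

# Rank-based test of whether at least one day differs from the others.
kw <- kruskal.test(attendance ~ day_of_week, data = toy_attendance)

print(median_by_day)
print(kw)
```

A small p-value would only say that at least one day differs; it would not say which one, so the medians are still needed to interpret the result.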

correlation <- cor(game_logs_data$v_hits, game_logs_data$h_hits)
cat("Correlation between Visitor Hits and Home Hits:", correlation)
## Correlation between Visitor Hits and Home Hits: 0.1304966

Here I wanted to see whether the number of hits one team gets is correlated with the number of hits the other team gets. Based on the result, the correlation between the two is weak. This makes sense, because how well one team hits should have little to no impact on how well the other team hits.
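
To put the size of a correlation in perspective, it can help to square it: a coefficient of 0.1304966 corresponds to an r² of about 0.017, so under 2% of the variation in one team's hits is linearly associated with the other's. The sketch below shows how `cor.test()` could be used to get a confidence interval alongside the coefficient. The data frame `toy_hits` and its values are made up for illustration.

```r
# Toy data: made-up hit totals for eight games.
toy_hits <- data.frame(
  v_hits = c(5, 9, 7, 12, 6, 8, 10, 4),
  h_hits = c(8, 7, 10, 9, 5, 11, 6, 9)
)

# Pearson correlation with a test and a 95% confidence interval.
result <- cor.test(toy_hits$v_hits, toy_hits$h_hits)

r         <- unname(result$estimate)  # same value cor() would return
r_squared <- r^2                      # share of variance linearly explained
ci        <- result$conf.int          # lower and upper bound for r

# r squared for the coefficient reported in this section.
reported_r_squared <- 0.1304966^2     # about 0.017

cat("r:", r, " r^2:", r_squared, "\n")
cat("95% CI:", ci[1], "to", ci[2], "\n")
```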

Simple Plots

ggplot(game_logs_data, aes(x = v_hits, y = h_hits)) +
  geom_point() +
  labs(title = "Visitor Hits vs. Home Hits",
       x = "Visitor Hits",
       y = "Home Hits") +
  theme_light()

I then created a simple plot to visualize this correlation, and it shows very little relationship between the two variables, which is consistent with the weak correlation of about 0.13.
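
Because hit totals are whole numbers, many games land on exactly the same point, and a plain scatter plot cannot show how many games sit on each point. One option, sketched here on a made-up data frame (`toy_hits` is an assumption, not the real data), is to count the repeated pairs and let `geom_count()` scale each point by that count.

```r
library(ggplot2)

# Toy data: made-up hit totals with several repeated pairs.
toy_hits <- data.frame(
  v_hits = c(8, 8, 8, 5, 5, 10, 10, 7),
  h_hits = c(9, 9, 9, 6, 6, 4, 4, 7)
)

# Count how many games share each (visitor hits, home hits) pair.
pair_counts <- as.data.frame(
  table(v_hits = toy_hits$v_hits, h_hits = toy_hits$h_hits)
)
pair_counts <- pair_counts[pair_counts$Freq > 0, ]

# geom_count() draws one point per pair, sized by the number of games.
p <- ggplot(toy_hits, aes(x = v_hits, y = h_hits)) +
  geom_count(alpha = 0.6) +
  labs(title = "Visitor Hits vs. Home Hits (toy data)",
       x = "Visitor Hits",
       y = "Home Hits",
       size = "Games") +
  theme_light()

print(pair_counts)
```

Printing `p` would render the plot; setting `alpha` on `geom_point()` is a lighter alternative with a similar effect.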

ggplot(game_logs_data, aes(x = v_homeruns, y = h_homeruns)) +
  geom_point() +
  labs(title = "Visitor Homeruns vs. Home Homeruns",
       x = "Visitor Homeruns",
       y = "Home Homeruns") +
  theme_bw()

Here I created a similar plot to see whether the home runs hit by each team are related, and these two variables likewise seem to have little to no correlation.

ggplot(game_logs_data, aes(x = attendance, fill = h_league)) +
  geom_histogram(binwidth = 1000, position = "dodge") +  
  labs(title = "Attendance by Home League",  
       x = "Attendance",  
       y = "Frequency") +  
  scale_fill_discrete(name = "Home League")  

For this simple visualization I created a histogram to look at the distribution of attendance. I wanted to see whether the league in which the game was played had an effect on attendance numbers, so I filled by the home team’s league, because the game takes place in the home team’s stadium. As I expected, the attendances for American and National League teams are similar, but FL teams seem to have far less. In this data set FL stands for the Federal League, a short-lived rival league that played only in 1914 and 1915, so these games are few in number and likely come from an era of much smaller crowds, which would explain the lower attendance.
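
Because the leagues contribute very different numbers of games, raw histogram counts can make a small league look low simply because it has fewer games. A minimal sketch, using a made-up data frame (`toy_leagues` and its values are assumptions), of how game counts and median attendance per league could be compared directly:

```r
# Toy data: made-up attendance for three league codes.
toy_leagues <- data.frame(
  h_league   = c("AL", "AL", "AL", "AL", "NL", "NL", "NL", "NL", "FL", "FL"),
  attendance = c(21000, 25000, 18000, 30000, 22000, 27000, 19000, 24000, 3000, 5000)
)

# Number of games per home league.
games_per_league <- table(toy_leagues$h_league)

# Median attendance per home league, which does not depend on group size.
median_attendance <- tapply(toy_leagues$attendance,
                            toy_leagues$h_league,
                            median)

print(games_per_league)
print(median_attendance)
```

Looking at both numbers separates "fewer games" from "smaller crowds per game", which a dodged histogram of counts mixes together.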

Final Plot

game_logs_data <- game_logs_data %>%
  mutate(score_diff = h_score - v_score,
         hits_diff = h_hits - v_hits)

For my final plot, I wanted to make a scatter plot comparing the difference in the number of hits each team got with the difference in the final score. To do this I created two new variables, score_diff and hits_diff, to use in my visualization.
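
To make the sign convention concrete, here is a small worked example on made-up games (`toy_games` and its values are assumptions). Both differences are computed as home minus visitor, so positive values favor the home team and negative values favor the visitors.

```r
# Toy data: four made-up games.
toy_games <- data.frame(
  h_score = c(5, 2, 6, 3),
  v_score = c(3, 4, 1, 4),
  h_hits  = c(10, 6, 9, 11),
  v_hits  = c(7, 9, 9, 8)
)

# Home minus visitor, matching the definitions of score_diff and hits_diff.
toy_games$score_diff <- toy_games$h_score - toy_games$v_score
toy_games$hits_diff  <- toy_games$h_hits - toy_games$v_hits

print(toy_games)
```

In the last row the home team has three more hits (hits_diff of 3) but loses by one run (score_diff of -1), which is the kind of game that lands off the main diagonal of the scatter plot.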

plotly::plot_ly(data = game_logs_data, 
                x = ~hits_diff, 
                y = ~score_diff, 
                size = ~attendance, 
                color = ~day_of_week, 
                colors = "Set3") %>%
  plotly::add_markers(text = ~paste("Attendance:", attendance, "<br>",
                                    "Hits Differential:", hits_diff, "<br>",
                                    "Score Differential:", score_diff), 
                      hoverinfo = "text") %>%
  plotly::layout(
    title = "Interactive Scatter Plot of Score Differential vs Hits Differential",
    xaxis = list(title = "Hit Differential"),
    yaxis = list(title = "Score Differential"),
    showlegend = TRUE,
    legend = list(title = "Day of Week"),
    margin = list(l = 50, r = 50, b = 50, t = 50)
  )
## Warning: `line.width` does not currently support multiple values.

Statistical Analysis of the Final Plot and Visualization Explanation/Analysis

correlation2 <- cor(game_logs_data$hits_diff, game_logs_data$score_diff)
cat("Correlation coefficient between hits_diff and score_diff:", correlation2)
## Correlation coefficient between hits_diff and score_diff: 0.7376067

For my final visualization I decided to create a scatter plot comparing the difference in the number of hits each team got during a game with the difference in the number of runs they ended up scoring. This way I could show whether there is a strong relationship between getting more hits and scoring more runs than your opponent, which logically would make sense. I also decided to size each point by the attendance of the game and to color each point by the day of the week on which the game took place, to see if either of these variables showed any patterns. I used plotly to display the hit differential, score differential, and attendance for each point on hover. As I predicted before making this visualization, there is indeed a strong relationship between the hit differential and the score differential: the team that got more hits, whether it was the home team on the positive side of the x axis or the visiting team on the negative side, scored more runs much more often than not. What surprised me is that the correlation wasn’t even stronger, considering that throughout my life growing up with baseball, the main thing coaches said was that generating more hits would lead to winning more games. But there are many games on this graph where one team had quite a few more hits than the other, yet the team with fewer hits ended up winning. This shows that while hits correlate strongly with runs scored, and by proxy with games won, hits are not the only factor that contributes to outscoring your opponent in a baseball game.
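
Two numbers can make this reading more concrete. First, a correlation of 0.7376067 corresponds to an r² of about 0.54, so roughly half of the variation in score differential is linearly associated with hit differential. Second, the share of games won by the team with more hits can be computed directly. The sketch below shows one way to do that on made-up values (`toy_diffs` is an assumption, not the real data), together with a simple linear fit whose slope estimates the runs gained per extra hit.

```r
# Toy data: made-up hit and score differentials (home minus visitor).
toy_diffs <- data.frame(
  hits_diff  = c(3, -3, 0, 3, 5, -2, 1, -4),
  score_diff = c(2, -2, 5, -1, 4, -1, -2, -6)
)

# Ignore games where hits or score were level, since neither side "had more".
decided <- toy_diffs[toy_diffs$hits_diff != 0 & toy_diffs$score_diff != 0, ]

# TRUE when the team with more hits also scored more runs.
more_hits_won <- sign(decided$hits_diff) == sign(decided$score_diff)
share_more_hits_won <- mean(more_hits_won)

# Slope of a linear fit: estimated change in score_diff per extra hit.
fit <- lm(score_diff ~ hits_diff, data = toy_diffs)
runs_per_extra_hit <- coef(fit)[["hits_diff"]]

# r squared for the coefficient reported in this section.
reported_r_squared <- 0.7376067^2  # about 0.54

cat("Share of games won by the team with more hits:", share_more_hits_won, "\n")
cat("Estimated runs per extra hit:", runs_per_extra_hit, "\n")
```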