The data I gathered consists of all MLB hitting statistics from the 2021 regular season. I chose this data because of my love for baseball and my curiosity about which hitting statistics can be used to predict others, and how the different variables relate to one another. I have provided a link to the website I scraped the data from below.
A team wins a baseball game by scoring more runs than the opposing team. Because of this, the question I will look into is which hitting stats (if any) are related to the number of runs a player scores. If we can determine which statistics relate most strongly to runs scored, we will have a better idea of which players are the most valuable to a baseball team’s offense (defense obviously plays a large role as well, but this project focuses only on offense).
To answer this question, I will look specifically at on-base percentage (OBP), batting average (AVG), and slugging average (SLG), and determine whether there is a relationship I can model between those variables and runs scored. I will first look at simple linear regression models, fitting one model for each of my three independent variables against my single dependent variable. This will allow me to visualize the relationships and judge which variables are the best predictors of runs scored.
Simple and multivariable linear regression models are appropriate for my research question. I have three independent variables, ‘OBP’, ‘AVG’, and ‘SLG’, and one dependent variable, ‘runs’. Since all of my variables are quantitative rather than categorical, linear regression is a natural choice for estimating these relationships.
After that, I will create a correlation matrix to determine which of my independent variables has the strongest linear relationship to runs scored. This matrix will tell us which of our simple linear regression models is best at predicting runs scored. It will also help me decide whether a multivariable linear regression model is likely to be a better predictor than my three simple linear regression models.
First, let’s take a look at each of my independent variables so we can understand what they represent. On-base percentage (OBP) refers to how frequently a hitter reaches base per plate appearance. This includes hits, walks, and hit-by-pitches, but does not include times when the player reaches on an error or fielder’s choice. Batting average (AVG) refers to how frequently a player gets a hit per at bat (i.e. the number of hits divided by the number of at bats). This does not take walks into consideration, because a walk does not count as an at bat. A player’s slugging average (SLG), also called slugging percentage, measures the batting productivity of a hitter. It is calculated by dividing the total number of bases (i.e. single = 1 base, home run = 4 bases) by the number of at bats. I chose these three as my independent variables because they are rates rather than totals. For example, if we looked at stolen bases, hits, walks, or doubles, our linear regressions would be skewed because all of these stats increase simply because a player gets more at bats, whereas the three rate statistics I chose do not grow with playing time alone.
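As a worked illustration of the three definitions above, the sketch below computes AVG, OBP, and SLG from raw counts. The player line is hypothetical (it is not taken from the scraped data), and the helper name `rate_stats` is my own. The OBP formula used is the standard official one, which also puts sacrifice flies in the denominator; the scraped data may already provide these rates precomputed.

```r
# Hypothetical season line for one hitter (illustrative numbers only).
player <- list(
  AB = 500, H = 150, doubles = 30, triples = 5, HR = 20,
  BB = 60, HBP = 5, SF = 5
)

# Compute the three rate statistics from raw counts.
rate_stats <- function(p) {
  singles <- p$H - p$doubles - p$triples - p$HR
  total_bases <- singles + 2 * p$doubles + 3 * p$triples + 4 * p$HR
  c(
    AVG = p$H / p$AB,
    OBP = (p$H + p$BB + p$HBP) / (p$AB + p$BB + p$HBP + p$SF),
    SLG = total_bases / p$AB
  )
}

round(rate_stats(player), 3)
```

Note that doubling every count in this example leaves all three values unchanged, which is the property that motivates using rates instead of totals.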
First, I will filter out all of the players who had fewer than 50 at bats. This cleans up my data by removing players whose rate statistics are based on very few at bats, which reduces the number of outliers in each of our models.
# Filter by players who have at least 50 at bats.
mlb_data_factored <- mlb_data %>%
  filter(AB >= 50)
Next, I will plot a histogram of each of my three independent variables. This will help us visualize each variable and give us a better understanding of the shape of its distribution.
## Histogram for 'OBP'
ggplot(data = mlb_data_factored, aes(x = OBP)) +
  geom_histogram(bins = 30) +
  labs(x = "On Base Percentage (OBP)", y = NULL)
Here, we can see that the histogram for OBP is roughly symmetric, with no strong skew in either direction.
## Histogram for 'AVG'
ggplot(data = mlb_data_factored, aes(x = AVG)) +
  geom_histogram(bins = 30) +
  labs(x = "Batting Average (AVG)", y = NULL)
Here we can see that the histogram for AVG is slightly skewed left.
## Histogram for 'SLG'
ggplot(data = mlb_data_factored, aes(x = SLG)) +
  geom_histogram(bins = 30) +
  labs(x = "Slugging Average (SLG)", y = NULL)
Here we can see that the histogram for SLG, like the one for OBP, is roughly symmetric rather than strongly skewed.
Next, I will plot the histogram for my dependent variable ‘runs’.
## Histogram for 'runs'
ggplot(data = mlb_data_factored, aes(x = runs)) +
  # Set bins explicitly, matching the other histograms
  geom_histogram(bins = 30) +
  labs(x = "Runs", y = NULL)
Here we can see that the histogram for my dependent variable ‘runs’ is skewed right. The majority of players score between 0 and 80 runs, with a long tail of players out to about 120 runs. This is what we expect to see, because the few players on each team who play in the most games and hit at the top of the batting order have the most at bats, which increases the number of chances they have to score a run.
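One quick numerical check of right skew is to compare the mean with the median: a long upper tail pulls the mean above the median. The sketch below uses a small set of hypothetical run totals (not the scraped data) purely to illustrate the comparison.

```r
# Hypothetical run totals (illustrative only, not the scraped data).
runs_example <- c(2, 5, 8, 10, 12, 15, 20, 25, 30, 45, 70, 110)

# In a right-skewed sample the long upper tail pulls the mean above the median.
mean_runs <- mean(runs_example)
median_runs <- median(runs_example)
c(mean = mean_runs, median = median_runs)
```

With the project's data, the same comparison could be run on `mlb_data_factored$runs`.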
The next step is to plot our dependent variable against each of our independent variables and fit a linear regression line to each plot.
First, I will plot ‘runs’ as a function of ‘OBP’.
ggplot(data = mlb_data_factored, aes(x = OBP, y = runs)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  # Zoom with coord_cartesian() so no points are dropped before the fit
  coord_cartesian(ylim = c(0, 120)) +
  labs(
    title = "Runs scored vs OBP for players",
    subtitle = "MLB Regular Season Hitting Data",
    x = "OBP",
    y = "Runs Scored"
  )
As we can see from this linear model, there is a positive relationship (slope) between OBP and runs scored. As a player’s on-base percentage increases, so should the number of runs that player scores. This makes sense because the more likely a player is to get on base, the more chances that player will have to score a run. However, we do see some outliers in this plot: a few players have high on-base percentages but low runs scored. This could be due to the fact that some players do not play in very many games and therefore do not have the same number of opportunities to score, even though their on-base percentage is high.
Next, I will plot ‘runs’ as a function of ‘AVG’ using a linear model.
ggplot(data = mlb_data_factored, aes(x = AVG, y = runs)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  # Zoom with coord_cartesian() so no points are dropped before the fit
  coord_cartesian(ylim = c(0, 120)) +
  labs(
    title = "Runs scored vs AVG for players",
    subtitle = "MLB Regular Season Hitting Data",
    x = "AVG",
    y = "Runs Scored"
  )
As we can see from this linear model, there is a positive relationship (slope) between AVG and runs scored. This means that as a player’s batting average (AVG) increases, so should the number of runs that the player scores. This also makes sense: the higher a player’s batting average, the more likely they are to get a hit in an at bat, and the more likely they are to get a hit, the more often they will be in a position to score.
Next, I will plot ‘runs’ as a function of ‘SLG’ using a linear model.
ggplot(data = mlb_data_factored, aes(x = SLG, y = runs)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  # Zoom with coord_cartesian() so no points are dropped before the fit
  coord_cartesian(ylim = c(0, 120)) +
  labs(
    title = "Runs scored vs SLG for players",
    subtitle = "MLB Regular Season Hitting Data",
    x = "SLG",
    y = "Runs Scored"
  )
As we can see, this linear model also shows a positive slope between SLG and runs scored. This means that as slugging average (SLG) increases for an individual player, the number of runs that player scores generally increases as well. This makes sense because SLG represents the total number of bases a player records per at bat (i.e. hitting a home run increases SLG more than hitting a single). If a player averages more bases per at bat, they are more likely to end up in a position that is closer to home plate, which allows them to score more frequently.
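The plots above only draw the fitted lines. To compare the three simple regressions numerically, each one can also be fitted explicitly with `lm()`. The following is a minimal sketch under stated assumptions: the `toy` data frame is a hypothetical stand-in for `mlb_data_factored` with made-up values, so the numbers it prints say nothing about the real data.

```r
# Hypothetical stand-in for mlb_data_factored (illustrative values only).
toy <- data.frame(
  runs = c(20, 35, 50, 62, 75, 90, 41, 58),
  OBP  = c(0.290, 0.305, 0.320, 0.335, 0.350, 0.370, 0.300, 0.340),
  AVG  = c(0.230, 0.240, 0.255, 0.262, 0.270, 0.290, 0.245, 0.260),
  SLG  = c(0.350, 0.390, 0.420, 0.450, 0.480, 0.540, 0.400, 0.470)
)

# Fit one simple regression per predictor and collect slope and R-squared.
fit_summary <- sapply(c("OBP", "AVG", "SLG"), function(v) {
  fit <- lm(reformulate(v, response = "runs"), data = toy)
  c(slope = unname(coef(fit)[v]), r_squared = summary(fit)$r.squared)
})
fit_summary
```

Replacing `toy` with `mlb_data_factored` would give the slope and R-squared for each of the three models plotted above.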
Now that I have created linear regression models for each of my independent variables, I will create a correlation matrix with my three independent variables (OBP, AVG, and SLG) and my dependent variable (runs). This will help us determine whether creating a multivariable linear regression model would be useful. If the independent variables have low correlations with each other, each one contributes separate information and a multivariable linear regression model would be beneficial. If the variables are highly correlated with each other, they largely overlap and a multivariable model would not be as useful. This matrix will also tell us which of our three independent variables is best at predicting the number of runs scored.
# Create a correlation matrix using our desired variables
cor(mlb_data_factored %>% select(runs, OBP, AVG, SLG))
##           runs       OBP       AVG       SLG
## runs 1.0000000 0.6102805 0.5894908 0.6634018
## OBP  0.6102805 1.0000000 0.8555947 0.7707268
## AVG  0.5894908 0.8555947 1.0000000 0.7931061
## SLG  0.6634018 0.7707268 0.7931061 1.0000000
This correlation matrix shows the pairwise relationships among the four variables we have been using. As we can see, the correlation coefficients among OBP, AVG, and SLG are all greater than 0.77. Since these values are high (close to 1), the three independent variables are strongly linearly related and carry much of the same information. This suggests that a multivariable linear model would add relatively little beyond the simple linear regressions we performed earlier, and that its individual coefficients would be hard to interpret.
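One common way to quantify this overlap is the variance inflation factor (VIF), which can be read off the diagonal of the inverse of the predictors' correlation matrix. This diagnostic is an addition for illustration and was not part of the original analysis; the sketch below only reuses the correlations printed above. A frequently cited rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity, although the cutoff is a convention and not a strict test.

```r
# Predictor correlations copied from the correlation matrix above.
pred_cor <- matrix(
  c(1.0000000, 0.8555947, 0.7707268,
    0.8555947, 1.0000000, 0.7931061,
    0.7707268, 0.7931061, 1.0000000),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("OBP", "AVG", "SLG"), c("OBP", "AVG", "SLG"))
)

# The diagonal of the inverse correlation matrix gives each predictor's VIF.
vif <- diag(solve(pred_cor))
vif
```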
We can also use this correlation matrix to determine which of our previous linear regression models has the best fit, and therefore which of our independent variables is best at predicting the number of runs a player will score. As we can see above, the largest correlation coefficient between an independent variable and our dependent variable is 0.6634, between SLG and runs. This tells us that, out of our three independent variables, slugging average is the best at predicting the number of runs that a player will score.
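The link between the correlation matrix and model fit can be made explicit: for a simple linear regression with one predictor, R-squared equals the squared correlation between the predictor and the response. The short sketch below applies that identity to the three correlations with runs printed above.

```r
# Correlations with runs, copied from the correlation matrix above.
r_with_runs <- c(OBP = 0.6102805, AVG = 0.5894908, SLG = 0.6634018)

# For a simple linear regression, R-squared is the squared correlation.
r_squared <- r_with_runs^2
round(r_squared, 3)

# Predictor with the largest share of variance explained.
names(which.max(r_squared))
```

Read this way, SLG alone accounts for roughly 44% of the variation in runs, compared with about 37% for OBP and about 35% for AVG.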
I began this project trying to answer the question of which hitting statistics have the strongest relationships with each other and which statistic is best at predicting the number of runs a player will score. I was most interested in the variables OBP, AVG, and SLG because these stats are rates and will not increase based on the number of at bats alone. By plotting my data using histograms and creating simple linear regression models of runs as a function of each of my independent variables, I have come to the conclusion that slugging average (SLG) was best at estimating the number of runs a player scores. I was able to verify this by creating a correlation matrix that measures how strong a linear relationship each of my variables has with the others, which confirmed that SLG gave the best-fitting simple model. From the matrix I was also able to conclude that a multivariable linear regression model would be of limited help, because our independent variables are too strongly related to one another.
If I were to continue research on this topic, there are a few things that I would implement. It would be interesting to see the relationship between where a player hits in the batting order (i.e. 1 - 9) and the number of runs they score. Most of the time (unless you hit a home run) your job as a hitter is to get as close to home plate as possible, but it is often in the hands of the players behind you whether or not you will actually score. Someone with better players hitting directly after them may score more runs. The second thing that I think would be interesting to test is the relationship between a player’s team and the number of runs they score. Some teams are better than others and score more runs, so it would be interesting to see how, or if, the team a player is on affects the number of runs they would be estimated to score. In order to do this, we would first need to scrape more data from the web that tells us each player’s average position in the lineup, and second, convert our current team variable from a categorical variable into a quantitative variable. This would allow us to test the relationships between all of the teams. There are 30 total teams, so I decided not to implement this technique in my project.
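As a rough sketch of how the team variable could enter a regression, R can expand a categorical factor into 0/1 indicator columns automatically. Everything in the example below is hypothetical: the data frame `toy_teams`, its team codes, and its values are made up for illustration and are not drawn from the scraped data.

```r
# Hypothetical rows; the team codes and run totals are illustrative only.
toy_teams <- data.frame(
  runs = c(80, 65, 40, 95, 55, 70),
  SLG  = c(0.480, 0.430, 0.380, 0.520, 0.410, 0.450),
  team = factor(c("LAD", "NYY", "BOS", "LAD", "NYY", "BOS"))
)

# model.matrix() expands the factor into 0/1 indicator columns,
# dropping one level (here the first alphabetically) as the baseline.
design <- model.matrix(runs ~ SLG + team, data = toy_teams)
colnames(design)

# lm() performs the same expansion internally when given a factor.
fit <- lm(runs ~ SLG + team, data = toy_teams)
coef(fit)
```

With all 30 teams, the same approach would produce 29 indicator columns, each measuring a team's difference from the baseline team.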