Introduction

Ever since high school, I have dreamed of being a baseball analyst. Movies like “Moneyball” showed me how important data is in Major League Baseball for putting together the best possible team while staying within budget. When all is said and done, the analysts who use baseball variables most effectively are able to field the rosters that generate the most wins, allowing their clubs to compete in the playoffs and potentially win the World Series. In this article, I want to analyze which important baseball variables have the strongest relationships with the number of wins for an MLB club.

I will be using two datasets to perform my analysis: a structured 2012 batting statistics dataset from Kaggle and an unstructured 2012 pitching statistics dataset from ESPN.

2012 Batting Statistics

library(tidyverse)
library(lubridate)
library(DT)
library(knitr)
library(XML)        
library(httr)       
library(rvest)

After the Data Preparation Stage, I’m left with 30 observations for 15 variables. Featured below, the results can be seen in a user-friendly datatable that an individual can manipulate. The batting statistics data is stored in a table where each observational row represents a single team’s statistics for the year 2012. The variables in the dataset include:

-Team
-League (National League or American League)
-Year
-RS (Total Runs Scored in 2012)
-RA (Total Runs Allowed in 2012)
-W (Total Wins in 2012)
-OBP (On Base Percentage)
-SLG (Slugging Percentage)
-BA (Batting Average)
-Playoffs (0 = no, 1 = yes)
-RankSeason (Rank heading into the playoffs)
-RankPlayoffs (Rank after completion of the playoffs)
-G (Total Games Played in 2012)
-OOBP (Opponent On Base Percentage)
-OSLG (Opponent Slugging Percentage)

Summarized Batting Results in A Data Table

Summary Information on Variables of Interest

I wanted to see the averages for Total Wins, On Base Percentage, Slugging Percentage, and Batting Average. I also wanted to see the difference between the maximum and minimum number of wins.

Avg_Wins Avg_OBP Avg_SLG Avg_BA Most_Wins Least_Wins
81 0.319 0.4052 0.2544 98 55

The average number of wins was 81. This makes sense: each team plays 162 games in a season, every game produces exactly one win and one loss, and all 30 teams are in this dataset, so the average has to be exactly half of 162. The average On Base Percentage, .319, is considered fair by typical baseball standards (an OBP of about .300 is considered average). The average Slugging Percentage, .405, is around the usual MLB average; a slugging percentage of .450 is considered good. The overall batting average, .254, is also considered average; a batting average above .300 is considered good. Lastly, the gap between the most wins (98) and the fewest wins (55) is substantial at 43 games. All in all, I was expecting the overall batting average to be closer to .300 and the minimum number of wins to be in the 60s.
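As a rough sketch, a summary table like the one above could be produced with dplyr's summarise(). The `batting` data frame, its team labels, and its values below are made up for illustration and are not the real 2012 figures; `Win_Gap` is an extra column added here to show the max-minus-min difference.

```r
library(dplyr)

# Hypothetical stand-in for the batting table (values are illustrative only).
batting <- tibble(
  Team = c("AAA", "BBB", "CCC", "DDD"),
  W    = c(98, 55, 88, 83),
  OBP  = c(0.335, 0.302, 0.325, 0.314),
  SLG  = c(0.430, 0.371, 0.415, 0.404),
  BA   = c(0.265, 0.236, 0.258, 0.259)
)

# One row of summary statistics across all teams.
summary_tbl <- batting %>%
  summarise(
    Avg_Wins   = mean(W),
    Avg_OBP    = mean(OBP),
    Avg_SLG    = mean(SLG),
    Avg_BA     = mean(BA),
    Most_Wins  = max(W),
    Least_Wins = min(W),
    Win_Gap    = max(W) - min(W)
  )
print(summary_tbl)

# Why the league-wide average is 81 when every team plays 162 games:
# each game yields one win, so there are 30 * 162 / 2 wins in total,
# shared among 30 teams.
league_avg_wins <- (30 * 162 / 2) / 30
```

The last line only restates the arithmetic: 2,430 total wins divided among 30 teams.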

Batting Statistics Analysis

For this part of my analysis in this article, I would like to address this question:

Which batting variables, from the year 2012, influenced the amount of wins the most in Major League Baseball?

I used five scatterplots to answer this question. Each one plots total wins against one of Batting Average, Slugging Percentage, On Base Percentage, Total Runs Scored, or Total Runs Allowed, to see which variables had the strongest relationships with wins. Each is a ggplot scatterplot with total wins as the Y variable and one of the five variables above as the X variable. I also added geom_smooth, both to make the scatterplots more aesthetically pleasing and to better assess the relationship between the two variables in each plot.
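A minimal sketch of one such scatterplot is shown here. The data values are invented for illustration, and `method = "lm"` is an assumption: the article does not say which smoother was passed to geom_smooth.

```r
library(ggplot2)

# Hypothetical stand-in data (not the real 2012 figures).
batting <- data.frame(
  W   = c(98, 94, 88, 81, 73, 66, 55),
  OBP = c(0.335, 0.330, 0.322, 0.318, 0.315, 0.309, 0.302)
)

# Total wins on the y-axis, one batting variable on the x-axis.
# method = "lm" draws a straight fitted line with a confidence band;
# this choice is an assumption, not necessarily what the article used.
p <- ggplot(batting, aes(x = OBP, y = W)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "On Base Percentage", y = "Total Wins")

print(p)
```

Swapping `OBP` for another column in `aes()` would give the other four plots.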

Analysis

Results: Going back to what I expected from the summary statistics above, I was quite surprised to find that batting average and total wins did not have a strong positive relationship. A number of teams with fewer than 80 wins had high batting averages. There is a respectable positive relationship between slugging percentage and total wins, although a number of teams performed extremely well, winning around 90 games, with relatively average slugging percentages of around .400. There is also a respectable positive relationship between on base percentage and total wins. Compared to slugging percentage, high-performing teams tended to have above-average to stellar on base percentages. Next, there was a clear positive relationship between total runs scored and total wins. This makes sense, since a team has to score a lot of runs to win consistently, but a number of outliers finished the season with an average number of runs scored (around 700) and still won a lot of games (at or near 90 wins). Lastly, there is a very strong negative relationship between total runs allowed and total wins. All in all, total runs allowed appears to have the strongest relationship with total wins: almost all of the high-performing teams were among those that gave up the fewest runs. Since total runs allowed is considered a pitching statistic, I decided to look further into this discovery by analyzing my 2012 pitching dataset from ESPN.
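The article judges these relationships by eye from the scatterplots. As a complement (not something the original analysis did), Pearson correlations put a rough number on each relationship. The data frame below is a made-up stand-in, so its correlations say nothing about the real 2012 season.

```r
# Hypothetical stand-in data (values are illustrative only).
batting <- data.frame(
  W   = c(98, 94, 88, 81, 73, 66, 55),
  BA  = c(0.265, 0.247, 0.274, 0.259, 0.268, 0.240, 0.236),
  SLG = c(0.430, 0.404, 0.422, 0.411, 0.395, 0.378, 0.371),
  OBP = c(0.335, 0.330, 0.322, 0.318, 0.315, 0.309, 0.302),
  RS  = c(731, 804, 765, 700, 716, 651, 583),
  RA  = c(600, 620, 650, 700, 720, 760, 794)
)

predictors <- c("BA", "SLG", "OBP", "RS", "RA")

# Correlation of each predictor with total wins.
cors <- sapply(batting[predictors], function(x) cor(x, batting$W))

# Sorted from most negative to most positive for easy comparison.
print(round(sort(cors), 2))
```

A value near -1 or 1 indicates a strong linear relationship; a value near 0 indicates a weak one.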

2012 Pitching Statistics

After the Data Preparation Stage, I’m left with 30 observations for 18 variables. Featured below, the results can be seen in a user-friendly datatable that an individual can manipulate. The pitching statistics data is stored in a table where each observational row represents a single team’s statistics for the year 2012. The variables in the dataset include:

-RK (Rank in terms of ERA)
-Team
-GP (Games Played)
-W (Wins)
-L (Losses)
-ERA (Earned Run Average)
-SV (Total Saves)
-CG (Total Complete Games)
-SHO (Total Shutouts)
-QS (Total Quality Starts)
-IP (Total Innings Pitched)
-H (Total Hits Allowed)
-ER (Total Earned Runs Allowed)
-HR (Total Home Runs Allowed)
-BB (Total Walks Allowed)
-SO (Total Strikeouts)
-OBA (Opponent Batting Average)
-WHIP (Walks Plus Hits Per Inning Pitched)
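The Data Preparation Stage for the pitching table is not shown in this article. As a rough sketch only, an HTML stats table can be turned into a data frame with rvest's html_table(). The HTML fragment, team labels, and numbers below are hypothetical; the real ESPN page markup and URL are not reproduced here.

```r
library(rvest)

# Hypothetical HTML fragment standing in for a scraped stats page.
html <- '
<table>
  <tr><th>RK</th><th>TEAM</th><th>W</th><th>ERA</th><th>WHIP</th></tr>
  <tr><td>1</td><td>Team A</td><td>90</td><td>3.19</td><td>1.17</td></tr>
  <tr><td>2</td><td>Team B</td><td>98</td><td>3.33</td><td>1.22</td></tr>
</table>'

# Parse the markup, pick the first <table>, and convert it to a data frame.
# The <th> row becomes the column names.
pitching <- read_html(html) %>%
  html_element("table") %>%
  html_table()

print(pitching)
```

With a live page, `read_html()` would take the page URL instead of a string, and the table selector would likely need adjusting to the page's actual structure.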

Summarized Results in A Data Table

2012 Pitching Statistics Analysis

For this part of my analysis, I wanted to take a look at a number of important pitching variables, notably Earned Run Average, Total Strikeouts, WHIP, Total Home Runs Allowed, and Total Saves. I selected these variables to look further into the idea above that pitching variables, like Total Runs Allowed, may have stronger relationships with Total Wins than batting variables.

I used five scatterplots once again, this time plotting total wins against each of the five pitching variables to see which had the strongest relationships. Each is a ggplot scatterplot with total wins as the Y variable and one of the five pitching variables as the X variable. I also added geom_smooth once again.

Analysis

Results: There is a strong negative relationship between earned run average and total wins. Almost all of the high-performing teams (around 90 wins) had an ERA of 4.00 or lower. There is a respectable relationship between total strikeouts and total wins; however, a few teams had fewer than 90 wins and still struck out 1,200 or more batters. There is another strong negative relationship between WHIP and total wins, with no high-performing outliers (90+ win teams) with high WHIPs. Next, there is an adequate negative relationship between total home runs allowed and total wins. There were three high-performing outliers (90+ wins) that allowed around 180 total home runs. Lastly, there was a fairly strong relationship between total saves and total wins. Eight teams with 90+ wins finished with 50 or more total saves. All in all, the Earned Run Average, WHIP, and Total Saves variables appear to have the strongest relationships with total wins. Looking back at the discovery from my first analysis, these three are pertinent variables in determining the total runs allowed for each team. To back up this idea and find out whether total runs allowed has the strongest relationship with total wins, I decided to create a regression with total wins as my dependent variable and the original five variables (including total runs allowed) from my batting statistics dataset as independent variables.

Regression Model

I created a regression model to investigate whether the total runs allowed variable has the most significant relationship with total wins.

I created a new dataframe from the 2012 batting statistics dataset, in which my included variables are as follows:

-W (Wins)
-BA_norm (Batting Average Normalized)
-SLG_norm (Slugging Percentage Normalized)
-OBP_norm (On Base Percentage Normalized)
-RS_norm (Total Runs Scored Normalized)
-RA_norm (Total Runs Allowed Normalized)

As you can see, I included Total Wins in this new dataset and normalized the independent variables (the five variables I used in my original 2012 batting statistics analysis) to put them on a common scale, which makes their coefficients easier to compare. Note that rescaling alone does not remove collinearity between related predictors such as BA, OBP, and SLG. I also chose not to remove any outliers, since the regression model only includes statistics from 2012, meaning there are only 30 rows (one for each team).
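A minimal sketch of how such a data frame and model could be built follows. It assumes min-max normalization, since the article does not say which method was used, and it uses only two of the five predictors and made-up values to keep the toy data small. The `min_max` helper is hypothetical.

```r
# Hypothetical stand-in data (not the real 2012 figures).
batting <- data.frame(
  W  = c(98, 94, 88, 81, 76, 73, 66, 55),
  RS = c(731, 804, 765, 700, 716, 651, 619, 583),
  RA = c(594, 668, 648, 690, 710, 733, 759, 794)
)

# Assumed normalization: rescale each variable to the [0, 1] range.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

norm_reg_df <- data.frame(
  W       = batting$W,
  RS_norm = min_max(batting$RS),
  RA_norm = min_max(batting$RA)
)

# Wins as the dependent variable, normalized predictors on the right.
fit <- lm(W ~ RS_norm + RA_norm, data = norm_reg_df)
print(summary(fit))

# Rescaling is a linear transformation, so the correlation between
# predictors is the same before and after normalizing.
cor_raw  <- cor(batting$RS, batting$RA)
cor_norm <- cor(norm_reg_df$RS_norm, norm_reg_df$RA_norm)
```

Passing `data = norm_reg_df` and using bare column names in the formula keeps the coefficient labels shorter than the `norm_reg_df$...` form.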

## 
## Call:
## lm(formula = norm_reg_df$W ~ norm_reg_df$BA_norm + norm_reg_df$SLG_norm + 
##     norm_reg_df$OBP_norm + norm_reg_df$RS_norm + norm_reg_df$RA_norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8963 -2.4916 -0.3297  1.1417 11.3799 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            75.919     33.740   2.250   0.0339 *  
## norm_reg_df$BA_norm    -3.652     41.425  -0.088   0.9305    
## norm_reg_df$SLG_norm   17.400     41.727   0.417   0.6804    
## norm_reg_df$OBP_norm   16.205     58.354   0.278   0.7836    
## norm_reg_df$RS_norm    68.507     33.500   2.045   0.0520 .  
## norm_reg_df$RA_norm  -103.899      9.024 -11.513 2.93e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.117 on 24 degrees of freedom
## Multiple R-squared:  0.9015, Adjusted R-squared:  0.881 
## F-statistic: 43.94 on 5 and 24 DF,  p-value: 2.617e-11

Results of Regression:

From the regression model, one can see that the Total Runs Allowed Normalized variable was the most significant, with a p-value of 2.93e-11 (the 2.617e-11 at the bottom of the output is the p-value of the overall F-test, not of this coefficient). At the 0.05 level of significance, Total Runs Allowed Normalized is the only significant predictor; the next closest is Total Runs Scored Normalized, with a p-value of 0.052.
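The summary output reports two kinds of p-values: one per coefficient (the Pr(>|t|) column) and one for the overall F-test on the last line. As a sketch on made-up data, each can be pulled out of the fitted model separately:

```r
# Hypothetical stand-in data (not the real 2012 figures).
batting <- data.frame(
  W  = c(98, 94, 88, 81, 76, 73, 66, 55),
  RS = c(731, 804, 765, 700, 716, 651, 619, 583),
  RA = c(594, 668, 648, 690, 710, 733, 759, 794)
)

fit <- lm(W ~ RS + RA, data = batting)
s   <- summary(fit)

# Per-coefficient p-values: the Pr(>|t|) column of the coefficient table.
coef_p <- coef(s)[, "Pr(>|t|)"]

# Overall model p-value: upper tail of the F distribution, computed from
# the F statistic and its two degrees of freedom.
f       <- s$fstatistic
model_p <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

print(coef_p)
print(model_p)
```

The per-coefficient values answer "does this predictor matter given the others?", while the F-test value answers "does the model as a whole explain wins better than an intercept alone?".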

All in all, based on the results of this regression, there seems to be some truth in my theory that the Total Runs Allowed variable has the strongest relationship with Total Wins.