Baseball Analysis using Sabermetrics

Introduction

A statistical analysis of baseball involves measuring individual, team, and league performance in two broad categories: offensive and defensive performance. Play-by-play data record each event that occurs within a game between two teams. The statistical summary of each game is recorded in a box score, which reports quantitative measures such as runs scored, hits, strikes, walks, outs, errors, and balls put into play. In addition, a box score records categorical information such as each team’s lineup (the sequential batting order of its players), the players’ positions, the names of the umpires and managers present, and the game’s date, to name a few. The crux of a box score is to record objective data by assigning a logical value (0 for false, 1 for true) to each event according to whether or not it occurred within a game. For example, if Noah Syndergaard, a starting pitcher for the New York Mets, were to strike out a batter in the first inning of a game, he would be credited with one “out” in the first inning. This report begins by aggregating these records, not only to identify career leaders on certain statistical measures, but also to assemble a team of statistically powerful players who are currently active in the Major Leagues (neither deceased nor retired) and to forecast their possible performance.

In 1985, Bill James pioneered the empirical analysis of baseball known as Sabermetrics in order to accurately measure the player performance that contributes to a team’s wins and losses. Unlike Sabermetrics, conventional statistics typically do not factor out variables that cannot logically be traced back to a particular player or to the position he plays; I explain both approaches later in this report. Luck, for example, is not an objective measure that can be easily quantified or logically supported without some subjectivity, yet many conventional statistics implicitly include it. To do away with such subjectivity, which can mislead one into believing a player is better or worse than he actually is, James and other Sabermetricians derived several improved formulas to portray a player’s performance more accurately. Using these Sabermetrics, this report breaks offensive and defensive performance down into a player’s ability to bat, pitch, and field balls put into play in order to either score runs or prevent them from being scored.

Discussion of Technologies, Datasets, and Literature

The play-by-play data come from a database of box scores collected from 1952 to 2015, the latest season documented by Retrosheet Inc., an open-source organization dedicated to collecting game accounts, unifying them into a user-friendly system, and recording that system in a computerized format. I restricted my analysis to the 40 seasons from 1976 to 2015, which I imported from MySQL, an open-source relational database management system. In conjunction with the Retrosheet database, I imported data sets on players, pitchers, and fielders from the Lahman Database, another open-source online archive of summarized baseball statistics, to assist in calculating statistics that would otherwise have been prone to human error.

Also, with the guidance of Joseph Adler’s Baseball Hacks: Tips and Tools for Analyzing and Winning with Statistics, I was able to derive the most up-to-date formulas for measuring batting, fielding, and pitching statistics, and to define variables that are commonly abbreviated in the industry. The Sabermetric statistics used in this report are categorized into batting, pitching, and fielding performance, and are discussed in that order.

Data Analysis

Batting Statistics

A conventional statistic like Batting Average is commonly used to express how often a player gets a hit per at bat. This simplistic ratio cannot, however, fairly compare a seasoned player who has had a large number of at bats with a less seasoned player whose number of at bats is significantly smaller. Batting Averages are usually reported alongside a player’s number of at bats to let fans gauge players somewhat more effectively, but this does not make the comparison statistically sound. For players to “qualify for titles such as the Highest Batting Average,” Major League Baseball requires that a “batter has on average 3.1 at bats per game,” that is, about “500 at bats” (ADLER, 333).

##   0%  25%  50%  75% 100% 
##    0  272  438  546  716
[Figure: histogram of at bats]

Limitation:

At first glance, the distribution of at bats is skewed to the left, reflecting a significant number of seasoned players relative to rookies in the Retrosheet database. Therefore, following Major League Baseball’s standard, I filtered the data set to include only batters whose total at bats are greater than or equal to 500. Although Batting Average is not the most reliable measure of a batter’s performance, it is easier to understand than most. Filtering out players with fewer than 500 at bats narrows the data set to players who have faced a more comparable number of situations, and thus makes comparisons between seasoned and less-seasoned players less error-prone.
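The filtering step and the Batting Average calculation can be sketched in a few lines. This is a minimal illustration in Python, not the code used to produce this report; the first two rows reuse the at-bat and hit totals listed for player "abreb001" in the data summary of this report, while the third row and all variable names are hypothetical.

```python
# Season lines as (player_id, at_bats, hits).
# Rows 1-2 use AB and H values printed for "abreb001" in this report;
# row 3 is a hypothetical short season added only to show the filter at work.
seasons = [
    ("abreb001", 572, 176),
    ("abreb001", 573, 146),
    ("rookie01", 120, 40),
]

MIN_AT_BATS = 500  # threshold applied in this report


def batting_average(hits, at_bats):
    """Batting Average: hits divided by at bats."""
    return hits / at_bats


# Keep only seasons that meet the at-bat threshold, then compute the averages.
qualified = [row for row in seasons if row[1] >= MIN_AT_BATS]
averages = [round(batting_average(h, ab), 3) for _, ab, h in qualified]
print(averages)
```

The hypothetical short season would post a .333 average on only 120 at bats, which is the kind of small-sample result the threshold is meant to keep out of the comparison.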

The idea of Sabermetrics is to capture an accurate estimate of a player’s performance. For batters, the prominent measurements are On-Base Percentage (OBP), Slugging (SLG), On-Base Plus Slugging (OPS), Isolated Power (ISO), Runs Created (RC), and Batting Average on Balls in Play (BABIP).

## 'data.frame':    3330 obs. of  31 variables:
##  $ BAT_ID     : chr  "abreb001" "abreb001" "abreb001" "abreb001" ...
##  $ YEAR_ID    : Date, format: "2002-01-01" "2010-01-01" ...
##  $ BAT_HAND_CD: Factor w/ 3 levels "Unknown","Left",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ firstBase  : num  100 84 110 118 117 105 117 117 106 88 ...
##  $ secondBase : num  50 41 40 29 35 41 39 35 37 30 ...
##  $ thirdBase  : num  6 1 5 3 1 2 4 11 1 1 ...
##  $ HR         : num  20 20 16 15 20 15 20 20 24 8 ...
##  $ K          : num  117 132 115 113 126 138 109 113 134 113 ...
##  $ H          : num  176 146 171 165 173 163 180 183 168 127 ...
##  $ AB         : num  572 573 605 563 577 548 609 546 588 502 ...
##  $ BB         : num  95 84 84 87 96 118 71 101 102 73 ...
##  $ CS         : num  6 6 3 10 0 3 3 2 2 9 ...
##  $ GIDP       : num  12 16 17 20 15 17 18 14 7 11 ...
##  $ HBP        : num  3 2 3 1 2 3 1 3 6 1 ...
##  $ IBB        : num  9 3 0 7 13 6 2 8 15 5 ...
##  $ PA         : num  686 669 702 672 696 687 685 664 720 589 ...
##  $ RBI        : num  85 78 101 103 101 107 100 93 102 60 ...
##  $ SF         : num  6 5 7 9 7 9 1 4 8 3 ...
##  $ SH         : num  0 0 0 0 0 2 0 0 0 1 ...
##  $ SB         : num  19 15 10 20 9 23 12 14 23 13 ...
##  $ TB         : num  298 249 269 245 270 253 287 300 279 183 ...
##  $ Avg        : num  0.308 0.255 0.283 0.293 0.3 0.297 0.296 0.335 0.286 0.253 ...
##  $ OBP        : num  40.5 34.9 36.9 38.3 39.7 ...
##  $ SLG        : num  0.521 0.435 0.445 0.435 0.468 0.462 0.471 0.549 0.474 0.365 ...
##  $ OPS        : num  0.926 0.784 0.814 0.818 0.865 0.881 0.84 0.988 0.866 0.712 ...
##  $ ISO        : num  0.213 0.18 0.162 0.142 0.168 ...
##  $ RC         : num  121.6 85.6 99.3 89.9 109.7 ...
##  $ BABIP      : num  0.354 0.296 0.322 0.338 0.349 0.366 0.333 0.391 0.329 0.31 ...
##  $ FIRST      : chr  "Bobby" "Bobby" "Bobby" "Bobby" ...
##  $ LAST       : chr  "Abreu" "Abreu" "Abreu" "Abreu" ...
##  $ DEBUT      : Date, format: "1996-09-01" "1996-09-01" ...

Bill James introduced Runs Created to estimate the number of runs an individual player contributes to his team. By combining the number of times a batter gets a hit, draws a walk, is hit by a pitch, is caught stealing, grounds into a double play, successfully steals a base, or makes a sacrifice hit or fly, the statistic gives a clearer picture of the number of runs a batter has created over his career.
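The components listed above correspond to what is commonly called the "technical" version of James’s Runs Created formula. The Python sketch below implements that commonly cited form as an assumption; this report does not state which variant produced its RC column, so the result is not expected to match the listed values exactly. The inputs are the first-row totals for "abreb001" from the data summary, and the function and parameter names are illustrative.

```python
def runs_created(h, bb, cs, hbp, gidp, tb, ibb, sh, sf, sb, ab):
    """Commonly cited 'technical' Runs Created (assumed variant):
    (on-base factor * advancement factor) / opportunities."""
    on_base = h + bb - cs + hbp - gidp
    advancement = tb + 0.26 * (bb - ibb + hbp) + 0.52 * (sh + sf + sb)
    opportunities = ab + bb + hbp + sh + sf
    return on_base * advancement / opportunities


# First-row season totals for "abreb001" as printed in this report.
rc = runs_created(h=176, bb=95, cs=6, hbp=3, gidp=12, tb=298,
                  ibb=9, sh=0, sf=6, sb=19, ab=572)
print(round(rc, 1))
```

Caught-stealing and double-play events reduce the on-base factor, while stolen bases and sacrifices add to the advancement factor at a discounted weight.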

[Figure: scatter plot of On-Base Percentage against Runs Created]

A batter’s On-Base Percentage measures the rate at which he reaches base successfully, excluding times he reaches on fielding errors, fielder’s choices, fielder’s obstruction, or catcher’s interference. Batters with high OBPs are often placed at the top of a team’s lineup as leadoff batters because they consistently get on base on their own and are therefore more often in a position to score than batters with lower OBPs. The above scatter plot visualizes the positive relationship between On-Base Percentage and the number of Runs Created across all players rather than for any particular player. As you can see, the higher a player’s On-Base Percentage, the greater his chance of reaching base and the more runs he is able to contribute.
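As a worked example, the standard On-Base Percentage formula can be applied to the first-row totals for "abreb001" from the data summary. This is a small Python sketch with illustrative names, not the report’s own code. Note that the data summary prints OBP on a 0 to 100 scale (40.5), whereas the function below returns a proportion.

```python
def on_base_percentage(h, bb, hbp, ab, sf):
    """OBP: times on base (hits, walks, hit by pitch) divided by
    at bats + walks + hit by pitch + sacrifice flies."""
    return (h + bb + hbp) / (ab + bb + hbp + sf)


# First-row season totals for "abreb001" as printed in this report.
obp = on_base_percentage(h=176, bb=95, hbp=3, ab=572, sf=6)
print(round(obp, 3))        # proportion
print(round(obp * 100, 1))  # same value on the 0-100 scale used in the summary
```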

[Figure: scatter plot of a batting statistic against Runs Created]

Slugging and On-Base Plus Slugging measure a batter’s power. A power hitter usually earns more extra bases, which is to say that players with higher Slugging or On-Base Plus Slugging percentages have a higher chance of hitting doubles, triples, and home runs. Whereas On-Base Percentage counts reaching first base the same as reaching second base, third base, or hitting a home run, Slugging puts a separate weight on each type of hit, and On-Base Plus Slugging adds the two measures together. A batter who hits a single is credited with one base. A batter who hits a double is credited with two bases, twice as many as for a single. A batter who hits a triple is credited with three bases, and a batter who hits a home run is credited with four.
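The weighting described above can be illustrated with a short Python sketch (illustrative names, not the report’s own code). It uses the first-row totals for "abreb001" from the data summary: 100 singles, 50 doubles, 6 triples, and 20 home runs in 572 at bats, with an on-base proportion of 274/676 computed from the same row.

```python
def total_bases(singles, doubles, triples, home_runs):
    """Total bases: each hit weighted by the number of bases it earns."""
    return 1 * singles + 2 * doubles + 3 * triples + 4 * home_runs


def slugging(tb, ab):
    """SLG: total bases per at bat."""
    return tb / ab


tb = total_bases(singles=100, doubles=50, triples=6, home_runs=20)
slg = slugging(tb, ab=572)

# OPS adds on-base proportion and slugging; 274/676 is (H+BB+HBP)/(AB+BB+HBP+SF)
# for the same season line.
ops = 274 / 676 + slg
print(tb, round(slg, 3), round(ops, 3))
```

These results agree with the TB, SLG, and OPS values listed for that row in the data summary.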

Similarly, the number of Runs Created increases as a batter’s Slugging power increases.

[Figure: scatter plot of a batting statistic against Runs Created]

A batter’s On-Base Plus Slugging is also positively related to the number of runs he creates.

[Figure: scatter plot of a batting statistic against Runs Created]

Isolated Power similarly measures a batter’s power, but a larger sample size is generally needed to calculate a reliable Isolated Power ratio. A batter’s Isolated Power is also a strong factor in determining the number of runs he creates.
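Isolated Power is conventionally defined as Slugging minus Batting Average, which is the same as counting only the extra bases earned per at bat. The Python sketch below (illustrative names, not the report’s own code) checks that both forms agree on the first-row totals for "abreb001" from the data summary.

```python
def isolated_power(tb, h, ab):
    """ISO as Slugging minus Batting Average: (TB - H) / AB."""
    return tb / ab - h / ab


def isolated_power_from_hits(doubles, triples, home_runs, ab):
    """Equivalent form: extra bases beyond first base, per at bat."""
    return (doubles + 2 * triples + 3 * home_runs) / ab


# First-row season totals for "abreb001" as printed in this report.
iso = isolated_power(tb=298, h=176, ab=572)
iso_alt = isolated_power_from_hits(doubles=50, triples=6, home_runs=20, ab=572)
print(round(iso, 3), round(iso_alt, 3))
```

Because singles contribute nothing to the numerator, a batter with many singles and few extra-base hits has a low ISO even when his Batting Average is high.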

[Figure: scatter plot of a batting statistic against Runs Created]

Lastly, the Batting Average on Balls in Play is a statistic used for both batters and pitchers. For batters in particular, it gauges how often a ball the batter puts into play falls safely for a hit. With that in mind, balls hit past the park’s foul lines (foul balls) and balls hit out of the park (home runs) are excluded.
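A common way to write this statistic removes home runs from the hits and removes strikeouts and home runs from the at bats, while adding sacrifice flies back in as balls put into play. The Python sketch below (illustrative names, not the report’s own code) applies that form to the first-row totals for "abreb001" from the data summary.

```python
def babip(h, hr, ab, k, sf):
    """BABIP: non-home-run hits divided by balls put into play
    (at bats minus strikeouts and home runs, plus sacrifice flies)."""
    return (h - hr) / (ab - k - hr + sf)


# First-row season totals for "abreb001" as printed in this report.
value = babip(h=176, hr=20, ab=572, k=117, sf=6)
print(round(value, 3))
```

The result agrees with the BABIP value listed for that row in the data summary.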

Linear Regression

## 
## Call:
## lm(formula = RC ~ BABIP + OBP + SLG + OPS + ISO + BAT_HAND_CD, 
##     data = Batting.active)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.4757  -4.6607  -0.1896   4.4377  24.8266 
## 
## Coefficients:
##                   Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)      -105.4708     1.4139 -74.597 < 0.0000000000000002 ***
## BABIP              13.7978     6.5879   2.094              0.03630 *  
## OBP                 2.3580     4.0335   0.585              0.55885    
## SLG               125.3305   403.3312   0.311              0.75602    
## OPS                61.0841   403.2334   0.151              0.87960    
## ISO                 3.2862     8.8716   0.370              0.71109    
## BAT_HAND_CDRight   -0.7232     0.2521  -2.868              0.00416 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.807 on 3323 degrees of freedom
## Multiple R-squared:  0.9093, Adjusted R-squared:  0.9091 
## F-statistic:  5550 on 6 and 3323 DF,  p-value: < 0.00000000000000022
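In the summary above, the SLG and OPS coefficients both have standard errors of about 403, far larger than the estimates themselves. One plausible explanation, offered here as an assumption rather than as a finding of this report, is that OPS is by definition the sum of OBP and SLG, so these predictors are almost perfectly linearly dependent. The Python sketch below checks that relationship on the first five values of each column as printed in the data summary, where OBP appears on a 0 to 100 scale.

```python
# First five printed values of each column from the data summary in this report.
obp_pct = [40.5, 34.9, 36.9, 38.3, 39.7]      # OBP on a 0-100 scale
slg = [0.521, 0.435, 0.445, 0.435, 0.468]
ops = [0.926, 0.784, 0.814, 0.818, 0.865]

TOLERANCE = 0.0015  # allows for rounding in the printed values

# If OPS = OBP + SLG holds row by row, OPS adds no information beyond the other two.
gaps = [abs(o - (p / 100 + s)) for o, p, s in zip(ops, obp_pct, slg)]
ops_is_redundant = all(gap < TOLERANCE for gap in gaps)
print(ops_is_redundant)
```

If the dependence holds across the full data set, the individual SLG and OPS coefficients cannot be estimated reliably, even though the overall fit (R-squared of 0.9093) is unaffected.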