A brief investigation of the distribution of OPS (on-base plus slugging) among MLB players, and whether it is approximately normal. See this discussion
We use 2012 batting data from the Lahman database. The approach is based on http://baseballwithr.wordpress.com/2013/12/18/regression-of-ops-stats/
library(Lahman)
library(plyr)
library(car)
##
## Attaching package: 'car'
##
## The following object is masked from 'package:Lahman':
##
## Salaries
Batting.12 <- subset(Batting, yearID==2012)
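# Collapse different stints for the same player
sum.function <- function(d){
  # Columns 6:23 are the counting stats G through GIDP in the Lahman version
  # used here; re-check these indices if the Batting column layout changes.
  d1 <- d[, 6:23]
  apply(d1, 2, sum)
}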
Batting.12 <- ddply(Batting.12, .(playerID, yearID), sum.function)
# Compute OPS
Batting.12$X1B <- with(Batting.12, H - X2B - X3B - HR)
Batting.12$SLG <- with(Batting.12,
                       (X1B + 2 * X2B + 3 * X3B + 4 * HR) / AB)
Batting.12$OBP <- with(Batting.12,
                       (H + BB + HBP) / (AB + BB + HBP + SF))
Batting.12$OPS <- with(Batting.12, SLG + OBP)
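The stint-collapsing step selects the columns to sum by position (`6:23`), which depends on the column layout of `Batting`. A minimal sketch of the same idea with columns chosen by name is shown here on a hypothetical toy data frame (`toy`, `stat.cols`, and `collapsed` are illustrative names, not part of the original analysis), using base R's `aggregate()` instead of `ddply()`:

```r
# Hypothetical toy data: player "a" has two stints, player "b" has one.
toy <- data.frame(
  playerID = c("a", "a", "b"),
  yearID   = c(2012, 2012, 2012),
  stint    = c(1, 2, 1),
  AB       = c(100, 50, 200),
  H        = c(30, 10, 60),
  stringsAsFactors = FALSE
)

# Columns to sum, chosen by name rather than by position.
stat.cols <- c("AB", "H")

# One row per player-season, with each counting stat summed over stints.
collapsed <- aggregate(toy[stat.cols],
                       by = list(playerID = toy$playerID, yearID = toy$yearID),
                       FUN = sum)
print(collapsed)
```

Selecting by name keeps the code valid if columns are added to or removed from the source table, at the cost of having to list the statistics explicitly.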
First take a quick look at the OPS for all players.
ops <- Batting.12$OPS
summary(ops)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.3 0.6 0.5 0.7 2.0 342
boxplot(ops)
scatterplot(Batting.12$AB, ops)
head(Batting.12[order(ops, decreasing=TRUE),])
## playerID yearID G G_batting AB R H X2B X3B HR RBI SB CS BB SO IBB
## 516 hernada01 2012 72 NA 1 0 1 0 0 0 0 0 0 0 0 0
## 1008 rodrian01 2012 1 NA 1 0 1 0 0 0 0 0 0 0 0 0
## 789 millesh01 2012 6 NA 3 0 2 1 0 0 0 0 0 0 0 0
## 595 johnsda06 2012 14 NA 22 8 8 1 0 3 6 0 0 9 3 1
## 1009 rodried04 2012 2 NA 5 1 1 0 0 1 1 0 0 2 3 0
## 871 ortegra01 2012 2 NA 4 0 2 0 0 0 0 1 0 1 2 0
## HBP SH SF GIDP X1B SLG OBP OPS
## 516 0 0 0 0 1 1.0000 1.0000 2.000
## 1008 0 0 0 0 1 1.0000 1.0000 2.000
## 789 0 0 0 0 1 1.0000 0.6667 1.667
## 595 0 0 0 0 4 0.8182 0.5484 1.367
## 1009 0 0 0 0 0 0.8000 0.4286 1.229
## 871 1 0 0 0 2 0.5000 0.6667 1.167
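As a sanity check on the formulas, the johnsda06 row can be reproduced by hand. The sketch below copies that player's counting stats from the table and applies the same SLG, OBP, and OPS definitions used earlier; the results should agree with the 0.8182, 0.5484, and 1.367 shown in the output.

```r
# Counting stats copied from the johnsda06 row in the output.
AB <- 22; H <- 8; X2B <- 1; X3B <- 0; HR <- 3
BB <- 9; HBP <- 0; SF <- 0

X1B <- H - X2B - X3B - HR                       # singles
SLG <- (X1B + 2 * X2B + 3 * X3B + 4 * HR) / AB  # total bases per at bat
OBP <- (H + BB + HBP) / (AB + BB + HBP + SF)    # times on base per opportunity
OPS <- SLG + OBP

round(c(X1B = X1B, SLG = SLG, OBP = OBP, OPS = OPS), 4)
```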
Clearly the high OPS outliers are caused by players with few at bats: the top two each have a single at bat, and none of the top six has more than 22. Based on the scatterplot of OPS vs AB, let’s choose 100 at bats as a threshold and look again.
Batsub <- subset(Batting.12, AB >= 100)
scatterplot(Batsub$AB, Batsub$OPS)
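One rough way to see why low-AB players produce extreme values is to look at how the spread of an observed rate shrinks with the number of trials. The sketch below treats each trial as an independent success or failure with a fixed probability `p`; this binomial model and the value `p = 0.32` are simplifying assumptions for illustration, not something estimated from the data.

```r
# Assumed "true" on-base probability, the same for every hypothetical player.
p <- 0.32

# Numbers of independent trials (a rough stand-in for at bats).
n <- c(1, 10, 100, 500)

# Standard deviation of an observed proportion under the binomial model.
sd.prop <- sqrt(p * (1 - p) / n)

data.frame(n = n, sd = round(sd.prop, 3))
```

Under these assumptions the spread falls in proportion to one over the square root of `n`, so a rate observed over a handful of at bats can land almost anywhere, while one observed over 100 or more is far more stable.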
This looks like a reasonable group of players to examine. Now take a look at the OPS probability distribution.
hist(Batsub$OPS, main="Histogram of OPS")
plot(density(Batsub$OPS), main="Probability Density of OPS")
That looks closer to normal than I expected. Let’s look at some diagnostics.
qqPlot(Batsub$OPS, id.n=5, labels=Batsub$playerID)
## vottojo01 ortizda01 morelbr01 cabremi01 braunry02
## 440 439 1 438 437
shapiro.test(Batsub$OPS)
##
## Shapiro-Wilk normality test
##
## data: Batsub$OPS
## W = 0.9976, p-value = 0.7902
library(nortest)
ad.test(Batsub$OPS)
##
## Anderson-Darling normality test
##
## data: Batsub$OPS
## A = 0.1895, p-value = 0.8996
library(fBasics)
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.0.3
## Loading required package: timeDate
## Loading required package: timeSeries
##
## Attaching package: 'fBasics'
##
## The following object is masked from 'package:car':
##
## densityPlot
##
## The following object is masked from 'package:base':
##
## norm
basicStats(Batsub$OPS)
## X..Batsub.OPS
## nobs 4.400e+02
## NAs 0.000e+00
## Minimum 4.197e-01
## Maximum 1.041e+00
## 1. Quartile 6.474e-01
## 3. Quartile 7.865e-01
## Mean 7.177e-01
## Median 7.184e-01
## Sum 3.158e+02
## SE Mean 5.112e-03
## LCL Mean 7.076e-01
## UCL Mean 7.277e-01
## Variance 1.150e-02
## Stdev 1.072e-01
## Skewness 2.515e-02
## Kurtosis 1.509e-02
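For a normal distribution both skewness and excess kurtosis are zero, so values near zero in the last two rows are consistent with normality (assuming the `Kurtosis` row reports excess kurtosis). The sketch below computes one common moment-based version of these statistics in base R; `moment.shape` is a hypothetical helper, and `basicStats` may use a slightly different normalisation, so small numerical differences would not be surprising.

```r
# Moment-based sample skewness and excess kurtosis (dividing by n throughout).
moment.shape <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m4 <- mean(d^4)
  c(skewness = m3 / m2^1.5,
    excess.kurtosis = m4 / m2^2 - 3)
}

# A small symmetric vector: skewness is zero by symmetry.
moment.shape(c(1, 2, 3, 4, 5))
```

With the data from this analysis loaded, `moment.shape(Batsub$OPS)` could be compared against the `basicStats` rows.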
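That looks close to normal to me, and neither normality test rejects the hypothesis of normality (Shapiro-Wilk p = 0.79, Anderson-Darling p = 0.90). Failing to reject does not prove that OPS is normally distributed; it only means these 440 players show no detectable departure from normality. See http://en.wikipedia.org/wiki/Shapiro-Wilk_test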
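To get a feel for what a rejection would look like at this sample size, the sketch below runs `shapiro.test()` on simulated data: one sample drawn from a normal distribution and one from an exponential distribution, which is strongly right-skewed. The seed and the choice of the exponential as the non-normal case are arbitrary assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n <- 440                      # same sample size as Batsub
normal.draws <- rnorm(n)      # drawn from a normal distribution
skewed.draws <- rexp(n)       # strongly right-skewed, clearly non-normal

p.normal <- shapiro.test(normal.draws)$p.value
p.skewed <- shapiro.test(skewed.draws)$p.value

c(normal = p.normal, skewed = p.skewed)
```

The skewed sample should give a very small p-value, which suggests that a departure of that size in the OPS data would have been picked up with 440 observations.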