A brief investigation of how OPS (on-base plus slugging) is distributed among MLB players, and whether that distribution is normal. See this discussion

We use 2012 batting data from the Lahman database. This analysis is based on http://baseballwithr.wordpress.com/2013/12/18/regression-of-ops-stats/

library(Lahman)
library(plyr)
library(car)
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:Lahman':
## 
##     Salaries
Batting.12 <- subset(Batting, yearID==2012)

# Collapse different stints for the same player
sum.function <- function(d){
  d1 <- d[, 6:23] # RES: should double check this; 6:23 is G through GIDP in the Lahman release used here
  apply(d1, 2, sum)
}
Batting.12 <- ddply(Batting.12, .(playerID, yearID), sum.function)
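
The stint-collapsing step can be illustrated on a tiny made-up data frame. This is only a sketch: the `toy` data and its values are hypothetical, and it uses base R's `aggregate` rather than the `ddply` call above.

```r
# Hypothetical toy data: player "a" changed teams mid-season and has two stints.
toy <- data.frame(
  playerID = c("a", "a", "b"),
  yearID   = c(2012, 2012, 2012),
  stint    = c(1, 2, 1),
  AB       = c(100, 50, 200),
  H        = c(30, 10, 60)
)

# Sum the counting stats within each player-season, giving one row per player.
collapsed <- aggregate(cbind(AB, H) ~ playerID + yearID, data = toy, FUN = sum)
print(collapsed)
```

Note that the formula interface of `aggregate` drops rows with missing values by default, so it is not a drop-in replacement when the counting columns contain NAs.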

# Compute OPS
Batting.12$X1B <- with(Batting.12, H - X2B - X3B - HR)
Batting.12$SLG <- with(Batting.12,
                         (X1B + 2 * X2B + 3 * X3B + 4 * HR) / AB)
Batting.12$OBP <- with(Batting.12,
                         (H + BB + HBP) / (AB + BB + HBP + SF))
Batting.12$OPS <- with(Batting.12, SLG + OBP)
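
As a sanity check on the formulas, here is a small worked example for a single player. It is a sketch that copies the counting stats for johnsda06 from the output table printed later in this post; the list `p` is just an illustrative container.

```r
# Counting stats for johnsda06 (2012), copied from the output table in this post.
p <- list(AB = 22, H = 8, X2B = 1, X3B = 0, HR = 3, BB = 9, HBP = 0, SF = 0)

X1B <- p$H - p$X2B - p$X3B - p$HR                           # singles
SLG <- (X1B + 2 * p$X2B + 3 * p$X3B + 4 * p$HR) / p$AB      # total bases per at-bat
OBP <- (p$H + p$BB + p$HBP) / (p$AB + p$BB + p$HBP + p$SF)  # times on base per opportunity
OPS <- SLG + OBP

print(round(c(SLG = SLG, OBP = OBP, OPS = OPS), 4))
```

The results agree, up to rounding, with the SLG, OBP and OPS values shown for this player in the table. With AB = 0 the SLG ratio becomes 0/0, which R evaluates to NaN; that is presumably where the NA's in the OPS summary come from.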

First take a quick look at the OPS for all players.

ops <- Batting.12$OPS
summary(ops)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0     0.3     0.6     0.5     0.7     2.0     342
boxplot(ops)

[Figure: boxplot of OPS for all 2012 batters]

scatterplot(Batting.12$AB, ops)

[Figure: scatterplot of OPS against AB for all 2012 batters]

head(Batting.12[order(ops, decreasing=TRUE),])
##       playerID yearID  G G_batting AB R H X2B X3B HR RBI SB CS BB SO IBB
## 516  hernada01   2012 72        NA  1 0 1   0   0  0   0  0  0  0  0   0
## 1008 rodrian01   2012  1        NA  1 0 1   0   0  0   0  0  0  0  0   0
## 789  millesh01   2012  6        NA  3 0 2   1   0  0   0  0  0  0  0   0
## 595  johnsda06   2012 14        NA 22 8 8   1   0  3   6  0  0  9  3   1
## 1009 rodried04   2012  2        NA  5 1 1   0   0  1   1  0  0  2  3   0
## 871  ortegra01   2012  2        NA  4 0 2   0   0  0   0  1  0  1  2   0
##      HBP SH SF GIDP X1B    SLG    OBP   OPS
## 516    0  0  0    0   1 1.0000 1.0000 2.000
## 1008   0  0  0    0   1 1.0000 1.0000 2.000
## 789    0  0  0    0   1 1.0000 0.6667 1.667
## 595    0  0  0    0   4 0.8182 0.5484 1.367
## 1009   0  0  0    0   0 0.8000 0.4286 1.229
## 871    1  0  0    0   2 0.5000 0.6667 1.167

Clearly the high OPS outliers are caused by players with few at-bats: the two players at 2.000 each went 1-for-1. Based on the scatterplot of OPS vs. AB, let’s choose 100 at-bats as a threshold and look again.
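
To see why a handful of at-bats produces extreme values, here is a minimal simulation sketch. It assumes every at-bat is an independent draw with fixed outcome probabilities; the probabilities in `probs` are hypothetical and not estimated from the Lahman data, and only the SLG half of OPS is simulated to keep it short.

```r
set.seed(1)

# Hypothetical per-at-bat outcome probabilities (illustrative, not fitted to data).
bases <- c(out = 0, single = 1, double = 2, triple = 3, homer = 4)
probs <- c(0.74, 0.17, 0.05, 0.005, 0.035)

# Simulate SLG (total bases per at-bat) for one batter with `ab` at-bats.
sim_slg <- function(ab) {
  mean(sample(bases, size = ab, replace = TRUE, prob = probs))
}

# Standard deviation of SLG across 2000 simulated batters at each at-bat count.
ab_levels <- c(5, 100, 500)
spread <- sapply(ab_levels, function(ab) sd(replicate(2000, sim_slg(ab))))
names(spread) <- ab_levels
print(round(spread, 3))
```

Every simulated batter has the same underlying ability, so any spread is pure sampling noise. The spread shrinks as the at-bat count grows, which is the reason for filtering on a minimum AB before looking at the shape of the distribution.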

Batsub <- subset(Batting.12, AB >= 100)
scatterplot(Batsub$AB, Batsub$OPS)

[Figure: scatterplot of OPS against AB for players with at least 100 at-bats]

This looks like a reasonable group of players to examine. Now take a look at the OPS probability distribution.

hist(Batsub$OPS, main="Histogram of OPS")

[Figure: histogram of OPS]

plot(density(Batsub$OPS), main="Probability Density of OPS")

[Figure: probability density of OPS]

That looks closer to normal than I expected. Let’s look at some diagnostics.

qqPlot(Batsub$OPS, id.n=5, labels=Batsub$playerID)

[Figure: normal Q-Q plot of OPS with the five most extreme players labeled]

## vottojo01 ortizda01 morelbr01 cabremi01 braunry02 
##       440       439         1       438       437
shapiro.test(Batsub$OPS)
## 
##  Shapiro-Wilk normality test
## 
## data:  Batsub$OPS
## W = 0.9976, p-value = 0.7902
library(nortest)
ad.test(Batsub$OPS)
## 
##  Anderson-Darling normality test
## 
## data:  Batsub$OPS
## A = 0.1895, p-value = 0.8996
library(fBasics)
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.0.3
## Loading required package: timeDate
## Loading required package: timeSeries
## 
## Attaching package: 'fBasics'
## 
## The following object is masked from 'package:car':
## 
##     densityPlot
## 
## The following object is masked from 'package:base':
## 
##     norm
basicStats(Batsub$OPS)
##             X..Batsub.OPS
## nobs            4.400e+02
## NAs             0.000e+00
## Minimum         4.197e-01
## Maximum         1.041e+00
## 1. Quartile     6.474e-01
## 3. Quartile     7.865e-01
## Mean            7.177e-01
## Median          7.184e-01
## Sum             3.158e+02
## SE Mean         5.112e-03
## LCL Mean        7.076e-01
## UCL Mean        7.277e-01
## Variance        1.150e-02
## Stdev           1.072e-01
## Skewness        2.515e-02
## Kurtosis        1.509e-02
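
The Skewness and Kurtosis rows are the ones most relevant to normality. The sketch below computes both by hand using one common moment-based definition; fBasics may normalise slightly differently, and it is an assumption here that its Kurtosis row is excess kurtosis (kurtosis minus 3), which the near-zero value suggests. The two samples are simulated stand-ins, not the OPS data.

```r
# Moment-based skewness and excess kurtosis (population-moment normalisation).
skew <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}
excess_kurt <- function(x) {
  z <- x - mean(x)
  mean(z^4) / mean(z^2)^2 - 3
}

set.seed(42)
normal_like  <- rnorm(440, mean = 0.72, sd = 0.11)  # hypothetical stand-in for the OPS sample
right_skewed <- rexp(440)                           # deliberately non-normal comparison

print(round(c(skew = skew(normal_like), kurt = excess_kurt(normal_like)), 2))
print(round(c(skew = skew(right_skewed), kurt = excess_kurt(right_skewed)), 2))
```

Both statistics are zero for an exact normal distribution, so values near zero are what a normal sample should give, while the exponential sample shows clearly positive skewness.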

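That looks close to normal to me, and neither normality test rejects the hypothesis of normality (Shapiro-Wilk p = 0.79, Anderson-Darling p = 0.90). The skewness and kurtosis reported by basicStats are also both close to zero. See http://en.wikipedia.org/wiki/Shapiro-Wilk_test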
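
For a feel of how the Shapiro-Wilk test behaves at this sample size, here is a small sketch on simulated data. The normal sample borrows the mean and standard deviation from the basicStats output; the exponential sample is a hypothetical, clearly non-normal comparison. Neither is the real OPS data.

```r
set.seed(2012)
n <- 440  # same sample size as Batsub in this post

normal_sample <- rnorm(n, mean = 0.7177, sd = 0.1072)  # mean and sd taken from basicStats
skewed_sample <- rexp(n)                                # clearly non-normal comparison

p_normal <- shapiro.test(normal_sample)$p.value
p_skewed <- shapiro.test(skewed_sample)$p.value

print(c(normal = p_normal, skewed = p_skewed))
```

With 440 observations the test rejects the exponential sample decisively, so it does have power against strongly skewed alternatives at this sample size. A truly normal sample is still rejected at the 0.05 level about 5% of the time, so the p-value for `normal_sample` depends on the seed.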