I have often wondered whether certain players have an advantage if they bat from one side of the plate and throw with the opposite hand.
This curiosity springs from a comment my son’s golf instructor made about left-handedness. My son was left-handed, but the pro said, “Golf is a left-handed game, so he would be better off playing right-handed.”
This made me wonder about baseball. The right-handed batter faces a pitcher seeing primarily with his left eye (most right-handed people are right-eyed). This seems like a major disadvantage; you are using your weaker eye for the most important event in baseball.
The purpose of this study is to determine whether players who use opposite hitting-throwing configurations, i.e. bats left - throws right or bats right - throws left, have a statistically valid advantage over conventional players. In the process of this investigation, we will explore:
Fortunately, the Kaggle dataset called “The History of Baseball” has most of what we need to investigate these questions.
#library (data.table)
library(ggplot2) # Data visualization
library(dplyr)
library(knitr)
#
##setwd("~/GitHub/Baseball") # for local processing
#
##Load the Baseball Datasets
#
players = read.csv("input/player.csv")
batting = read.csv("input/batting.csv")
pitchers = read.csv ("input/pitching.csv")
pitcherid = unique(pitchers$player_id)
nonPitchers = players[!(players$player_id %in% (pitcherid)) ,]
#dim(nonPitchers)
nonPitchers = nonPitchers %>% filter(throws != "");
nonPitchers = nonPitchers %>% filter(bats != "");
#dim(nonPitchers)
nonPitcherID = nonPitchers$player_id
Note that we loaded the pitcher data too. The reason will become obvious as we explore the data.
Our first task is to do a little cleanup of the data. Because batting average is hits divided by at-bats, we eliminate the records where at-bats is NA or zero. We also eliminate records where there is no record of how the player bats or throws. This information has only been kept since around 1915.
We also merge the hitters with the non-pitchers, which we will need later. As an aside, this eliminates perhaps one of the greatest hitters ever, Babe Ruth, who spent the first portion of his career as a pitcher with the Red Sox. He is an outlier. Elimination of a single player will not alter the statistical validity of our results.
battingwoNA = batting[is.na(batting$ab)==FALSE,]
battingwoNA = battingwoNA[battingwoNA$ab>0,]
playerBatsThrows = players %>% filter(throws != "");
playerBatsThrows = playerBatsThrows %>% filter(bats != "");
# These two statements first merge the player list with
# the batting information without NAs.
# Then create the hitters dataset by merging the
# batters without NAs with the non-pitchers.
player.batting = merge(playerBatsThrows,battingwoNA)
hitters= merge(nonPitchers,battingwoNA)
dim(player.batting)
## [1] 84236 45
dim(hitters)
## [1] 51066 45
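Called without a `by` argument, `merge` joins on the column names the two data frames share and keeps only rows that have a match in both, which is an inner join. A minimal sketch on made-up rows (the toy IDs and counts are illustrative assumptions, not values from the dataset; in the real files the shared column appears to be `player_id`):

```r
# Toy stand-ins for the player and batting tables (hypothetical values).
toy_players <- data.frame(
  player_id = c("a01", "b02", "c03"),
  bats      = c("L", "R", "B"),
  throws    = c("R", "R", "L")
)
toy_batting <- data.frame(
  player_id = c("a01", "a01", "b02", "z99"),
  year      = c(1950, 1951, 1950, 1950),
  ab        = c(400, 350, 200, 10),
  h         = c(120, 98, 50, 1)
)

# With no `by`, merge() joins on the shared column name (player_id)
# and drops rows without a partner: c03 has no batting rows and
# z99 has no player row.
toy_joined <- merge(toy_players, toy_batting)
print(toy_joined)
```

This is why players removed from the player list (for example, those with blank `bats` or `throws`) also disappear from the merged batting records.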
Because the metric we are using to evaluate hitting effectiveness is batting average, let’s calculate it over the data by year to see if there is anything interesting.
x<-player.batting %>% group_by(year) %>%
mutate(season_total_ab=sum(ab)) %>%
mutate(season_total_h=sum(h)) %>%
mutate (battingAverage = season_total_h/season_total_ab) %>%
ungroup()
plot (x=x$year,y=x$battingAverage,xlab = "Year",ylab="Batting Averages",main="Yearly Batting Averages")
There is an interesting point in the late 1960s where the composite batting average was much lower. Let’s zoom in on that.
unique(x[x$battingAverage<0.238,]$year)
## [1] 1968
The year was 1968. The graph seems to indicate that averages had been steadily declining for many years. That year was called “The Year of the Pitcher.” Further research shows that baseball officials were concerned about this and introduced major rule changes to improve the chances of the batter, such as lowering the pitching mound and shrinking the strike zone. See https://en.wikipedia.org/wiki/1968_Major_League_Baseball_season for more information.
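The low season can also be located without a hand-picked cutoff such as 0.238. The following is a sketch on made-up season records, assuming one row per year is wanted; it uses `summarise` instead of `mutate` so each year appears once, then picks the minimum with `which.min`:

```r
library(dplyr)

# Toy season records (hypothetical values, not from the dataset).
toy <- data.frame(
  year = c(1967, 1967, 1968, 1968, 1969, 1969),
  ab   = c(500, 400, 500, 500, 450, 550),
  h    = c(125, 96, 115, 120, 120, 140)
)

# One row per year: pooled hits divided by pooled at-bats.
yearly <- toy %>%
  group_by(year) %>%
  summarise(season_total_ab = sum(ab),
            season_total_h  = sum(h),
            .groups = "drop") %>%
  mutate(battingAverage = season_total_h / season_total_ab)

# The lowest season, found without a hard-coded threshold.
lowest <- yearly[which.min(yearly$battingAverage), ]
print(lowest)
```

On the real data, the same pattern applied to `player.batting` should return the single season with the lowest composite average.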
Now we look at the descriptive statistics for the various combinations of hitting and throwing. There are three states for hitting, L, R, and B (for switch hitting), and two for throwing, L and R. Apparently there have been no ambidextrous major league players.
Note that this approach uses the computed batting average by year for convenience, which makes some of the calculations easier to understand. Another method, used later, computes weighted batting averages over the entire population.
First, we develop a function to do this calculation repeatedly.
computeBattingAverages = function (df) {
#first we need to introduce bats and throws into the groupings
#This calculates the batting average for group
x<-df %>% group_by(bats,throws) %>%
mutate(total_ab=sum(ab)) %>%
mutate(total_h=sum(h)) %>%
mutate (battingAverage = total_h/total_ab) %>%
ungroup()
# quick look at the numbers
kable(table(x$bats,x$throws))
rr = x[x$bats=="R" & x$throws=="R",]
rl = x[x$bats=="R" & x$throws=="L",]
lr = x[x$bats=="L" & x$throws=="R",]
ll = x[x$bats=="L" & x$throws=="L",]
br = x[x$bats=="B" & x$throws=="R",]
bl = x[x$bats=="B" & x$throws=="L",]
averages = NULL
BatsThrows = c("RR","RL","LR","LL","BR","BL")
Average = c(round(mean(rr$battingAverage,na.rm=T),3),
round(mean(rl$battingAverage,na.rm=T),3),
round(mean(lr$battingAverage,na.rm=T),3),
round(mean(ll$battingAverage,na.rm=T),3),
round(mean(br$battingAverage,na.rm=T),3),
round(mean(bl$battingAverage,na.rm=T),3)
)
averages = data.frame(BatsThrows,Average)
kable(averages,caption = "Batting Averages by Hits-Throws")
averages
}
avg = computeBattingAverages(player.batting)
kable(avg,caption = "Batting Averages by Hits-Throws for All Hitters")
| BatsThrows | Average |
|---|---|
| RR | 0.257 |
| RL | 0.223 |
| LR | 0.273 |
| LL | 0.272 |
| BR | 0.261 |
| BL | 0.248 |
plot(avg$BatsThrows,avg$Average,main="Batting Averages by Hits-Throws - All Hitters")
The result is somewhat unexpected, given the original question. The batting average for players who bat right and throw left is much lower than all the others.
This is most likely due to the fact that in the original data, pitchers were included as hitters. There is probably a disproportionate number of left-handed pitchers which may be biasing the result. Remembering our original discussion about datasets, we now know why the pitcher dataset might be needed.
The other odd fact is that players who bat left (both left and right-handed throwers) have the highest overall average. Why?
We have already established the non-pitchers in a separate dataset. The use of an inner join established the batting records for all non-pitchers.
We look first at the overall average for all hitters in the hitters dataset (excluding pitchers).
sum(hitters$h)/sum(hitters$ab)
## [1] 0.2671092
avg = computeBattingAverages(hitters)
kable(avg,caption = "Batting Averages by Hits-Throws - Hitters Only (No Pitchers)")
| BatsThrows | Average |
|---|---|
| RR | 0.264 |
| RL | 0.262 |
| LR | 0.273 |
| LL | 0.277 |
| BR | 0.264 |
| BL | 0.271 |
plot(avg$BatsThrows,avg$Average,main="Batting Averages by Hits-Throws - Hitters Only (No Pitchers)")
This set of averages seems to more reasonably represent the profile of major league batters. The assumption is that there are probably proportionally more left-hand pitchers, which would account for the lower average of left-hand throwers and right-hitting batters.
However, it does not seem to support our original question, whether players who bat on one side and throw from the other are better hitters.
It does surface another question: are left-hitting batters better hitters than right-hitting batters? We will examine this more formally with the z statistic, which is used to test whether the difference between two population proportions is equal to zero.
The assumption is that there are two populations, left and right-hitting batters.
Our null hypothesis H0 is that the two population batting averages are equal, i.e. mu(right) = mu(left).
The alternative hypothesis H1 is that mu(right) < mu(left) [the left hitting players are better hitters].
The prop.test function in R computes a Chi-square test. We use a 1% significance level. The critical value for the Chi-square statistic is 6.64 with 1 degree of freedom.
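The critical value can be read from R rather than from a printed table. A small sketch; the one-sided line is an added illustration (it assumes the usual relation that a 1-degree-of-freedom Chi-square variable is a squared standard normal) and is not part of the original analysis:

```r
# Upper 1% critical value of the chi-square distribution with 1 df.
crit_chisq <- qchisq(0.99, df = 1)

# For a one-sided test at the same level, the matching cutoff comes from
# the standard normal quantile, squared to put it on the chi-square scale.
crit_one_sided <- qnorm(0.99)^2

print(round(c(chisq_1df = crit_chisq, one_sided = crit_one_sided), 3))
```

The first value is about 6.635, which corresponds to the 6.64 quoted in the text. The one-sided cutoff is smaller, so using 6.64 with a one-sided alternative is the more conservative choice.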
An alternative might be to use weighted averages to normalize players’ averages. This would use weighted means and variances, in which averages based on more at-bats contribute more. The at-bat information will weight the averages correctly. Fortunately, R has a nice library called SDMTools which handles weighted means and variances. The R t.test function will not work in this problem, because the calculations must be weighted to compute correct means and variances.
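To see why the weighting matters, here is a minimal sketch on made-up seasons. It swaps in base R's `weighted.mean` in place of SDMTools, and the at-bat and hit counts are hypothetical:

```r
# Toy seasons for one group of hitters (hypothetical values).
ab <- c(600, 300, 10)
h  <- c(180, 75, 5)
ba <- h / ab                        # per-season averages: 0.300, 0.250, 0.500

unweighted <- mean(ba)                   # every season counts equally
weighted   <- weighted.mean(ba, w = ab)  # seasons count in proportion to at-bats
pooled     <- sum(h) / sum(ab)           # total hits over total at-bats

print(round(c(unweighted = unweighted, weighted = weighted, pooled = pooled), 3))
```

The at-bat-weighted mean equals the pooled average, while the unweighted mean is pulled upward by the 10 at-bat season. That is the distortion the weighting is meant to remove.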
First let’s split the data into two populations, left and right-hitting batters.
library(SDMTools)
hitters$ba = hitters$h/hitters$ab
lefties = hitters[hitters$bats=="L",]
righties = hitters[hitters$bats=="R",]
switch = hitters[hitters$bats=="B",]
n1 = nrow(lefties)
n2 = nrow(righties)
bnum = nrow(switch)
n1
## [1] 15875
n2
## [1] 30188
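There are 15875 season records for players who bat left, 30188 for players who bat right, and 5003 for players who bat both. These are rows in the hitters dataset, so a player appears once for each season record.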
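To count distinct players rather than season rows, one option is to count unique `player_id` values per batting side. A sketch on made-up rows, assuming `player_id` uniquely identifies a player:

```r
# Toy season records (hypothetical): two left-hitting players, one right.
toy_hitters <- data.frame(
  player_id = c("a01", "a01", "a01", "b02", "c03", "c03"),
  bats      = c("L", "L", "L", "R", "L", "L")
)

# Rows per batting side (season records) ...
records_per_side <- table(toy_hitters$bats)

# ... versus distinct players per batting side.
players_per_side <- tapply(toy_hitters$player_id, toy_hitters$bats,
                           function(ids) length(unique(ids)))

print(records_per_side)
print(players_per_side)
```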
Next we use the proportional test in R to compare the two proportions, where the proportions are the respective batting averages of left and right hitting players.
left_ab = sum(lefties$ab)
left_hits = sum(lefties$h)
right_ab = sum(righties$ab)
right_hits = sum(righties$h)
z =prop.test (c(left_hits,right_hits),c(left_ab,right_ab),p = NULL,alternative = "greater")
z
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(left_hits, right_hits) out of c(left_ab, right_ab)
## X-squared = 1512.1, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.0104844 1.0000000
## sample estimates:
## prop 1 prop 2
## 0.2744767 0.2635273
Chi-square has a value of 1512, which is much greater than the critical value 6.64, so we reject the null hypothesis and conclude that the batting averages are higher for left-hitting hitters. This seems obvious given the difference of more than 10 points (0.2745 versus 0.2635) in batting averages.
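The link between the z statistic and the X-squared value that `prop.test` reports can be checked by hand. The sketch below uses made-up counts and turns off the continuity correction (`correct = FALSE`) so the two agree exactly; the calls in this study use the default correction, so their X-squared values are slightly smaller than the squared z:

```r
# Toy counts (hypothetical): hits and at-bats for two groups.
hits <- c(280, 260)
ab   <- c(1000, 1000)

p      <- hits / ab                   # group batting averages
p_pool <- sum(hits) / sum(ab)         # pooled average under H0
se     <- sqrt(p_pool * (1 - p_pool) * sum(1 / ab))
z_stat <- (p[1] - p[2]) / se          # two-proportion z statistic

# prop.test without the continuity correction reports z squared.
res <- prop.test(hits, ab, alternative = "greater", correct = FALSE)
print(c(z = z_stat, z_squared = z_stat^2, X_squared = unname(res$statistic)))
```

With the one-sided alternative, the p-value is the upper tail of the standard normal at z, which is why a large X-squared together with a positive difference leads to rejecting H0.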
When we look at the histogram plots of the batting averages and at-bats, one thing jumps out. There are large numbers of averages, over 3000, that are zero. Could those be biasing our proportion results?
hist(hitters$ba,breaks=50,main="All Hitters' Batting Averages",xlab = "Seasonal Batting Average")
hist(hitters$ab,breaks=50,main="All Hitters' At-Bats",xlab = "Average At-Bats")
We also eliminate the utility players. Our definition of a full-time player is someone with at least 150 at-bats in a season.
First regenerate our two populations, eliminating the 0-hit seasons. This should eliminate the concentration of entries at 0.
hitters1 = hitters[hitters$h>0 & hitters$ab >=150,]
#dim(hitters1)
lefties = hitters1[hitters1$bats=="L",]
righties = hitters1[hitters1$bats=="R",]
oppositeLR = hitters1[hitters1$bats=="L" & hitters1$throws=="R",]
oppositeRL = hitters1[hitters1$bats=="R" & hitters1$throws=="L",]
sameLL = hitters1[hitters1$bats=="L" & hitters1$throws=="L",]
sameRR = hitters1[hitters1$bats=="R" & hitters1$throws=="R",]
batsThrowsOpposite = rbind(oppositeLR,oppositeRL)
batsThrowsSame = rbind (sameLL,sameRR)
OppositeBA = sum(batsThrowsOpposite$h)/sum(batsThrowsOpposite$ab)
SameBA = sum(batsThrowsSame$h)/sum(batsThrowsSame$ab)
hist(hitters1$ba,breaks=50, main="All Full-Time Hitters' Batting Averages ",xlab = "Season Batting Average")
hist(hitters1$ab,breaks=50,main="All Full-Time Hitters' At-Bats", xlab = "Average At-Bats")
n1 = nrow(lefties)
n2 = nrow(righties)
These look good. The batting averages appear to be normally distributed, and at-bats seem to be fairly uniform from 150 to around 600 per year.
We compute the various statistics and z. We are recomputing the comparison between left and right hitting full-time batters.
left_ab = sum(lefties$ab)
left_hits = sum(lefties$h)
right_ab = sum(righties$ab)
right_hits = sum(righties$h)
z =prop.test (c(left_hits,right_hits),c(left_ab,right_ab),p = NULL,alternative = "greater")
z
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(left_hits, right_hits) out of c(left_ab, right_ab)
## X-squared = 1311.1, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.010294 1.000000
## sample estimates:
## prop 1 prop 2
## 0.2787064 0.2679207
Chi-square has a value of 1311, which is much greater than the critical value 6.64, so we reject the null hypothesis and conclude that the batting averages are significantly higher for left-hitting hitters.
Major league baseball has studied this and offered reasons why left-batting hitters have higher averages. One reason is that they start a step or two closer to first base, thus giving them a better chance of an infield hit. Also, they face predominantly right-hand pitchers, which makes the ball easier to see.
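The printed output reports a 95 percent confidence interval because that is the default for `prop.test`, even though the decision rule here uses a 1% significance level. A hedged sketch on made-up counts showing how the interval level can be set to match and how the pieces of the result can be read programmatically:

```r
# Toy counts (hypothetical): hits and at-bats for two groups.
hits <- c(2800, 2600)
ab   <- c(10000, 10000)

# conf.level = 0.99 makes the reported interval match a 1% significance level.
res <- prop.test(hits, ab, alternative = "greater", conf.level = 0.99)

# Difference between the two averages, expressed in batting-average "points".
diff_points <- unname(res$estimate[1] - res$estimate[2]) * 1000
reject_h0   <- res$p.value < 0.01

print(round(diff_points, 1))
print(res$conf.int)
print(reject_h0)
```

Reading the difference in points alongside the p-value helps separate statistical significance from the practical size of the gap, which matters with samples as large as the ones in this study.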
Now that we have reasonable results, let’s look at our original question.
Next rebuild our original table with the updated data - hitters with at least 150 at bats in a season.
avg = computeBattingAverages(hitters1)
kable(avg,caption = "Batting Averages of Hitters by Hits-Throws")
| BatsThrows | Average |
|---|---|
| RR | 0.268 |
| RL | 0.270 |
| LR | 0.277 |
| LL | 0.281 |
| BR | 0.267 |
| BL | 0.275 |
It is not obvious whether there is any advantage for players who bat and throw from opposite sides. Fortunately, we previously computed combined averages for opposite and same hitting-throwing batters. The null hypothesis now is that opposite batting-throwing hitters hit the same as players who bat and throw from the same side. The alternative hypothesis is that players who bat and throw from opposite sides are significantly better hitters.
First look at the averages. The opposite hitting-throwing batting average is 0.277. The same hitting-throwing batting average is 0.270. It appears that the opposite average is higher. We test this with a proportion test.
opposite_ab = sum(batsThrowsOpposite$ab)
opposite_hits = sum(batsThrowsOpposite$h)
same_ab = sum(batsThrowsSame$ab)
same_hits = sum(batsThrowsSame$h)
z =prop.test (c(opposite_hits,same_hits),c(opposite_ab,same_ab),p = NULL,alternative = "greater")
z
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(opposite_hits, same_hits) out of c(opposite_ab, same_ab)
## X-squared = 407.71, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
## 0.006390188 1.000000000
## sample estimates:
## prop 1 prop 2
## 0.2772688 0.2703089
Chi-square has a value of 408, which is greater than the critical value 6.64, so we reject the null hypothesis in favor of the alternative and conclude that the batting averages are significantly higher for full-time players who bat and throw from opposite sides. This is consistent with the original idea that there is an advantage to a hitter looking at the pitcher with his dominant eye.
There are several more related questions that might be investigated. For example:
Each of those seems counter-intuitive, given what we learned about LR hitters. A follow-up study should investigate whether left-hitting players make significantly more money. This is an amazingly fun dataset and offers a nice playground for curious data scientists who love baseball!
Explanation of the 1968 season - https://en.wikipedia.org/wiki/1968_Major_League_Baseball_season
Discussion of left-hitting batters - http://probaseballinsider.com/advantages-of-being-a-left-handed-hitter/
Source for baseball data - https://www.kaggle.com/seanlahman/the-history-of-baseball