Overview of the Investigation

I have often wondered whether certain players have an advantage if they bat from one side of the plate and throw with the opposite hand.

This curiosity springs from a comment my son’s golf instructor made about left-handedness. My son was left-handed, but the pro said, “Golf is a left-handed game, so he would be better off playing right-handed.”

This made me wonder about baseball. A right-handed batter stands with his left side toward the pitcher and so sees the pitch primarily with his left eye, yet most right-handed people are right-eye dominant. This seems like a major disadvantage: you are using your weaker eye for the most important event in baseball.

The purpose of this study is to determine whether players who use opposite hitting-throwing configurations, i.e., bats left and throws right or bats right and throws left, have a statistically significant advantage over conventional players. In the process of this investigation, we will explore:

Fortunately, the Kaggle dataset called “The History of Baseball” has most of what we need to investigate these questions.


Establish Libraries and Load the Data

#library(data.table)
library(ggplot2) # Data visualization
library(dplyr)   # Data manipulation (filter, group_by, mutate)
library(knitr)   # kable() for tables

## setwd("~/GitHub/Baseball") # for local processing

## Load the Baseball Datasets
players = read.csv("input/player.csv")
batting = read.csv("input/batting.csv")
pitchers = read.csv("input/pitching.csv")

# Anyone who appears in the pitching table is treated as a pitcher.
pitcherid = unique(pitchers$player_id)
nonPitchers = players[!(players$player_id %in% pitcherid), ]
#dim(nonPitchers)

# Drop players with no recorded throwing or batting side.
nonPitchers = nonPitchers %>% filter(throws != "")
nonPitchers = nonPitchers %>% filter(bats != "")
#dim(nonPitchers)
nonPitcherID = nonPitchers$player_id

Note that we loaded the pitcher data too. The reason will become obvious as we explore the data.

Data Exploration

Our first task is to do a little cleanup of the data. Because batting average is hits divided by at-bats, we eliminate the records with NA or zero at-bats. We also eliminate records where there is no record of how the player bats or throws. This information has only been kept since around 1915.

We also merge the batting records with the non-pitchers, which we will need later. As an aside, this eliminates perhaps one of the greatest hitters ever, Babe Ruth, who spent the first portion of his career as a pitcher with the Red Sox. He is an outlier, and eliminating a single player will not alter the statistical validity of our results.

# Keep only batting records with a known, positive number of at-bats.
battingwoNA = batting[!is.na(batting$ab),]
battingwoNA = battingwoNA[battingwoNA$ab>0,]

playerBatsThrows = players %>% filter(throws != "")
playerBatsThrows = playerBatsThrows %>% filter(bats != "")

# These two statements first merge the player list with
# the batting information without NAs,
# then create the hitters dataset by merging the
# batting records without NAs with the non-pitchers.

player.batting = merge(playerBatsThrows,battingwoNA)
hitters = merge(nonPitchers,battingwoNA)
dim(player.batting)
## [1] 84236    45
dim(hitters)
## [1] 51066    45
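
To make the filtering and merging steps concrete, here is a minimal sketch on made-up data. The toy data frames and their ids are hypothetical and only mimic the shape of the real tables; the point is that merge() joins on the shared player_id column and keeps only rows that match on both sides.

```r
# Hypothetical toy data; ids and values are invented for illustration.
players_toy = data.frame(player_id = c("a01", "b02", "c03"),
                         bats   = c("L", "R", "R"),
                         throws = c("R", "R", "L"))
pitching_toy = data.frame(player_id = c("c03"))
batting_toy = data.frame(player_id = c("a01", "a01", "b02", "c03", "d04"),
                         year = c(1950, 1951, 1950, 1950, 1950),
                         ab   = c(400, 0, 350, 20, 100),
                         h    = c(120, 0, 90, 2, 25))

# Keep only players who never appear in the pitching table.
non_pitchers_toy = players_toy[!(players_toy$player_id %in% pitching_toy$player_id), ]

# Drop seasons with missing or zero at-bats.
batting_ok = batting_toy[!is.na(batting_toy$ab) & batting_toy$ab > 0, ]

# Inner join on player_id: the pitcher c03 and the unmatched d04 both drop out,
# as does the zero at-bat season for a01.
hitters_toy = merge(non_pitchers_toy, batting_ok)
hitters_toy
```

Note that a player is dropped entirely if he ever pitched, which is how Babe Ruth ends up excluded from the hitters dataset.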

Because the metric we are using to evaluate hitting effectiveness is batting average, let’s calculate it over the data by year to see if there is anything interesting.

x<-player.batting %>% group_by(year) %>%     
    mutate(season_total_ab=sum(ab)) %>% 
    mutate(season_total_h=sum(h)) %>% 
    mutate (battingAverage = season_total_h/season_total_ab) %>%
    ungroup()
plot (x=x$year,y=x$battingAverage,xlab = "Year",ylab="Batting Averages",main="Yearly Batting Averages")
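
The code above uses mutate(), so every batting record carries its season's average and the plot draws many overlapping points per year. As an alternative sketch, summarise() collapses each season to a single row. The toy data frame here is hypothetical; in the notebook the input would be player.batting.

```r
library(dplyr)

# Hypothetical toy records standing in for player.batting.
toy = data.frame(year = c(1967, 1967, 1968, 1968),
                 ab   = c(500, 300, 450, 350),
                 h    = c(140, 70, 110, 80))

# One row per season: total hits divided by total at-bats.
yearly = toy %>%
    group_by(year) %>%
    summarise(season_total_ab = sum(ab),
              season_total_h  = sum(h)) %>%
    mutate(battingAverage = season_total_h / season_total_ab)
yearly
```

The resulting averages are the same as with mutate(); only the number of rows differs.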

There is an interesting point in the late 1960s where the composite batting average was much lower. Let’s zoom in on that.

unique(x[x$battingAverage<0.238,]$year)
## [1] 1968

The year was 1968. The graph seems to indicate that averages had been steadily declining for many years. That year was called “The Year of the Pitcher.” Further research shows that baseball officials were concerned about this and introduced major rule changes to improve the chances of the batter, such as lowering the pitching mound and shrinking the strike zone. See https://en.wikipedia.org/wiki/1968_Major_League_Baseball_season for more information.

Left-Right Study

Now we look at the descriptive statistics for the various combinations of hitting and throwing. There are three states for hitting, L, R, and B (for switch hitting), and two for throwing, L and R. Apparently there have been no ambidextrous major league players.

Note that this uses a batting average computed once for each bats-throws group for convenience, which makes it easier to understand some of the calculations. Another method, used later, computes weighted batting averages over the entire population.

First, we develop a function to do this calculation repeatedly.

computeBattingAverages = function (df) {

  # Introduce bats and throws into the groupings, then calculate
  # the batting average (total hits / total at-bats) for each group.
  x <- df %>% group_by(bats,throws) %>%
      mutate(total_ab=sum(ab)) %>%
      mutate(total_h=sum(h)) %>%
      mutate(battingAverage = total_h/total_ab) %>%
      ungroup()

  # Split the records by bats-throws combination.
  rr = x[x$bats=="R" & x$throws=="R",]
  rl = x[x$bats=="R" & x$throws=="L",]
  lr = x[x$bats=="L" & x$throws=="R",]
  ll = x[x$bats=="L" & x$throws=="L",]
  br = x[x$bats=="B" & x$throws=="R",]
  bl = x[x$bats=="B" & x$throws=="L",]

  # BatsThrows is a factor so that plot() draws one category per group.
  BatsThrows = factor(c("RR","RL","LR","LL","BR","BL"))
  Average = c(round(mean(rr$battingAverage,na.rm=TRUE),3),
        round(mean(rl$battingAverage,na.rm=TRUE),3),
        round(mean(lr$battingAverage,na.rm=TRUE),3),
        round(mean(ll$battingAverage,na.rm=TRUE),3),
        round(mean(br$battingAverage,na.rm=TRUE),3),
        round(mean(bl$battingAverage,na.rm=TRUE),3)
        )

  # Return the summary table; the caller formats it with kable().
  averages = data.frame(BatsThrows,Average)
  averages
}
avg = computeBattingAverages(player.batting)
kable(avg,caption = "Batting Averages by Hits-Throws for All Hitters")
Batting Averages by Hits-Throws for All Hitters

BatsThrows   Average
RR           0.257
RL           0.223
LR           0.273
LL           0.272
BR           0.261
BL           0.248
plot(avg$BatsThrows,avg$Average,main="Batting Averages by Hits-Throws - All Hitters")

The result is somewhat unexpected, given the original question. The batting average for players who bat right and throw left is much lower than all the others.

This is most likely due to the fact that in the original data, pitchers were included as hitters. Pitchers are generally weak hitters, and there is probably a disproportionate number of left-handed pitchers in this group, which may be biasing the result. Remembering our original discussion about datasets, we now know why the pitcher dataset might be needed.

The other odd fact is that players who bat left (both left and right-handed throwers) have the highest overall average. Why?

Eliminate the Pitchers

We have already established the non-pitchers in a separate dataset. The use of an inner join established the batting records for all non-pitchers.

We look first at the overall average for all hitters in the hitters dataset (excluding pitchers).

sum(hitters$h)/sum(hitters$ab)
## [1] 0.2671092
avg = computeBattingAverages(hitters)
kable(avg,caption = "Batting Averages by Hits-Throws - Hitters Only (No Pitchers)")
Batting Averages by Hits-Throws - Hitters Only (No Pitchers)

BatsThrows   Average
RR           0.264
RL           0.262
LR           0.273
LL           0.277
BR           0.264
BL           0.271
plot(avg$BatsThrows,avg$Average,main="Batting Averages by Hits-Throws - Hitters Only (No Pitchers)")

This set of averages seems to more reasonably represent the profile of major league batters. The assumption is that left-handed pitchers made up a disproportionate share of the bats-right, throws-left group, which would account for that group's lower average when pitchers were included.

However, it does not seem to support our original hypothesis that players who bat on one side and throw from the other are better hitters.

It does surface another question: are left-hitting batters better hitters than right-hitting batters? We will examine this more formally with the z statistic, which is used to test whether the difference between two population proportions is equal to zero.

z-Test Statistic for Proportions - Left Hitting vs. Right Hitting

The assumption is that there are two populations, left and right-hitting batters.

Our null hypothesis H0 is that the two population batting averages are equal, i.e., mu(right) = mu(left).

The alternative hypothesis H1 is that mu(right) < mu(left) [the left-hitting players are better hitters].

The prop.test function in R computes a Chi-square test. We use a 1% significance level. The critical value for the Chi-square statistic is 6.64 with 1 degree of freedom.
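
As a quick check on where that threshold comes from, the sketch below computes it with base R quantile functions. The variable names are illustrative. The Chi-square critical value with 1 degree of freedom equals the square of the two-sided normal critical value; for a one-sided alternative the squared normal critical value is smaller, so comparing against 6.64 is, if anything, conservative.

```r
alpha = 0.01

# Critical value for a Chi-square statistic with 1 degree of freedom.
chisq_crit = qchisq(1 - alpha, df = 1)

# The same threshold on the z scale: the two-sided normal critical value.
z_two_sided = qnorm(1 - alpha / 2)

# One-sided normal critical value, relevant when alternative = "greater".
z_one_sided = qnorm(1 - alpha)

c(chisq_crit = chisq_crit,
  z_two_sided_squared = z_two_sided^2,
  z_one_sided_squared = z_one_sided^2)
```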

An alternative might be to use weighted averages to normalize players’ averages. This would use weighted means and variances, in which averages based on more at-bats contribute more. The at-bat information will weight the averages correctly. Fortunately, R has a nice library called SDMTools which handles weighted means and variances. The R t.test function will not work in this problem, because the calculations must be weighted to compute correct means and variances.
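
To illustrate why the weighting matters, here is a small sketch with made-up seasons. It uses base R's weighted.mean() rather than SDMTools, purely to keep the example dependency-free. Weighting each season's average by its at-bats gives the same number as dividing total hits by total at-bats, whereas the plain mean lets a 20 at-bat season count as much as a 600 at-bat one.

```r
# Hypothetical seasons for one group of hitters.
ab = c(600, 150, 20)
h  = c(180, 30, 10)
ba = h / ab                              # per-season batting averages

unweighted = mean(ba)                    # every season counts equally
weighted   = weighted.mean(ba, w = ab)   # seasons weighted by at-bats
pooled     = sum(h) / sum(ab)            # total hits over total at-bats

c(unweighted = unweighted, weighted = weighted, pooled = pooled)
```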

The z Proportional Calculations

First let’s split the data into two populations, left and right-hitting batters.

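library(SDMTools) # weighted means and variances (not used in the code shown here)
hitters$ba = hitters$h/hitters$ab
lefties = hitters[hitters$bats=="L",]
righties = hitters[hitters$bats=="R",]
# Named switchHitters to avoid masking base R's switch() function.
switchHitters = hitters[hitters$bats=="B",]

n1 = nrow(lefties)
n2 = nrow(righties)
bnum = nrow(switchHitters)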
n1
## [1] 15875
n2
## [1] 30188

There are 15875 season records for players who bat left, 30188 for players who bat right, and 5003 for switch hitters.

Next we use the proportional test in R to compare the two proportions, where the proportions are the respective batting averages of left and right hitting players.

left_ab = sum(lefties$ab)
left_hits = sum(lefties$h)
right_ab = sum(righties$ab)
right_hits = sum(righties$h)
z =prop.test (c(left_hits,right_hits),c(left_ab,right_ab),p = NULL,alternative = "greater")
z
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(left_hits, right_hits) out of c(left_ab, right_ab)
## X-squared = 1512.1, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.0104844 1.0000000
## sample estimates:
##    prop 1    prop 2 
## 0.2744767 0.2635273

Chi-square has a value of 1512, which is much greater than the critical value 6.64, so we reject the null hypothesis and conclude that batting averages are higher for left-hitting batters. This seems obvious given the difference of about 11 points in batting average (0.274 versus 0.264).
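
For readers who want to see what the test is doing, the following sketch computes the two-proportion z statistic by hand and compares it with prop.test. The hit and at-bat counts are hypothetical, not the notebook's totals, and the continuity correction is switched off (correct = FALSE) so that the reported X-squared equals z squared exactly.

```r
# Hypothetical counts, chosen only for illustration.
hits    = c(2750, 2640)
at_bats = c(10000, 10000)

p_hat  = hits / at_bats
p_pool = sum(hits) / sum(at_bats)               # pooled proportion under H0
se     = sqrt(p_pool * (1 - p_pool) * sum(1 / at_bats))
z_stat = (p_hat[1] - p_hat[2]) / se
p_one_sided = pnorm(z_stat, lower.tail = FALSE) # P(Z > z) for "greater"

# Without the continuity correction, X-squared is the square of z.
pt = prop.test(hits, at_bats, alternative = "greater", correct = FALSE)
c(z = z_stat, z_squared = z_stat^2, x_squared = unname(pt$statistic),
  p_manual = p_one_sided, p_prop_test = pt$p.value)
```

With totals as large as those in this study, the continuity correction changes the statistic only slightly.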

Further Analysis - Infrequent Plate Appearances and Utility Players

When we look at the histogram plots of the batting averages and player plate appearances, one thing jumps out. There are large numbers of averages, over 3000, that are zero. Could those be biasing our proportion results?

hist(hitters$ba,breaks=50,main="All Hitters' Batting Averages",xlab = "Seasonal Batting Average")

hist(hitters$ab,breaks=50,main="All Hitters' At-Bats",xlab = "Season At-Bats")

We also eliminate the utility players. Our definition of a full-time player is someone with at least 150 at-bats in a season.

First we regenerate our two populations, eliminating the zero-hit seasons. This should eliminate the concentration of entries at 0.

hitters1 = hitters[hitters$h>0 & hitters$ab >=150,]
#dim(hitters1)
lefties = hitters1[hitters1$bats=="L",]
righties = hitters1[hitters1$bats=="R",]
oppositeLR = hitters1[hitters1$bats=="L" & hitters1$throws=="R",]
oppositeRL = hitters1[hitters1$bats=="R" & hitters1$throws=="L",]

sameLL = hitters1[hitters1$bats=="L" & hitters1$throws=="L",]
sameRR = hitters1[hitters1$bats=="R" & hitters1$throws=="R",]

batsThrowsOpposite = rbind(oppositeLR,oppositeRL)
batsThrowsSame = rbind (sameLL,sameRR)
OppositeBA = sum(batsThrowsOpposite$h)/sum(batsThrowsOpposite$ab)
SameBA = sum(batsThrowsSame$h)/sum(batsThrowsSame$ab)

hist(hitters1$ba,breaks=50, main="All Full-Time Hitters' Batting Averages",xlab = "Season Batting Average")

hist(hitters1$ab,breaks=50,main="All Full-Time Hitters' At-Bats", xlab = "Season At-Bats")

n1 = nrow(lefties)
n2 = nrow(righties)

These look good. The batting averages appear to be roughly normally distributed, and at-bats seem to be fairly uniform from 150 to around 600 per season.

We compute the various statistics and z. We are recomputing the comparison between left and right hitting full-time batters.

left_ab = sum(lefties$ab)
left_hits = sum(lefties$h)
right_ab = sum(righties$ab)
right_hits = sum(righties$h)
z =prop.test (c(left_hits,right_hits),c(left_ab,right_ab),p = NULL,alternative = "greater")
z
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(left_hits, right_hits) out of c(left_ab, right_ab)
## X-squared = 1311.1, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.010294 1.000000
## sample estimates:
##    prop 1    prop 2 
## 0.2787064 0.2679207

Chi-square has a value of 1311, which is much greater than the critical value 6.64, so we reject the null hypothesis and conclude that batting averages are significantly higher for left-hitting batters.
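
One detail in the output above: prop.test reports a 95 percent confidence interval by default, even though the test here is judged at the 1% level. A minimal sketch, again with hypothetical counts, shows how the conf.level argument can be set to match. For a one-sided "greater" alternative the interval runs from a lower bound up to 1, and raising the confidence level pushes that lower bound down.

```r
# Hypothetical counts, chosen only for illustration.
hits    = c(2790, 2680)
at_bats = c(10000, 10000)

ci95 = prop.test(hits, at_bats, alternative = "greater")$conf.int
ci99 = prop.test(hits, at_bats, alternative = "greater",
                 conf.level = 0.99)$conf.int

# Each row is (lower bound, upper bound) for the difference in proportions.
rbind(ci95 = as.numeric(ci95), ci99 = as.numeric(ci99))
```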

Major league baseball has studied this and offered reasons why left-hitting batters have higher averages. One reason is that they start a step or two closer to first base, which gives them a better chance of an infield hit. Also, they face predominantly right-handed pitchers, which makes the ball easier to see.

Now that we have reasonable results, let’s look at our original question.

Revisiting the Question - Do players who bat from one side and throw with the other hand have an advantage?

Next rebuild our original table with the updated data - hitters with at least 150 at bats in a season.

avg = computeBattingAverages(hitters1)
kable(avg,caption = "Batting Averages of Hitters by Hits-Throws")
Batting Averages of Hitters by Hits-Throws

BatsThrows   Average
RR           0.268
RL           0.270
LR           0.277
LL           0.281
BR           0.267
BL           0.275

It is not obvious whether there is any advantage for players who bat and throw from opposite sides. Fortunately, we previously computed combined averages for opposite and same hitting-throwing batters. The null hypothesis now is that opposite batting-throwing hitters hit the same as players who bat and throw from the same side. The alternative hypothesis is that players who bat and throw from opposite sides are significantly better hitters.

First look at the averages. The combined batting average is 0.277 for opposite hitting-throwing batters and 0.270 for same hitting-throwing batters. It appears that the opposite average is higher. We test this with a proportion test.

opposite_ab = sum(batsThrowsOpposite$ab)
opposite_hits = sum(batsThrowsOpposite$h)
same_ab = sum(batsThrowsSame$ab)
same_hits = sum(batsThrowsSame$h)
z =prop.test (c(opposite_hits,same_hits),c(opposite_ab,same_ab),p = NULL,alternative = "greater")
z
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  c(opposite_hits, same_hits) out of c(opposite_ab, same_ab)
## X-squared = 407.71, df = 1, p-value < 2.2e-16
## alternative hypothesis: greater
## 95 percent confidence interval:
##  0.006390188 1.000000000
## sample estimates:
##    prop 1    prop 2 
## 0.2772688 0.2703089

Chi-square has a value of 408, which is greater than the critical value 6.64, so we reject the null hypothesis in favor of the alternative and conclude that batting averages are significantly higher for full-time players who bat and throw from opposite sides. This supports the original idea that there is an advantage to a hitter looking at the pitcher with his dominant eye.

Further Investigation

There are several more related questions that might be investigated. For example:

Each of those seems counter-intuitive, given what we learned about LR hitters. A follow-up study should investigate whether left-hitting players make significantly more money. This is an amazingly fun dataset and offers a nice playground for curious data scientists who love baseball!

References

Explanation of the 1968 season - https://en.wikipedia.org/wiki/1968_Major_League_Baseball_season

Discussion of left-hitting batters - http://probaseballinsider.com/advantages-of-being-a-left-handed-hitter/

Source for baseball data - https://www.kaggle.com/seanlahman/the-history-of-baseball