Mission

My goal is to measure how strongly home runs correlate with strikeouts. The question: do more home runs indicate more strikeouts, and how does that compare to hits versus strikeouts?

My hypothesis is that most big-name players with a lot of home runs also accumulate a large number of strikeouts. I anticipate that hits are generally less correlated with strikeouts than home runs are.

Step 1

link <- 'https://raw.githubusercontent.com/st3vejobs/mlb_18/main/mlb_players_18.csv'
mlbraw <- read.csv(url(link), na.strings = "")


mlb <- mlbraw[mlbraw$AB > 10, ]

summary(mlb)
##        X             name               team             position        
##  Min.   : 25.0   Length:696         Length:696         Length:696        
##  1st Qu.:215.8   Class :character   Class :character   Class :character  
##  Median :401.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :404.5                                                           
##  3rd Qu.:586.2                                                           
##  Max.   :925.0                                                           
##      games              AB              R                H         
##  Min.   :  3.00   Min.   : 11.0   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 28.00   1st Qu.: 53.0   1st Qu.:  4.75   1st Qu.:  9.00  
##  Median : 69.00   Median :187.0   Median : 21.00   Median : 42.50  
##  Mean   : 75.92   Mean   :236.4   Mean   : 30.98   Mean   : 58.78  
##  3rd Qu.:125.00   3rd Qu.:408.0   3rd Qu.: 52.00   3rd Qu.:102.25  
##  Max.   :162.00   Max.   :664.0   Max.   :129.00   Max.   :192.00  
##     doubles         triples             HR              RBI        
##  Min.   : 0.00   Min.   : 0.000   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.: 1.75   1st Qu.: 0.000   1st Qu.: 1.000   1st Qu.:  4.00  
##  Median : 8.00   Median : 0.000   Median : 4.000   Median : 20.00  
##  Mean   :11.85   Mean   : 1.214   Mean   : 8.016   Mean   : 29.55  
##  3rd Qu.:19.25   3rd Qu.: 2.000   3rd Qu.:13.000   3rd Qu.: 50.00  
##  Max.   :51.00   Max.   :12.000   Max.   :48.000   Max.   :130.00  
##      walks         strike_outs      stolen_bases   caught_stealing_base
##  Min.   :  0.00   Min.   :  0.00   Min.   : 0.00   Min.   : 0.000      
##  1st Qu.:  3.00   1st Qu.: 19.00   1st Qu.: 0.00   1st Qu.: 0.000      
##  Median : 15.00   Median : 45.00   Median : 1.00   Median : 0.000      
##  Mean   : 22.48   Mean   : 58.58   Mean   : 3.54   Mean   : 1.375      
##  3rd Qu.: 35.00   3rd Qu.: 93.00   3rd Qu.: 4.00   3rd Qu.: 2.000      
##  Max.   :130.00   Max.   :217.00   Max.   :45.00   Max.   :14.000      
##       AVG              OBP              SLG              OPS        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.1817   1st Qu.:0.2500   1st Qu.:0.2667   1st Qu.:0.5222  
##  Median :0.2330   Median :0.2980   Median :0.3715   Median :0.6720  
##  Mean   :0.2146   Mean   :0.2775   Mean   :0.3429   Mean   :0.6204  
##  3rd Qu.:0.2620   3rd Qu.:0.3310   3rd Qu.:0.4310   3rd Qu.:0.7600  
##  Max.   :0.3550   Max.   :0.4600   Max.   :0.7100   Max.   :1.0880
meanavg <- mean(mlb$AVG, na.rm = TRUE)
medavg <- median(mlb$AVG, na.rm = TRUE)
maxhr <- max(mlb$HR, na.rm = TRUE)
which(mlb$HR == maxhr)
## [1] 268
HRchamp <- which(mlb$HR == maxhr)  # which() already returns a numeric index
mlb[HRchamp, "name"]
## [1] " Davis, K"
mlb[HRchamp, c("name", "HR")]
##          name HR
## 321  Davis, K 48
meanavg
## [1] 0.2145503
medavg
## [1] 0.233

The home run champion for 2018 was Khris Davis, with 48 home runs. Among players with more than 10 at-bats, the mean batting average in 2018 was .2146 and the median was .233.

Step 2

# Build the table with data.frame() so HR stays numeric;
# cbind() on a character and a numeric vector coerces both to character.
hrs <- data.frame(Player = mlb$name, HR = mlb$HR)
hrsort <- hrs[order(hrs$HR, decreasing = TRUE), ]
topten <- head(hrsort, n = 10)
topten
##           Player HR
## 268     Davis, K 48
## 4    Martinez, J 43
## 460     Gallo, J 40
## 12      Trout, M 39
## 129   Ramirez, J 39
## 37    Arenado, N 38
## 96     Lindor, F 38
## 148   Stanton, G 38
## 34    Machado, M 37
## 51      Story, T 37
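# Players on either "Sox" team (Boston or Chicago White Sox)
Sox <- subset(mlb, team == 'BOS' | team == 'CWS')
# "Boom or bust" hitters: at least 100 strikeouts and at least 30 home runs
boombust <- subset(mlb, strike_outs >= 100 & HR >= 30, select = c(name, strike_outs, HR))
# Use data.frame() rather than cbind() so the counts stay numeric
soHR <- data.frame(Player = mlb$name, Strikeouts = mlb$strike_outs, HR = mlb$HR)

# Singles are the hits that are not doubles, triples, or home runs
mlb$singles <- mlb$H - mlb$doubles - mlb$triples - mlb$HR
# Total bases: 1 per single, 2 per double, 3 per triple, 4 per home run
mlb$TotalBasesHit <- mlb$singles + 2*mlb$doubles + 3*mlb$triples + 4*mlb$HR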

Step 3

library(ggplot2)
ggplot(mlb, aes(x = strike_outs, y = HR))+ 
  geom_point(size=1, color = "blue")+
  ggtitle('Correlation of Home Runs and Strikeouts')+
  xlab('Strikeouts')+
  ylab('Home Runs')+
  geom_smooth(method = 'loess', formula = y~x, color = 'green')+
  geom_smooth(method = 'lm', formula = y~x, color = 'red')+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(boombust, aes(x = strike_outs, y = HR))+ 
  geom_point(size=3, color = "red")+
  ggtitle('Correlation of Home Runs and Strikeouts for Boom/Bust Players')+
  xlab('Strikeouts')+
  ylab('Home Runs')+
  geom_smooth()+
  theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mlb, aes(x = strike_outs, y = H))+ 
  geom_point(size=1, color = "cyan")+
  ggtitle('Correlation of Hits and Strikeouts')+
  xlab('Strikeouts')+
  ylab('Hits')+
  geom_smooth()+
  theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mlb, aes(x=position, y=H))+
  geom_boxplot(outlier.color = 'red', outlier.size = 1)+
  xlab('Position')+
  ylab('Hits')+
  ggtitle('Hits by Position')+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(mlb, aes(x=AVG))+
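  # after_stat(density) replaces the deprecated ..density.. notation (ggplot2 3.3.0 and later)
  geom_histogram(aes(y = after_stat(density)), color = 'black', fill = 'deepskyblue')+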
  geom_density(alpha=.2, fill = 'red')+
  ggtitle('League Batting Average')+
  theme(plot.title = element_text(hjust = 0.5))+
  xlab('Batting Average')+
  ylab('Density')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Findings

The scatter plot shows a roughly linear, positive relationship between home runs and strikeouts on average. More strikeouts tend to go with more home runs, and a high home run total appears to be an even stronger indicator of a high strikeout total.
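One way to put a number on "nearly linear" is to fit the same straight line that the red `geom_smooth(method = 'lm')` layer draws and read off its slope and R-squared. The sketch below is only an illustration: `fit_hr_so` is a hypothetical helper name, and the toy data frame holds made-up values, not figures from the 2018 data. It assumes a data frame with the `HR` and `strike_outs` columns used above.

```r
# Hypothetical helper: fit HR ~ strike_outs and return the slope and R-squared.
fit_hr_so <- function(df) {
  fit <- lm(HR ~ strike_outs, data = df)
  c(slope     = unname(coef(fit)["strike_outs"]),
    r_squared = summary(fit)$r.squared)
}

# Toy values for illustration only (NOT taken from the 2018 data).
toy <- data.frame(strike_outs = c(20, 40, 60, 80, 100),
                  HR          = c(1, 5, 6, 7, 11))
fit_hr_so(toy)

# With the real data, the call would be: fit_hr_so(mlb)
```

An R-squared close to 1 would support the visual impression of a tight linear trend, while a lower value would suggest the scatter is wider than it looks.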

Interestingly, hits and strikeouts appear much less correlated than home runs and strikeouts. The relationship is far less linear, and the points are much more widely scattered.
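The comparison above is read from the plots. A correlation coefficient for each pair would make it checkable. The following is a minimal sketch, assuming a data frame with the `HR`, `H`, and `strike_outs` columns from this dataset. `compare_cor` is a hypothetical helper, and the toy values are invented for illustration rather than drawn from the real data.

```r
# Hypothetical helper: correlation of strikeouts with home runs and with hits.
# method can be "pearson" (linear) or "spearman" (rank-based).
compare_cor <- function(df, method = "pearson") {
  c(HR_vs_SO = cor(df$HR, df$strike_outs, method = method, use = "complete.obs"),
    H_vs_SO  = cor(df$H,  df$strike_outs, method = method, use = "complete.obs"))
}

# Toy values for illustration only (NOT taken from the 2018 data).
toy <- data.frame(strike_outs = c(20, 40, 60, 80, 100),
                  HR          = c(1, 5, 6, 7, 11),
                  H           = c(50, 30, 90, 40, 70))
compare_cor(toy)

# With the real data, the call would be: compare_cor(mlb)
```

If the hypothesis holds, `HR_vs_SO` should come out noticeably larger than `H_vs_SO` on the real data. Running it with `method = "spearman"` as well would show whether the result depends on the relationship being linear.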

I also wanted to see how hits vary by position. Shortstops show the most variation in hit totals, and pitchers show the least. Designated hitters have the most hits on average.
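The box plot readings can be backed by a small per-position table. The sketch below summarizes hits by position with the median (the line inside each box) and the interquartile range (the height of each box). `hits_by_position` is a hypothetical helper, and the toy rows are made up for illustration. It assumes the `position` and `H` columns from this dataset.

```r
# Hypothetical helper: median and interquartile range of hits per position.
hits_by_position <- function(df) {
  data.frame(
    position = sort(unique(df$position)),
    median_H = as.vector(tapply(df$H, df$position, median)),
    iqr_H    = as.vector(tapply(df$H, df$position, IQR))
  )
}

# Toy values for illustration only (NOT taken from the 2018 data).
toy <- data.frame(
  position = c("SS", "SS", "SS", "P", "P", "P", "DH", "DH", "DH"),
  H        = c(20, 100, 180, 0, 2, 4, 120, 140, 160)
)
hits_by_position(toy)

# With the real data, the call would be: hits_by_position(mlb)
```

Sorting the result by `iqr_H` or `median_H` would show directly which position has the widest spread and which has the highest typical hit total.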

The histogram shows the distribution of batting averages across the league. Most batters fall between .200 and .300.
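The "most batters" statement can be checked by computing the share of averages that fall in that range. This is a minimal sketch: `share_between` is a hypothetical helper, and the toy vector contains made-up averages, not values from the dataset.

```r
# Hypothetical helper: fraction of values between lo and hi (inclusive).
share_between <- function(x, lo = 0.2, hi = 0.3) {
  mean(x >= lo & x <= hi, na.rm = TRUE)
}

# Toy values for illustration only (NOT taken from the 2018 data).
toy_avg <- c(0.150, 0.210, 0.233, 0.262, 0.310)
share_between(toy_avg)  # 3 of the 5 toy values fall in the range

# With the real data, the call would be: share_between(mlb$AVG)
```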