My goal is to see how strongly home runs are correlated with strikeouts. The question: do more home runs indicate more strikeouts? How does this compare to hits versus strikeouts?
My hypothesis is that most of the big-name players with a lot of home runs also accumulate a large number of strikeouts. I anticipate that hits are generally less correlated with strikeouts than home runs are.
link <- 'https://raw.githubusercontent.com/st3vejobs/mlb_18/main/mlb_players_18.csv'
mlbraw <- read.csv(url(link), na.strings = "")
mlb <- mlbraw[mlbraw$AB > 10, ]
summary(mlb)
## X name team position
## Min. : 25.0 Length:696 Length:696 Length:696
## 1st Qu.:215.8 Class :character Class :character Class :character
## Median :401.5 Mode :character Mode :character Mode :character
## Mean :404.5
## 3rd Qu.:586.2
## Max. :925.0
## games AB R H
## Min. : 3.00 Min. : 11.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 28.00 1st Qu.: 53.0 1st Qu.: 4.75 1st Qu.: 9.00
## Median : 69.00 Median :187.0 Median : 21.00 Median : 42.50
## Mean : 75.92 Mean :236.4 Mean : 30.98 Mean : 58.78
## 3rd Qu.:125.00 3rd Qu.:408.0 3rd Qu.: 52.00 3rd Qu.:102.25
## Max. :162.00 Max. :664.0 Max. :129.00 Max. :192.00
## doubles triples HR RBI
## Min. : 0.00 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 1.75 1st Qu.: 0.000 1st Qu.: 1.000 1st Qu.: 4.00
## Median : 8.00 Median : 0.000 Median : 4.000 Median : 20.00
## Mean :11.85 Mean : 1.214 Mean : 8.016 Mean : 29.55
## 3rd Qu.:19.25 3rd Qu.: 2.000 3rd Qu.:13.000 3rd Qu.: 50.00
## Max. :51.00 Max. :12.000 Max. :48.000 Max. :130.00
## walks strike_outs stolen_bases caught_stealing_base
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 19.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 15.00 Median : 45.00 Median : 1.00 Median : 0.000
## Mean : 22.48 Mean : 58.58 Mean : 3.54 Mean : 1.375
## 3rd Qu.: 35.00 3rd Qu.: 93.00 3rd Qu.: 4.00 3rd Qu.: 2.000
## Max. :130.00 Max. :217.00 Max. :45.00 Max. :14.000
## AVG OBP SLG OPS
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1817 1st Qu.:0.2500 1st Qu.:0.2667 1st Qu.:0.5222
## Median :0.2330 Median :0.2980 Median :0.3715 Median :0.6720
## Mean :0.2146 Mean :0.2775 Mean :0.3429 Mean :0.6204
## 3rd Qu.:0.2620 3rd Qu.:0.3310 3rd Qu.:0.4310 3rd Qu.:0.7600
## Max. :0.3550 Max. :0.4600 Max. :0.7100 Max. :1.0880
meanavg <- mean(mlb$AVG, na.rm = TRUE)
medavg <- median(mlb$AVG, na.rm = TRUE)
maxhr <- max(mlb$HR, na.rm = TRUE)
which(mlb$HR == maxhr)
## [1] 268
HRchamp <- which(mlb$HR == maxhr)
mlb[HRchamp, "name"]
## [1] " Davis, K"
mlb[HRchamp, c("name", "HR")]
## name HR
## 321 Davis, K 48
meanavg
## [1] 0.2145503
medavg
## [1] 0.233
The home run champion for 2018 was Khris Davis, with 48. Among players with more than 10 at-bats, the mean of the individual batting averages was .2146 and the median was .233.
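The mean reported above weights every player equally, so a hitter with 11 at-bats counts as much as an everyday starter. A minimal sketch of the at-bat-weighted alternative follows, assuming the `H` and `AB` columns shown in the summary; `pooled_avg` is a hypothetical helper name and not part of the original analysis.

```r
# Hypothetical helper: pooled (at-bat-weighted) batting average,
# i.e. total hits divided by total at-bats.
pooled_avg <- function(hits, at_bats) {
  # Keep only rows where both values are present so the two sums line up.
  ok <- !is.na(hits) & !is.na(at_bats)
  sum(hits[ok]) / sum(at_bats[ok])
}
```

With the filtered data frame this would be called as `pooled_avg(mlb$H, mlb$AB)`. Using the rounded means in the summary (58.78 hits and 236.4 at-bats per player), the pooled figure should come out near .249, which is higher than .2146 because low-average players with few at-bats no longer pull the mean down.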
# Build the table with data.frame() directly; cbind() would coerce HR to character.
hrs <- data.frame(Player = mlb$name, HR = mlb$HR)
hrsort <- hrs[order(hrs$HR, decreasing = TRUE), ]
topten <- head(hrsort, n = 10)
topten
## Player HR
## 268 Davis, K 48
## 4 Martinez, J 43
## 460 Gallo, J 40
## 12 Trout, M 39
## 129 Ramirez, J 39
## 37 Arenado, N 38
## 96 Lindor, F 38
## 148 Stanton, G 38
## 34 Machado, M 37
## 51 Story, T 37
Sox <- subset(mlb, team == 'BOS' | team == 'CWS')
boombust <- subset(mlb, strike_outs >= 100 & HR >= 30, select = c(name, strike_outs, HR))
# data.frame() keeps Strikeouts and HR numeric; cbind() would turn them into character.
soHR <- data.frame(Player = mlb$name, Strikeouts = mlb$strike_outs, HR = mlb$HR)
mlb$singles <- mlb$H - mlb$doubles - mlb$triples - mlb$HR
mlb$TotalBasesHit <- mlb$singles + 2*mlb$doubles + 3*mlb$triples + 4*mlb$HR
library(ggplot2)
ggplot(mlb, aes(x = strike_outs, y = HR))+
geom_point(size=1, color = "blue")+
ggtitle('Correlation of Home Runs and Strikeouts')+
xlab('Strikeouts')+
ylab('Home Runs')+
geom_smooth(method = 'loess', formula = y~x, color = 'green')+
geom_smooth(method = 'lm', formula = y~x, color = 'red')+
theme(plot.title = element_text(hjust = 0.5))
ggplot(boombust, aes(x = strike_outs, y = HR))+
geom_point(size=3, color = "red")+
ggtitle('Correlation of Home Runs and Strikeouts for Boom/Bust Players')+
xlab('Strikeouts')+
ylab('Home Runs')+
geom_smooth()+
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mlb, aes(x = strike_outs, y = H))+
geom_point(size=1, color = "cyan")+
ggtitle('Correlation of Hits and Strikeouts')+
xlab('Strikeouts')+
ylab('Hits')+
geom_smooth()+
theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(mlb, aes(x=position, y=H))+
geom_boxplot(outlier.color = 'red', outlier.size = 1)+
xlab('Position')+
ylab('Hits')+
ggtitle('Hits by Position')+
theme(plot.title = element_text(hjust = 0.5))
ggplot(mlb, aes(x=AVG))+
geom_histogram(aes(y = after_stat(density)), color = 'black', fill = 'deepskyblue')+
geom_density(alpha=.2, fill = 'red')+
ggtitle('League Batting Average')+
theme(plot.title = element_text(hjust = 0.5))+
xlab('Batting Average')+
ylab('Density')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The scatter plot shows a roughly linear relationship between home runs and strikeouts, on average. Players with more strikeouts tend to hit more home runs, and the signal looks stronger in the other direction: players with many home runs almost always have many strikeouts as well. Both are counting stats, so some of this relationship reflects playing time.
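One way to put a number on how linear the relationship is would be to fit the same straight line that the red `geom_smooth(method = 'lm')` layer draws and read off its slope and R squared. This is a minimal sketch; `fit_line` is a hypothetical helper, not something from the original analysis.

```r
# Hypothetical helper: ordinary least-squares fit of y on x,
# returning the slope and the share of variance explained (R squared).
fit_line <- function(x, y) {
  fit <- lm(y ~ x)
  c(slope = unname(coef(fit)[2]), r_squared = summary(fit)$r.squared)
}
```

Here it would be called as `fit_line(mlb$strike_outs, mlb$HR)`. The slope reads as additional home runs per additional strikeout, and an R squared close to 1 would support the linear reading of the plot.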
Interestingly, hits and strikeouts appear much less strongly related than home runs and strikeouts. The trend is less linear, and the points scatter far more widely around it.
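The comparison between the two plots can also be checked numerically rather than by eye. The sketch below computes the Pearson correlation of each column with strikeouts, assuming the column names `HR`, `H`, and `strike_outs` from the summary; `cor_with_so` is a hypothetical helper.

```r
# Hypothetical helper: Pearson correlation of each listed column with strikeouts.
cor_with_so <- function(df, cols = c("HR", "H"), so = "strike_outs") {
  # sapply() over a character vector names the result by column.
  sapply(cols, function(col) cor(df[[col]], df[[so]], use = "complete.obs"))
}
```

Calling `cor_with_so(mlb)` returns one coefficient per column, so the hypothesis holds if the `HR` value is clearly larger than the `H` value.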
I also wanted to see how hits vary by position. In the box plot, shortstops show the most variation in hits and pitchers the least. Designated hitters have the highest typical hit totals.
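The spread by position can be tabulated alongside the box plot. This sketch assumes the `position` and `H` columns; `hits_by_position` is a hypothetical helper, and using standard deviation and IQR as the measures of variation is an assumption, since the text does not say which one it has in mind.

```r
# Hypothetical helper: per-position count, median, standard deviation and IQR of hits,
# sorted so the most variable position comes first.
hits_by_position <- function(df) {
  groups <- split(df$H, df$position)
  out <- data.frame(
    position = names(groups),
    n        = vapply(groups, length, integer(1), USE.NAMES = FALSE),
    median   = vapply(groups, median, numeric(1), USE.NAMES = FALSE),
    sd       = vapply(groups, sd, numeric(1), USE.NAMES = FALSE),
    iqr      = vapply(groups, IQR, numeric(1), USE.NAMES = FALSE)
  )
  # Positions with a single player have an undefined sd (NA) and sort last.
  out[order(out$sd, decreasing = TRUE), ]
}
```

`hits_by_position(mlb)` gives one row per position. The `n` column is worth checking before comparing spreads, because a position with only a handful of players can look unusually tight or unusually wide.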
The histogram shows the distribution of batting averages across the league. Most batters fall between 0.2 and 0.3, which agrees with the quartiles in the summary: the middle half of batters lies between .182 and .262.
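The claim about the 0.2 to 0.3 range can be stated as a proportion. A minimal sketch, assuming the `AVG` column; `share_between` is a hypothetical helper.

```r
# Hypothetical helper: fraction of values falling in [lower, upper], ignoring NAs.
share_between <- function(x, lower = 0.2, upper = 0.3) {
  mean(x >= lower & x <= upper, na.rm = TRUE)
}
```

`share_between(mlb$AVG)` returns the share of batters in that range, and the bounds can be changed to examine the tails, for example `share_between(mlb$AVG, 0, 0.1)`.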