Sports Data Science
[1] "pitcher" "Y" "Ym1"
[4] "Ym1_IP" "Ym1_strikeout_rate" "Ym1_walk_rate"
[7] "Ym1_homerun_rate" "Ym1_BABIP" "Y_IP"
[10] "Y_strikeout_rate" "Y_walk_rate" "Y_homerun_rate"
[13] "Y_BABIP"
plot(dips$Ym1_strikeout_rate, dips$Y_strikeout_rate)
ggplot(dips, aes(Ym1_strikeout_rate, Y_strikeout_rate)) + geom_point()
mean(dips$Ym1_strikeout_rate); mean(dips$Y_strikeout_rate)
[1] 0.1676127
[1] 0.162957
sd(dips$Ym1_strikeout_rate); sd(dips$Y_strikeout_rate)
[1] 0.04591527
[1] 0.04441072
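# Convert a numeric vector to z-scores: center at the mean, scale by the sd
zScore <- function(x) {
  (x - mean(x)) / sd(x)
}
# Add standardized strikeout rates for both years
dips <- dips %>%
  mutate(Y_zK = zScore(Y_strikeout_rate),
         Ym1_zK = zScore(Ym1_strikeout_rate))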
mean(dips$Ym1_zK); mean(dips$Y_zK)
[1] 5.54321e-17
[1] 4.737552e-17
sd(dips$Ym1_zK); sd(dips$Y_zK)
[1] 1
[1] 1
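As a cross-check on the hand-written zScore(), base R's scale() performs the same centering and scaling. The sketch below uses made-up numbers (the vector x is hypothetical, not the dips data) and only confirms that the two approaches agree.

```r
# Hypothetical data standing in for a column of strikeout rates
set.seed(1)
x <- rnorm(50, mean = 0.17, sd = 0.045)

zScore <- function(x) {
  (x - mean(x)) / sd(x)
}

z_manual <- zScore(x)
z_scale  <- as.numeric(scale(x))  # scale() returns a one-column matrix

all.equal(z_manual, z_scale)      # the two versions agree
mean(z_manual)                    # essentially 0 (floating-point noise)
sd(z_manual)                      # 1
```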
ggplot(dips, aes(Ym1_zK, Y_zK)) + geom_point() + geom_smooth(method = "lm")
lm(Y_zK ~ Ym1_zK, data=dips)
Call:
lm(formula = Y_zK ~ Ym1_zK, data = dips)
Coefficients:
(Intercept) Ym1_zK
-1.612e-18 7.454e-01
\[ y = m \cdot x + b \]
\[ Y\_zK = m \cdot Ym1\_zK + b \]
\[ Y\_zK = 0.745 \cdot Ym1\_zK + 0 \]
cor(dips$Y_zK, dips$Ym1_zK)
[1] 0.7453852
cor(dips$Y_strikeout_rate, dips$Ym1_strikeout_rate)
[1] 0.7453852
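The two facts on display here (z-scoring does not change the correlation, and the slope of the regression on z-scores equals r) can be checked on any data. The sketch below uses simulated values; x, y, and the numbers that generate them are assumptions for illustration, not the dips data.

```r
# Simulated stand-ins for two seasons of strikeout rates (hypothetical values)
set.seed(42)
x <- rnorm(200, mean = 0.17, sd = 0.045)
y <- 0.04 + 0.75 * x + rnorm(200, sd = 0.03)

zScore <- function(v) (v - mean(v)) / sd(v)
zx <- zScore(x)
zy <- zScore(y)

r_raw   <- cor(x, y)                       # correlation on the raw scale
r_z     <- cor(zx, zy)                     # correlation on the z-score scale
slope_z <- unname(coef(lm(zy ~ zx))[2])    # slope of the standardized regression

all.equal(r_raw, r_z)       # correlation is unchanged by z-scoring
all.equal(r_raw, slope_z)   # standardized slope equals r
```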
lm(Ym1_strikeout_rate ~ Y_strikeout_rate, data=dips)
Call:
lm(formula = Ym1_strikeout_rate ~ Y_strikeout_rate, data = dips)
Coefficients:
(Intercept) Y_strikeout_rate
0.04203 0.77064
cor(dips$Y_strikeout_rate, dips$Ym1_strikeout_rate)
[1] 0.7453852
cor(dips$Y_strikeout_rate, dips$Ym1_strikeout_rate)*sd(dips$Ym1_strikeout_rate)/sd(dips$Y_strikeout_rate)
[1] 0.7706374
So we know the slope, r*sd(y)/sd(x), and we know one point the line passes through, (mean(x), mean(y)).
This is enough information to get the equation of the best-fit line.
Note: \( \hat{y} = \) the y value predicted by the best-fit line
\[ \hat{y} = r \cdot \frac{sd(y)}{sd(x)} x + b \]
\[ m = r \cdot \frac{sd(y)}{sd(x)} \]
\[ \hat{y} - \bar{y} = m \cdot (x - \bar{x}) \]
Note: \( \bar{y} = mean(y) \) and \( \bar{x} = mean(x) \)
\[ \hat{y} = m \cdot x + \bar{y} - m \cdot \bar{x} \]
\[ b = \bar{y} - m \cdot \bar{x} \]
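The derivation above can be tried out directly: compute m and b from r, the standard deviations, and the means, then compare with what lm() reports. This is a minimal sketch on simulated data; x and y are hypothetical stand-ins, not the dips columns.

```r
# Simulated predictor and response (hypothetical values)
set.seed(7)
x <- rnorm(100, mean = 0.17, sd = 0.045)
y <- 0.04 + 0.75 * x + rnorm(100, sd = 0.03)

# Slope and intercept from the formulas above
m <- cor(x, y) * sd(y) / sd(x)
b <- mean(y) - m * mean(x)

# Slope and intercept from lm(), in the order (Intercept), x
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b, m))   # the two routes agree

# The fitted line passes through (mean(x), mean(y))
all.equal(b + m * mean(x), mean(y))
```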
The best-fit line minimizes the sum of the squared errors (and also the RMSE).
These errors are also called residuals:
\( residual_i = y_i - \hat{y}_i \)
SSE (sum of the squared errors) \( = \sum_i (y_i - \hat{y}_i)^2 \)
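To make the definitions concrete, the sketch below computes residuals, SSE, and RMSE by hand on simulated data, then nudges the intercept and the slope away from the fitted values to show that the SSE only goes up. The data and the helper sse_for() are hypothetical, and RMSE is taken here as the square root of the mean squared residual (other conventions divide by n - 2).

```r
# Simulated predictor and response (hypothetical values)
set.seed(7)
x <- rnorm(100, mean = 0.17, sd = 0.045)
y <- 0.04 + 0.75 * x + rnorm(100, sd = 0.03)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)        # predicted value for each observation
res   <- y - y_hat          # residual_i = y_i - y_hat_i

sse  <- sum(res^2)          # sum of squared errors
rmse <- sqrt(mean(res^2))   # root mean squared error (divides by n)

# SSE for an arbitrary line y = b + m * x
sse_for <- function(b, m) sum((y - (b + m * x))^2)

b_hat <- unname(coef(fit)[1])
m_hat <- unname(coef(fit)[2])

all.equal(sse_for(b_hat, m_hat), sse)   # the fitted line reproduces sse
sse_for(b_hat + 0.01, m_hat) > sse      # shifting the intercept increases SSE
sse_for(b_hat, m_hat + 0.1) > sse       # changing the slope increases SSE
```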
Remember the elevators?
(Intercept) Y_strikeout_rate
0.04203193 0.77063736
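As a check, these two coefficients can be recovered from the summary statistics printed earlier, using the values as printed. In this model y is Ym1_strikeout_rate and x is Y_strikeout_rate, so \( \bar{y} = 0.1676127 \), \( \bar{x} = 0.162957 \), \( sd(y) = 0.04591527 \), and \( sd(x) = 0.04441072 \).

\[ m = r \cdot \frac{sd(y)}{sd(x)} = 0.7453852 \cdot \frac{0.04591527}{0.04441072} \approx 0.7706 \]
\[ b = \bar{y} - m \cdot \bar{x} = 0.1676127 - 0.7706 \cdot 0.162957 \approx 0.0420 \]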
If a pitcher struck out 10% of batters in year Y (Y_strikeout_rate = 0.10), what does the fitted line predict for Ym1_strikeout_rate?
0.04203 + 0.77064 * 0.10
[1] 0.119094
And if they struck out 25% of batters?
0.04203 + 0.77064 * 0.25
[1] 0.23469
m <- lm(Ym1_strikeout_rate ~ Y_strikeout_rate, data=dips)
predict(m, list(Y_strikeout_rate=c(0.10, 0.25)))
1 2
0.1190957 0.2346913