Best Fit Lines

Sports Data Science

The DIPS data set

 [1] "pitcher"            "Y"                  "Ym1"               
 [4] "Ym1_IP"             "Ym1_strikeout_rate" "Ym1_walk_rate"     
 [7] "Ym1_homerun_rate"   "Ym1_BABIP"          "Y_IP"              
[10] "Y_strikeout_rate"   "Y_walk_rate"        "Y_homerun_rate"    
[13] "Y_BABIP"           

Strikeout rates, consecutive years - Basic

plot(dips$Ym1_strikeout_rate, dips$Y_strikeout_rate)

[Plot: base R scatterplot of Y_strikeout_rate against Ym1_strikeout_rate]

Strikeout rates, consecutive years - ggplot

library(ggplot2)  # provides ggplot(), aes(), geom_point()
ggplot(dips, aes(Ym1_strikeout_rate, Y_strikeout_rate)) + geom_point()

[Plot: ggplot2 scatterplot of Y_strikeout_rate against Ym1_strikeout_rate]

Converting to z-scores

mean(dips$Ym1_strikeout_rate); mean(dips$Y_strikeout_rate)
[1] 0.1676127
[1] 0.162957
sd(dips$Ym1_strikeout_rate); sd(dips$Y_strikeout_rate)
[1] 0.04591527
[1] 0.04441072
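
As a worked example of what these means and standard deviations are for: a z-score says how many standard deviations a value sits above or below the mean. The 0.20 below is a made-up previous-year strikeout rate chosen for illustration, not a pitcher from the data; the mean and sd are the Ym1 values printed above.

\[ z = \frac{x - \bar{x}}{sd(x)} = \frac{0.20 - 0.1676127}{0.04591527} \approx 0.705 \]

So a 20% strikeout rate would sit roughly 0.7 standard deviations above the previous-year average.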

Converting to z-scores: function

# Standardize x: subtract its mean, then divide by its standard deviation
zScore <- function(x) {
  (x - mean(x)) / sd(x)
}
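
A quick way to sanity-check a function like this is to run it on a tiny vector whose answer is easy to work out by hand. The sketch below repeats the slide's definition so it runs on its own; the vector `toy` is made up for illustration.

```r
# Same definition as the slide's zScore(): center on the mean, scale by the sd
zScore <- function(x) {
  (x - mean(x)) / sd(x)
}

# Toy input (illustrative only): the mean is 2 and the sample sd is 1
toy <- c(1, 2, 3)
z_toy <- zScore(toy)
print(z_toy)  # -1 0 1
```

Because the mean is 2 and the sd is 1, each value's z-score is simply its distance from 2.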

Converting to z-scores - mutate

library(dplyr)  # provides %>% and mutate()
dips <- dips %>% mutate(Y_zK = zScore(Y_strikeout_rate), Ym1_zK = zScore(Ym1_strikeout_rate))

Z-scores: checking the means and standard deviations

mean(dips$Ym1_zK); mean(dips$Y_zK)
[1] 5.54321e-17
[1] 4.737552e-17
sd(dips$Ym1_zK); sd(dips$Y_zK)
[1] 1
[1] 1
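
The means print as values like 5.5e-17 rather than exactly 0 because of floating-point rounding. One way to check "equal to zero, up to rounding" is `all.equal()`, which compares with a small tolerance. This is a sketch on simulated rates (the vector `rates` is hypothetical, not the dips data).

```r
# Simulated strikeout rates (hypothetical stand-in for a dips column)
set.seed(1)
rates <- rnorm(200, mean = 0.165, sd = 0.045)
z <- (rates - mean(rates)) / sd(rates)

# Compare with a tolerance instead of ==, since mean(z) may be a tiny nonzero number
mean_is_zero <- isTRUE(all.equal(mean(z), 0))
sd_is_one <- isTRUE(all.equal(sd(z), 1))
print(c(mean_is_zero = mean_is_zero, sd_is_one = sd_is_one))
```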

Strikeout rates, consecutive years - ggplot

ggplot(dips, aes(Ym1_zK, Y_zK)) + geom_point() + geom_smooth(method = "lm")

[Plot: scatterplot of Y_zK against Ym1_zK with the fitted lm line]

What's the equation for this line?

lm(Y_zK ~ Ym1_zK, data=dips)

Call:
lm(formula = Y_zK ~ Ym1_zK, data = dips)

Coefficients:
(Intercept)       Ym1_zK  
 -1.612e-18    7.454e-01  

\[ y = m \cdot x + b \]

\[ \text{Y\_zK} = m \cdot \text{Ym1\_zK} + b \]

\[ \text{Y\_zK} = 0.745 \cdot \text{Ym1\_zK} + 0 \]

Correlation = slope of z-scores best-fit lines

lm(Y_zK ~ Ym1_zK, data=dips)

Call:
lm(formula = Y_zK ~ Ym1_zK, data = dips)

Coefficients:
(Intercept)       Ym1_zK  
 -1.612e-18    7.454e-01  

... continued

cor(dips$Y_zK, dips$Ym1_zK)
[1] 0.7453852
cor(dips$Y_strikeout_rate, dips$Ym1_strikeout_rate)
[1] 0.7453852
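
The same identity can be checked on any data set, not just dips. The sketch below uses simulated rates; `prev_rate` and `next_rate` are hypothetical names, and the numbers used to generate them are arbitrary. It fits a line to the z-scores and compares the slope with the correlation of the raw values.

```r
# Simulated consecutive-year rates (hypothetical, not the dips data)
set.seed(42)
prev_rate <- rnorm(300, mean = 0.165, sd = 0.045)
next_rate <- 0.04 + 0.75 * prev_rate + rnorm(300, sd = 0.03)

zScore <- function(x) (x - mean(x)) / sd(x)

# Best-fit line on the z-scores
z_fit <- lm(zScore(next_rate) ~ zScore(prev_rate))
z_intercept <- unname(coef(z_fit)[1])
z_slope <- unname(coef(z_fit)[2])

# Correlation of the raw (unstandardized) values
r <- cor(prev_rate, next_rate)

print(c(z_slope = z_slope, r = r))  # the two values agree
```

Standardizing does not change the correlation, which is why `cor()` gives the same answer for the raw rates and for the z-scores.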

slope (m) = r*sd(y)/sd(x)

lm(Ym1_strikeout_rate ~ Y_strikeout_rate, data=dips)

Call:
lm(formula = Ym1_strikeout_rate ~ Y_strikeout_rate, data = dips)

Coefficients:
     (Intercept)  Y_strikeout_rate  
         0.04203           0.77064  

slope (m) = r*sd(y)/sd(x)

cor(dips$Y_strikeout_rate, dips$Ym1_strikeout_rate)
[1] 0.7453852
cor(dips$Y_strikeout_rate, dips$Ym1_strikeout_rate)*sd(dips$Ym1_strikeout_rate)/sd(dips$Y_strikeout_rate)
[1] 0.7706374

The best-fit line passes through both means

So, we know the slope, r*sd(y)/sd(x), and we know one point the line passes through, (mean(x), mean(y)).

This is enough information to get the equation of the best-fit line.

Deriving b, the intercept

Note: \( \hat{y} = \) the y value predicted by the best-fit line

\[ \hat{y} = r \cdot \frac{sd(y)}{sd(x)} x + b \]

\[ m = r \cdot \frac{sd(y)}{sd(x)} \]

\[ \hat{y} - \bar{y} = m \cdot (x - \bar{x}) \]

Note: \( \bar{y} = mean(y) \) and \( \bar{x} = mean(x) \)

\[ \hat{y} = m \cdot x + \bar{y} - m \cdot \bar{x} \]

\[ b = \bar{y} - m \cdot \bar{x} \]
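
To see the derivation produce real numbers, the sketch below plugs in the summary statistics printed earlier in the deck. It assumes the same roles as the model fit above, where the response y is Ym1_strikeout_rate and the predictor x is Y_strikeout_rate; the variable names are just for illustration.

```r
# Summary statistics copied from the output earlier in the deck
r      <- 0.7453852   # cor() of the two strikeout-rate columns
mean_x <- 0.162957    # mean(dips$Y_strikeout_rate), the predictor in the fitted model
mean_y <- 0.1676127   # mean(dips$Ym1_strikeout_rate), the response in the fitted model
sd_x   <- 0.04441072  # sd(dips$Y_strikeout_rate)
sd_y   <- 0.04591527  # sd(dips$Ym1_strikeout_rate)

m <- r * sd_y / sd_x        # slope
b <- mean_y - m * mean_x    # intercept: forces the line through (mean_x, mean_y)

# Rounded values are about 0.04203 and 0.77064, matching the lm() coefficients
print(round(c(intercept = b, slope = m), 5))
```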

What does best-fit mean?

The best-fit line minimizes the sum of the squared errors (and therefore also the RMSE, which is the square root of the mean squared error).

These errors are also called residuals:

\( \text{residual}_i = y_i - \hat{y}_i \)

SSE (sum of the squared errors) \( = \sum_i (y_i - \hat{y}_i)^2 \)
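
One way to make "minimizes the SSE" concrete is to compute the SSE for the fitted line and for a couple of nearby lines. This is a sketch on simulated data (`x`, `y`, and the helper `sse()` are hypothetical, not part of the deck); any line other than the lm() fit should give a larger SSE.

```r
# Simulated data (hypothetical, not the dips data)
set.seed(7)
x <- rnorm(100, mean = 0.165, sd = 0.045)
y <- 0.04 + 0.75 * x + rnorm(100, sd = 0.03)

# SSE for an arbitrary line: sum of squared residuals y_i - yhat_i
sse <- function(intercept, slope) sum((y - (intercept + slope * x))^2)

fit <- lm(y ~ x)
b <- unname(coef(fit)[1])
m <- unname(coef(fit)[2])

sse_best    <- sse(b, m)          # the least-squares line
sse_steeper <- sse(b, m + 0.1)    # same intercept, steeper slope
sse_shifted <- sse(b + 0.01, m)   # same slope, shifted up

# RMSE is a monotone function of SSE, so the same line minimizes both
rmse_best <- sqrt(sse_best / length(y))

print(c(best = sse_best, steeper = sse_steeper, shifted = sse_shifted))
```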

Remember the elevators?

How do we use this?

     (Intercept) Y_strikeout_rate 
      0.04203193       0.77063736 

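If someone struck out 10% of batters one year, what do we expect the next year? (Caution: as fit above, the model has Ym1_strikeout_rate as the response and Y_strikeout_rate as the predictor, so it maps a year-Y rate to the previous year's rate. To predict forward, fit Y_strikeout_rate ~ Ym1_strikeout_rate instead.)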

0.04203 + 0.77064 * 0.10
[1] 0.119094

And if they struck out 25% of batters?

0.04203 + 0.77064 * 0.25
[1] 0.23469

And, if we're lazy (as we should be)...

m <- lm(Ym1_strikeout_rate ~ Y_strikeout_rate, data=dips)

predict(m, list(Y_strikeout_rate=c(0.10, 0.25)))
        1         2 
0.1190957 0.2346913
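
The sketch below shows that `predict()` does the same arithmetic as plugging values into intercept + slope * x. It uses a simulated data frame `sim` with hypothetical columns `prev_rate` and `next_rate`, not the dips data. It also assumes the goal is to predict forward in time, so the later year goes on the left of `~`; whichever variable is on the left is the one `predict()` returns.

```r
# Simulated stand-in for dips (hypothetical column names and values)
set.seed(3)
sim <- data.frame(prev_rate = rnorm(250, mean = 0.165, sd = 0.045))
sim$next_rate <- 0.04 + 0.75 * sim$prev_rate + rnorm(250, sd = 0.03)

# Response (left of ~) is the later year, so this model predicts forward
fwd <- lm(next_rate ~ prev_rate, data = sim)

new_rates <- c(0.10, 0.25)

# By hand: intercept + slope * x
by_hand <- unname(coef(fwd)[1] + coef(fwd)[2] * new_rates)

# The lazy way: newdata must use the predictor's column name
by_predict <- unname(predict(fwd, newdata = data.frame(prev_rate = new_rates)))

print(rbind(by_hand, by_predict))  # the two rows agree
```

Passing `newdata` as a data frame with the predictor's exact column name is the conventional form; a misnamed column is a common source of errors with `predict()`.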