Milestone 4

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ stringr 1.4.0
## ✓ tidyr   1.1.4     ✓ forcats 0.5.1
## ✓ readr   2.1.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggplot2)
library(broom)

Main Data Set

pitches <- read_csv("Downloads/pitches.csv")

## Rows: 2867154 Columns: 40

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): code, type, pitch_type
## dbl (37): px, pz, start_speed, end_speed, spin_rate, spin_dir, break_angle, ...

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

pitch <- pitches

pitches <- data.frame(
  px=pitch$px, pz=pitch$pz,
  start_speed=pitch$start_speed,
  spin_rate=pitch$spin_rate,
  spin_dir=pitch$spin_dir,
  break_angle=pitch$break_angle,
  break_length=pitch$break_length,
  break_y=pitch$break_y,
  type_confidence=pitch$type_confidence,
  pfx_x=pitch$pfx_x,
  pfx_z=pitch$pfx_z,
  nasty=pitch$nasty,
  zone=pitch$zone,
  code=pitch$code,
  type=pitch$type,
  pitch_type=pitch$pitch_type,
  b_count=pitch$b_count,
  s_count=pitch$s_count,
  outs=pitch$outs,
  pitch_num=pitch$pitch_num,
  on_1b=pitch$on_1b,
  on_2b=pitch$on_2b,
  on_3b=pitch$on_3b
)

# Keep only the last pitch of each at-bat, with a certain call (type_confidence == 2)
last_pitches <- pitches %>%
  filter(code %in% c("X", "D", "E", "H"), type_confidence == 2) %>%
  na.omit()

unique(last_pitches$pitch_type)

##  [1] "FF" "FC" "FT" "SI" "CH" "SL" "CU" "KC" "FS" "KN" "EP" "FA" "FO" "SC"

unique(last_pitches$type)

## [1] "X" "B"

#last_pitches$pitch_type<-factor(last_pitches$pitch_type, levels=orderPitch$pitch_type[order(orderPitch$avg)])

last_pitches <- last_pitches %>%
  filter(break_length<50000)

Describe the units for these variables and for the categorical variable describe the levels.

Response Variable: start_speed. Units: mph. The initial speed of a pitch is measured in miles per hour (mph).

Categorical Predictor: s_count. Levels: 3. The strike count has three possible values: 0, 1, or 2.

Numeric Predictor: spin_rate. Units: rpm. The spin rate is measured in revolutions per minute (rpm).

str(last_pitches)

## 'data.frame':    308773 obs. of  23 variables:
##  $ px             : num  0.627 -0.286 0.008 0.2 -0.432 -0.389 -0.22 0.487 0.088 -0.382 ...
##  $ pz             : num  2.4 1.83 2.6 2.21 2.44 ...
##  $ start_speed    : num  92.9 92.6 87.5 87.5 86.8 87.5 87.2 89.2 91.1 88.8 ...
##  $ spin_rate      : num  2744 2475 1308 846 1081 ...
##  $ spin_dir       : num  148 137 167 144 173 ...
##  $ break_angle    : num  -45.7 -39 -8.1 -10.3 0 2.1 -13 5.3 -42.9 -6.2 ...
##  $ break_length   : num  3.7 4.8 5.4 6.8 5.7 5.8 5.3 4.5 4.5 5.7 ...
##  $ break_y        : num  23.7 23.7 23.8 23.8 23.9 23.8 23.8 23.8 23.7 23.8 ...
##  $ type_confidence: num  2 2 2 2 2 2 2 2 2 2 ...
##  $ pfx_x          : num  7.32 8.56 1.56 2.65 0.75 0.28 2.8 -1.55 8.59 2.36 ...
##  $ pfx_z          : num  11.72 9.19 6.73 3.6 5.64 ...
##  $ nasty          : num  42 48 18 32 38 46 34 38 39 40 ...
##  $ zone           : num  6 13 5 5 4 4 5 9 5 7 ...
##  $ code           : chr  "X" "E" "D" "X" ...
##  $ type           : chr  "X" "X" "X" "X" ...
##  $ pitch_type     : chr  "FF" "FF" "FC" "FC" ...
##  $ b_count        : num  2 2 1 2 0 0 2 0 0 2 ...
##  $ s_count        : num  2 0 0 1 2 0 0 1 0 2 ...
##  $ outs           : num  0 1 0 2 1 2 0 1 0 0 ...
##  $ pitch_num      : num  6 3 2 4 3 1 3 2 1 7 ...
##  $ on_1b          : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ on_2b          : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ on_3b          : num  0 0 0 1 0 1 0 0 0 0 ...

B) (10 points) Fit a simple linear model with a response variable and the numeric predictor that you chose. Does the relationship appear to be significant?

Make sure to also include a graphic.

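The p-value for the spin_rate coefficient (and for the overall F-test) is less than 2e-16, far below .05, so the linear relationship between spin rate and start speed is statistically significant. The R-squared of 0.3125 indicates that spin rate explains about 31% of the variation in start speed.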

pitchMod <- lm(start_speed~spin_rate, last_pitches)
summary(pitchMod)

## 
## Call:
## lm(formula = start_speed ~ spin_rate, data = last_pitches)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.266  -2.404   0.859   3.252  15.874 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.039e+01  2.378e-02  3380.9   <2e-16 ***
## spin_rate   4.752e-03  1.269e-05   374.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.78 on 308771 degrees of freedom
## Multiple R-squared:  0.3125, Adjusted R-squared:  0.3124 
## F-statistic: 1.403e+05 on 1 and 308771 DF,  p-value: < 2.2e-16

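ggplot(last_pitches, aes(x = spin_rate, y = start_speed)) +
  geom_point() +
  ggtitle("Velocity vs Spin Rate") +
  theme_bw() +
  xlab("Spin Rate (rpm)") +
  ylab("Velocity (mph)") +
  # Overlay the fitted line using the model's coefficients rather than hard-coded values
  geom_abline(slope = coef(pitchMod)[2], intercept = coef(pitchMod)[1])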
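
To make the fitted line in part (B) concrete, here is a minimal sketch that plugs a few spin rates into the estimated equation. The coefficients are copied (rounded as printed) from the summary output above; `predict_speed` is a hypothetical helper name introduced only for this illustration, and the spin rates are arbitrary example values.

```r
# Coefficients copied from the summary(pitchMod) output (rounded as printed).
b0 <- 80.39      # intercept, in mph
b1 <- 0.004752   # slope, in mph per rpm

# Hypothetical helper: predicted start speed (mph) for a given spin rate (rpm).
predict_speed <- function(spin_rate) b0 + b1 * spin_rate

# Predicted start speeds at three example spin rates.
predict_speed(c(1000, 2000, 3000))

# Change in predicted start speed for a 100 rpm increase in spin rate.
per_100_rpm <- b1 * 100
per_100_rpm
```

With the model object still in memory, `predict(pitchMod, newdata = data.frame(spin_rate = c(1000, 2000, 3000)))` should give essentially the same values without the rounding.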

C) (5 points) Now, write the “dummy” variable coding for your categorical variable. (Hint: the contrasts() function might help).

# Make s_count categorical
last_pitches$num_s <- as.factor(ifelse(last_pitches$s_count == 0, '0S',
                                ifelse(last_pitches$s_count == 1, '1S',
                                ifelse(last_pitches$s_count == 2, '2S', NA))))
summary(last_pitches$num_s)

##     0S     1S     2S 
##  87578 105237 115958

contrasts(last_pitches$num_s)

##    1S 2S
## 0S  0  0
## 1S  1  0
## 2S  0  1
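
As a small illustration of what this coding means in a model, the sketch below builds the design matrix for a toy factor with the same three levels. The data frame `toy` and its four rows are made up for the example; only the level names come from the output above.

```r
# Toy factor with the same three levels as num_s (the rows are made up).
toy <- data.frame(num_s = factor(c("0S", "1S", "2S", "1S")))

# model.matrix() expands the factor using its contrasts: an intercept column
# plus one 0/1 indicator column for each non-reference level.
X <- model.matrix(~ num_s, data = toy)
X
```

The reference level 0S has zeros in both indicator columns, so its mean is absorbed into the intercept and the other two coefficients are differences from it.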

D) (15 points) Fit a linear model with a response variable and a categorical explanatory variable. Does it appear that there are differences among the means of levels of the categorical variable? (Hint: Look at the ANOVA F-test). Be sure to include an appropriate graphic (i.e. side-by-side boxplot)

The F-statistic in the ANOVA is 1197.7 on 2 and 308770 degrees of freedom, with a p-value below 2.2e-16. We therefore conclude that mean start speed differs for at least one of the levels of the categorical variable.

pitchMod2 <- lm(start_speed~num_s, last_pitches)
anova(pitchMod2)

## Analysis of Variance Table
## 
## Response: start_speed
##               Df   Sum Sq Mean Sq F value    Pr(>F)    
## num_s          2    79000   39500  1197.7 < 2.2e-16 ***
## Residuals 308770 10182991      33                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ggplot(last_pitches, aes(y=start_speed, x=num_s, fill=num_s))+
  geom_boxplot()
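
The F-statistic in the ANOVA table is the ratio of the between-level mean square to the residual mean square. A minimal sketch of that arithmetic follows, using the sums of squares and degrees of freedom copied from the table; the sums of squares are printed rounded, so the result only matches the reported 1197.7 approximately.

```r
# Values copied from the anova(pitchMod2) table (sums of squares are rounded).
ss_between <- 79000      # Sum Sq for num_s
df_between <- 2
ss_resid   <- 10182991   # Sum Sq for Residuals
df_resid   <- 308770

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between <- ss_between / df_between
ms_resid   <- ss_resid / df_resid

# F-statistic and its upper-tail p-value under the F distribution.
f_stat  <- ms_between / ms_resid
p_value <- pf(f_stat, df_between, df_resid, lower.tail = FALSE)

f_stat
p_value
```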

E) (15 points) Now fit a multiple linear model that combines parts (b) and (d), with both the numeric and categorical variables. What are the estimated models for the different levels? Include a graphic of the scatter plot with lines overlaid for each level.

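The estimated models share a slope of 0.004729 and differ only in intercept: for 0 strikes, start_speed = 80.71 + 0.004729 × spin_rate; for 1 strike, start_speed = 80.27 + 0.004729 × spin_rate; for 2 strikes, start_speed = 80.35 + 0.004729 × spin_rate. Because the intercepts differ by less than half a mph, the three lines nearly overlap and the plot looks as if it has only one line.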

mult_mod <- lm(start_speed ~ spin_rate + num_s, last_pitches)
summary(mult_mod)

## 
## Call:
## lm(formula = start_speed ~ spin_rate + num_s, data = last_pitches)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.230  -2.407   0.852   3.251  15.928 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.071e+01  2.861e-02 2820.47   <2e-16 ***
## spin_rate    4.729e-03  1.275e-05  370.80   <2e-16 ***
## num_s1S     -4.372e-01  2.189e-02  -19.97   <2e-16 ***
## num_s2S     -3.532e-01  2.151e-02  -16.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.777 on 308769 degrees of freedom
## Multiple R-squared:  0.3134, Adjusted R-squared:  0.3134 
## F-statistic: 4.699e+04 on 3 and 308769 DF,  p-value: < 2.2e-16

mult_mod$coefficients[2]

##   spin_rate 
## 0.004728792

ggplot(last_pitches, aes(x=spin_rate, y=start_speed, color=num_s))+
  geom_point()+
  geom_abline(intercept = mult_mod$coefficients[1], slope=mult_mod$coefficients[2],
              color="red", lwd=1)+
  geom_abline(intercept = mult_mod$coefficients[1]+mult_mod$coefficients[3], slope=mult_mod$coefficients[2],
             color="forestgreen", lwd=1)+
  geom_abline(intercept = mult_mod$coefficients[1]+mult_mod$coefficients[4], slope=mult_mod$coefficients[2],
              color="blue", lwd=1)

mult_mod$coefficients[1]

## (Intercept) 
##    80.70702

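# Intercept for the 1S line: baseline intercept + num_s1S coefficient
mult_mod$coefficients[1]+mult_mod$coefficients[3]

## (Intercept) 
##    80.26985

# Intercept for the 2S line: baseline intercept + num_s2S coefficient
# (about 80.35, from the summary above)
mult_mod$coefficients[1]+mult_mod$coefficients[4]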

F) (15 points) Finally, fit a multiple linear model that also includes the interaction between the numeric and categorical variables, which allows for different slopes. What are the estimated models for the different levels? Include a graphic of the scatter plot with lines overlaid for each level.

mult_mod2 <- lm(start_speed ~ spin_rate*num_s, data = last_pitches)
summary(mult_mod2)

## 
## Call:
## lm(formula = start_speed ~ spin_rate * num_s, data = last_pitches)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.158  -2.406   0.851   3.249  15.962 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        8.075e+01  4.949e-02 1631.732  < 2e-16 ***
## spin_rate          4.704e-03  2.525e-05  186.302  < 2e-16 ***
## num_s1S           -4.118e-01  6.436e-02   -6.399 1.56e-10 ***
## num_s2S           -4.797e-01  6.127e-02   -7.829 4.92e-15 ***
## spin_rate:num_s1S -1.598e-05  3.348e-05   -0.477   0.6331    
## spin_rate:num_s2S  7.314e-05  3.218e-05    2.272   0.0231 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.777 on 308767 degrees of freedom
## Multiple R-squared:  0.3134, Adjusted R-squared:  0.3134 
## F-statistic: 2.819e+04 on 5 and 308767 DF,  p-value: < 2.2e-16

s1 <- mult_mod2$coefficients[2]
s2 <- mult_mod2$coefficients[2]+mult_mod2$coefficients[5]
s3 <- mult_mod2$coefficients[2]+mult_mod2$coefficients[6]
i1 <- mult_mod2$coefficients[1]
i2 <- mult_mod2$coefficients[1]+mult_mod2$coefficients[3]
i3 <- mult_mod2$coefficients[1]+mult_mod2$coefficients[4]

ggplot(data=last_pitches, aes(x=spin_rate, y=start_speed, color=num_s))+
  geom_point()+
  ggtitle("Scatterplot of Spin Rate vs Velocity of Pitches")+
  theme_bw()+
  geom_abline(slope=s1, intercept = i1, col=2)+
  geom_abline(slope=s2, intercept = i2, col=4)+
  geom_abline(slope=s3, intercept = i3, col=6)
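
To spell out the estimated line for each strike count, the sketch below combines the coefficients from the summary of the interaction model (copied as printed, so rounded) into one intercept and one slope per level. These correspond to i1 to i3 and s1 to s3 above; the names `b` and `lines_interaction` are introduced only for this illustration.

```r
# Coefficients copied from summary(mult_mod2) (rounded as printed).
b <- c(
  intercept    = 80.75,
  spin_rate    = 0.004704,
  num_s1S      = -0.4118,
  num_s2S      = -0.4797,
  spin_num_s1S = -0.00001598,   # spin_rate:num_s1S
  spin_num_s2S = 0.00007314     # spin_rate:num_s2S
)

# Each level's line is the 0S baseline plus that level's offsets.
lines_interaction <- data.frame(
  level     = c("0S", "1S", "2S"),
  intercept = unname(b["intercept"] + c(0, b["num_s1S"], b["num_s2S"])),
  slope     = unname(b["spin_rate"] + c(0, b["spin_num_s1S"], b["spin_num_s2S"]))
)
lines_interaction
```

The three slopes differ only in the fifth decimal place, which is why the lines in the plot are nearly parallel.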

G) (15 points) Compare the models from parts (B), (D), (E), and (F).

○ Calculate the MSEs

○ Discuss model differences

MSEs (residual mean squares from the ANOVA tables below, rounded): B: 23, D: 33, E: 23, F: 23

The first model (B) regresses start speed on spin rate alone. It has an adjusted R-squared of 0.3124, a y-intercept of 80.39, and a slope of 0.004752.

The second model (D) predicts start speed from the number of strikes in the count. It is the most distinctive of our four because its only explanatory variable is categorical. It is also the only model without an MSE of about 23; its higher value of 33 means that it predicts actual start speeds worse than any of the other three.

Our third model (E) uses both spin rate and the number of strikes to predict start speed. Its MSE matches that of the spin-rate-only model after rounding, but it has a marginally higher adjusted R-squared of 0.3134 and a marginally higher y-intercept of 80.71 (for the 0-strike baseline), along with a slightly lower slope of 0.004729.

Our final model (F) adds the interaction between spin rate and the number of strikes. It is extremely similar to our first and third models and shares their MSE of about 23. The subtle differences are in the baseline y-intercept, which is slightly higher than in both other models at 80.75, and the baseline slope, which is slightly lower at 0.004704. Its adjusted R-squared of 0.3134 is the same as that of the model without the interaction.

anova(pitchMod)

## Analysis of Variance Table
## 
## Response: start_speed
##               Df  Sum Sq Mean Sq F value    Pr(>F)    
## spin_rate      1 3206362 3206362  140318 < 2.2e-16 ***
## Residuals 308771 7055628      23                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(pitchMod2)

## Analysis of Variance Table
## 
## Response: start_speed
##               Df   Sum Sq Mean Sq F value    Pr(>F)    
## num_s          2    79000   39500  1197.7 < 2.2e-16 ***
## Residuals 308770 10182991      33                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(mult_mod)

## Analysis of Variance Table
## 
## Response: start_speed
##               Df  Sum Sq Mean Sq   F value    Pr(>F)    
## spin_rate      1 3206362 3206362 140516.47 < 2.2e-16 ***
## num_s          2   10011    5006    219.37 < 2.2e-16 ***
## Residuals 308769 7045617      23                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(mult_mod2)

## Analysis of Variance Table
## 
## Response: start_speed
##                     Df  Sum Sq Mean Sq    F value    Pr(>F)    
## spin_rate            1 3206362 3206362 1.4052e+05 < 2.2e-16 ***
## num_s                2   10011    5006 2.1937e+02 < 2.2e-16 ***
## spin_rate:num_s      2     234     117 5.1364e+00  0.005879 ** 
## Residuals       308767 7045383      23                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
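
The MSEs reported in part (G) can be recomputed from the Residuals rows of the four ANOVA tables above, since each MSE is the residual sum of squares divided by the residual degrees of freedom. A minimal sketch follows, with the values copied from the tables; the data frame name `resid_tab` is introduced only for this illustration, and the printed sums of squares are rounded.

```r
# Residual sums of squares and degrees of freedom copied from the ANOVA tables.
resid_tab <- data.frame(
  model = c("B: spin_rate", "D: num_s", "E: spin_rate + num_s", "F: spin_rate * num_s"),
  rss   = c(7055628, 10182991, 7045617, 7045383),
  df    = c(308771, 308770, 308769, 308767)
)

# MSE is the residual sum of squares divided by the residual degrees of freedom.
resid_tab$mse <- resid_tab$rss / resid_tab$df

# Its square root is the residual standard error reported by summary().
resid_tab$rse <- sqrt(resid_tab$mse)
resid_tab
```

With the fitted models still in memory, `sigma(pitchMod)^2` (and likewise for the other three models) should give the same quantities without the rounding.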

H) (15 points) Conclusion: What did you learn from this exercise? Were any of the relationships significant? (Note: This would be great to include in your final project write up!)

Because the values in the pitches data set are so similar, the slight differences between strike counts are difficult to display visually. Spin rate, strike count, and their interaction were all statistically significant in the ANOVA tables, but the results were not quite what we expected. Specifically, pitchers actually throw hardest when there are no strikes in the count. This is surprising because a plot we made during an earlier milestone that considered only fastballs suggested that pitchers throw harder the further ahead they are in the count.

This discrepancy could have several causes, but it is most likely due to pitch choice. Pitchers commonly throw fastballs early in counts, then throw off-speed pitches once they are ahead to try to get swings-and-misses and strikeouts. So our data most likely does not say that pitchers throw harder earlier in counts; rather, it says that pitchers throw pitch types that tend to be faster (fastballs) earlier in counts. Additionally, the mean start speeds for each strike count are so similar, and our data set so large, that adding strike count and its interaction with spin rate changed our model very little.