Chapter X: Exploring Pitch Data with R (DataCamp)

X.1 Exploring Pitch Velocities

Question

Velocity is a key component in the arsenal of many pitchers. Was there an uptick in Zack Greinke’s velocity during his impressive July in 2015?

Import the greinke dataset.

greinke <- read.csv("data/greinke.csv")

X.1.1 Clean the data

Ensure the data are clean. The greinke dataset contains information about every pitch thrown by Zack Greinke during the 2015 regular season. These data include over 3,200 individual pitches and 29 variables associated with the game date, inning, location, velocity, movement, pitch type, pitch results, at bat results, and more.

# Print the first 6 rows of the data
head(greinke)
##         p_name pitcher_id batter_stand pitch_type    pitch_result
## 1 Zack Greinke     425844            R         FF            Ball
## 2 Zack Greinke     425844            R         FF Swinging Strike
## 3 Zack Greinke     425844            R         FF   Called Strike
## 4 Zack Greinke     425844            R         SL Swinging Strike
## 5 Zack Greinke     425844            R         FF Swinging Strike
## 6 Zack Greinke     425844            R         SL Swinging Strike
##   atbat_result start_speed    z0     x0  pfx_x  pfx_z    px    pz
## 1         Walk        94.2 5.997 -0.675 -4.457  9.760 1.714 1.925
## 2       Single        92.4 6.281 -0.760 -1.590 11.400 0.589 3.271
## 3     Home Run        92.7 6.168 -0.958 -1.884  9.245 0.399 2.918
## 4    Strikeout        86.9 6.077 -0.939  3.594  0.762 0.764 1.306
## 5    Strikeout        92.8 6.107 -0.524 -0.558 11.134 1.517 2.193
## 6    Strikeout        87.8 6.321 -0.948  4.313  0.132 0.695 3.431
##   break_angle break_length spin_rate spin_dir balls strikes outs game_date
## 1        24.8          3.5  2188.802  204.457     2       2    2 10/3/2015
## 2        10.1          2.7  2312.202  187.913     1       1    0 10/3/2015
## 3         9.2          3.5  1889.841  191.468     0       0    1 10/3/2015
## 4       -11.4          8.0   693.649  102.648     1       2    0 10/3/2015
## 5        -0.4          2.8  2242.916  182.859     1       2    0 10/3/2015
## 6       -13.6          7.8   828.693   92.330     2       2    1 10/3/2015
##   inning inning_topbot batted_ball_type batted_ball_velocity   hc_x  hc_y
## 1      4           top                                    NA   0.00  0.00
## 2      3           top                                   104 123.56 97.26
## 3      5           top                                   103  50.88 31.17
## 4      6           top                                    NA   0.00  0.00
## 5      8           top                                    NA   0.00  0.00
## 6      1           top                                    NA   0.00  0.00
##   pitch_id distance_feet
## 1      160            NA
## 2       95             0
## 3      218           425
## 4      265            NA
## 5      374            NA
## 6       14            NA
# Print the number of rows in the data frame
nrow(greinke)
## [1] 3239
# Summarize the start_speed variable
summary(greinke$start_speed)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   52.20   87.30   89.80   88.44   91.80   95.40       3
# Get rid of data without start_speed
greinke <- subset(greinke, !is.na(start_speed))

# Print the number of complete entries
nrow(greinke)
## [1] 3236
# Print the structure of greinke
str(greinke)
## 'data.frame':    3236 obs. of  29 variables:
##  $ p_name              : Factor w/ 1 level "Zack Greinke": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pitcher_id          : int  425844 425844 425844 425844 425844 425844 425844 425844 425844 425844 ...
##  $ batter_stand        : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
##  $ pitch_type          : Factor w/ 8 levels "","CH","CU","EP",..: 5 5 5 8 5 8 2 5 8 8 ...
##  $ pitch_result        : Factor w/ 15 levels "Ball","Ball In Dirt",..: 1 14 3 14 14 14 15 3 4 14 ...
##  $ atbat_result        : Factor w/ 24 levels "Bunt Groundout",..: 24 20 12 21 21 21 21 21 10 24 ...
##  $ start_speed         : num  94.2 92.4 92.7 86.9 92.8 87.8 90.3 92.7 85.5 87.3 ...
##  $ z0                  : num  6 6.28 6.17 6.08 6.11 ...
##  $ x0                  : num  -0.675 -0.76 -0.958 -0.939 -0.524 ...
##  $ pfx_x               : num  -4.457 -1.59 -1.884 3.594 -0.558 ...
##  $ pfx_z               : num  9.76 11.4 9.245 0.762 11.134 ...
##  $ px                  : num  1.714 0.589 0.399 0.764 1.517 ...
##  $ pz                  : num  1.93 3.27 2.92 1.31 2.19 ...
##  $ break_angle         : num  24.8 10.1 9.2 -11.4 -0.4 -13.6 22.5 25.1 -8.4 -11.3 ...
##  $ break_length        : num  3.5 2.7 3.5 8 2.8 7.8 7.4 3.8 7.5 7.4 ...
##  $ spin_rate           : num  2189 2312 1890 694 2243 ...
##  $ spin_dir            : num  204 188 191 103 183 ...
##  $ balls               : int  2 1 0 1 1 2 1 0 0 0 ...
##  $ strikes             : int  2 1 0 2 2 2 2 2 0 1 ...
##  $ outs                : int  2 0 1 0 0 1 1 2 2 2 ...
##  $ game_date           : Factor w/ 32 levels "10/3/2015","4/12/2015",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ inning              : int  4 3 5 6 8 1 6 5 8 4 ...
##  $ inning_topbot       : Factor w/ 2 levels "bot","top": 2 2 2 2 2 2 2 2 2 2 ...
##  $ batted_ball_type    : Factor w/ 5 levels "","FB","GB","LD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ batted_ball_velocity: int  NA 104 103 NA NA NA NA NA NA NA ...
##  $ hc_x                : num  0 123.6 50.9 0 0 ...
##  $ hc_y                : num  0 97.3 31.2 0 0 ...
##  $ pitch_id            : int  160 95 218 265 374 14 279 231 386 156 ...
##  $ distance_feet       : int  NA 0 425 NA NA NA NA NA NA NA ...
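
The row counts above (3239 before, 3236 after) match the 3 NA's reported by summary(). A minimal sketch of the same check on a toy data frame follows; the pitches data frame and its values are made up for illustration and are not part of the greinke data.

```r
# Toy data frame standing in for greinke (values are made up for illustration)
pitches <- data.frame(
  pitch_id    = 1:5,
  start_speed = c(94.2, NA, 92.7, 86.9, NA)
)

# Count the missing velocities before dropping anything
n_missing <- sum(is.na(pitches$start_speed))

# Keep only rows with a recorded start_speed, as in the step above
clean <- subset(pitches, !is.na(start_speed))

# The number of rows dropped should equal the number of NAs counted
n_dropped <- nrow(pitches) - nrow(clean)
n_dropped == n_missing
```

Comparing the two counts is a cheap way to confirm that the subset removed only the rows it was meant to remove.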

X.1.2 Check dates

Convert the game_date variable to a date. Because you’ll use the time dimension for some exploration later on, you want R to know that these are dates.

# Check if dates are formatted as dates
class(greinke$game_date)
## [1] "factor"
# Change them to dates
greinke$game_date <- as.Date(greinke$game_date, format = "%m/%d/%Y")

# Check that the variable is now formatted as a date
class(greinke$game_date)
## [1] "Date"
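
The str() output above shows game_date as a factor, which is what read.csv() produced by default before R 4.0.0; from 4.0.0 on, the default is a character column. The sketch below, on made-up date strings, suggests that the same as.Date() call should work for either input type.

```r
# Dates stored as month/day/year text, as in the raw file (toy values)
raw_chr <- c("10/3/2015", "4/12/2015", "7/4/2015")
raw_fct <- factor(raw_chr)

# as.Date() has methods for both character and factor input
d_chr <- as.Date(raw_chr, format = "%m/%d/%Y")
d_fct <- as.Date(raw_fct, format = "%m/%d/%Y")

class(d_chr)          # "Date"
all(d_chr == d_fct)   # both inputs give the same dates

# Date objects sort chronologically, unlike the text labels
sort(d_chr)
```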

X.1.3 Delimit dates

Split up the dates so that you have a month, day, and year variable in the data. This will help you group the data for later analyses.

# DataCamp loads tidyr in the background; load it explicitly here
library(tidyr)

# Separate game_date into "year", "month", and "day"
greinke <- separate(data = greinke, col = game_date,
                    into = c("year", "month", "day"),
                    sep = "-", remove = FALSE)

# Convert month to numeric
greinke$month <- as.numeric(greinke$month)

# Create the july variable
greinke$july <- ifelse(greinke$month == 7, "july", "other")

# View the head() of greinke
head(greinke)
##         p_name pitcher_id batter_stand pitch_type    pitch_result
## 1 Zack Greinke     425844            R         FF            Ball
## 2 Zack Greinke     425844            R         FF Swinging Strike
## 3 Zack Greinke     425844            R         FF   Called Strike
## 4 Zack Greinke     425844            R         SL Swinging Strike
## 5 Zack Greinke     425844            R         FF Swinging Strike
## 6 Zack Greinke     425844            R         SL Swinging Strike
##   atbat_result start_speed    z0     x0  pfx_x  pfx_z    px    pz
## 1         Walk        94.2 5.997 -0.675 -4.457  9.760 1.714 1.925
## 2       Single        92.4 6.281 -0.760 -1.590 11.400 0.589 3.271
## 3     Home Run        92.7 6.168 -0.958 -1.884  9.245 0.399 2.918
## 4    Strikeout        86.9 6.077 -0.939  3.594  0.762 0.764 1.306
## 5    Strikeout        92.8 6.107 -0.524 -0.558 11.134 1.517 2.193
## 6    Strikeout        87.8 6.321 -0.948  4.313  0.132 0.695 3.431
##   break_angle break_length spin_rate spin_dir balls strikes outs
## 1        24.8          3.5  2188.802  204.457     2       2    2
## 2        10.1          2.7  2312.202  187.913     1       1    0
## 3         9.2          3.5  1889.841  191.468     0       0    1
## 4       -11.4          8.0   693.649  102.648     1       2    0
## 5        -0.4          2.8  2242.916  182.859     1       2    0
## 6       -13.6          7.8   828.693   92.330     2       2    1
##    game_date year month day inning inning_topbot batted_ball_type
## 1 2015-10-03 2015    10  03      4           top                 
## 2 2015-10-03 2015    10  03      3           top                 
## 3 2015-10-03 2015    10  03      5           top                 
## 4 2015-10-03 2015    10  03      6           top                 
## 5 2015-10-03 2015    10  03      8           top                 
## 6 2015-10-03 2015    10  03      1           top                 
##   batted_ball_velocity   hc_x  hc_y pitch_id distance_feet  july
## 1                   NA   0.00  0.00      160            NA other
## 2                  104 123.56 97.26       95             0 other
## 3                  103  50.88 31.17      218           425 other
## 4                   NA   0.00  0.00      265            NA other
## 5                   NA   0.00  0.00      374            NA other
## 6                   NA   0.00  0.00       14            NA other
# Print a summary of the july variable
summary(factor(greinke$july))
##  july other 
##   524  2712
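
If tidyr is not available, the month can also be pulled straight from a Date column with base R's format(). This is an alternative sketch on toy dates, not the approach used in the course.

```r
# Toy dates standing in for greinke$game_date
game_date <- as.Date(c("2015-10-03", "2015-07-04", "2015-04-12"))

# format() extracts date parts as text; convert the month to a number
month <- as.numeric(format(game_date, "%m"))

# Flag July pitches the same way as above
july <- ifelse(month == 7, "july", "other")

table(july)
```

The trade-off is that this yields only the parts you ask for, whereas separate() adds year, month, and day columns in one call.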

X.1.4 Velocity distribution

Now that the data are ready to go, look at the distribution of start_speed (velocity in miles per hour) of Greinke’s pitches.

# Make a histogram of Greinke's start speed
hist(greinke$start_speed)

# Create greinke_july
greinke_july <- subset(greinke, july == "july")

# Create greinke_other
greinke_other <- subset(greinke, july == "other")

# Use par to format your plot layout
par(mfrow = c(1, 2))

# Plot start_speed histogram from july
hist(greinke_july$start_speed)

# Plot start_speed histogram for other months
hist(greinke_other$start_speed)

Interpretation

  • The distribution for July looks very similar to that of all other months: strongly skewed to the left.
  • The skew arises because the data include both off-speed pitches and fastballs. Off-speed pitches tend to have lower velocities, so the distribution has a long left tail.
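
The summary() output earlier shows the same pattern numerically: the mean (88.44) sits below the median (89.80). The simulation below is a rough sketch of why mixing pitch types produces that signature; the group sizes, means, and standard deviations are invented for illustration and are not estimates from the greinke data.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated speeds (made-up parameters, not Greinke's actual distribution)
fastballs   <- rnorm(600, mean = 92, sd = 1)
offspeed    <- rnorm(400, mean = 82, sd = 4)
all_pitches <- c(fastballs, offspeed)

# A long lower tail pulls the mean below the median
mean(all_pitches) < median(all_pitches)

# Within a single pitch type the mean and median stay close together
abs(mean(fastballs) - median(fastballs))
```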

X.1.5 Fastball velocity distribution

Look only at four-seam fastballs (FF) in order to evaluate the distribution of the hardest pitch thrown. Doing so will allow you to understand whether his ability to throw at a higher velocity was different in July relative to other months.

# Create july_ff
july_ff <- subset(greinke_july, pitch_type == "FF")

# Create other_ff
other_ff <- subset(greinke_other, pitch_type == "FF")

# Set up a 1 x 2 plot layout
par(mfrow = c(1, 2))

# Plot histogram of July fastball speeds
hist(july_ff$start_speed)

# Plot histogram of other month fastball speeds
hist(other_ff$start_speed)

Interpretation

  • The distributions look more symmetrical now.
  • Plotted side-by-side, however, it’s hard to tell whether Greinke’s four-seam fastballs were on average slower or faster in July compared to those from all other months.

X.1.6 Distribution comparisons with color

Plot the two histograms on top of one another in different colors to get more insight.

# Make a fastball speed histogram for other months
hist(other_ff$start_speed,
     col = "#00009950", 
     freq = FALSE,                   
     ylim = c(0, .35), xlab = "Velocity (mph)",
     main = "Greinke 4-Seam Fastball Velocity")

# col = "#rrggbbaa": rr is red, gg is green, bb is blue, and aa is alpha (opacity). Each pair is a hexadecimal value from 00 to FF; lower alpha values are more transparent.

# Add a histogram for July
hist(july_ff$start_speed,
     col = "#99000050", 
     freq = FALSE,      #to plot a density not frequencies
     add = TRUE)        #so it doesn't overwrite the first plot

# Draw vertical line at the mean of other_ff
abline(v = mean(other_ff$start_speed), col = "#00009950", lwd = 2)

# Draw vertical line at the mean of july_ff
abline(v = mean(july_ff$start_speed), col = "#99000050", lwd = 2)
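
To see what the colour strings used above encode, base R's col2rgb() can split them into 0-255 channel values. A short sketch follows; the variable names are illustrative only.

```r
# Decompose the two colours used above into 0-255 channel values
blue_50 <- col2rgb("#00009950", alpha = TRUE)
red_50  <- col2rgb("#99000050", alpha = TRUE)

blue_50["blue", 1]    # hex 99 is 153 on the 0-255 scale
blue_50["alpha", 1]   # hex 50 is 80, i.e. about 31% opaque (80 / 255)

# rgb() builds the same kind of string from numeric channels
rgb(0, 0, 153, alpha = 80, maxColorValue = 255)
```

Because both histograms are only partly opaque, the region where they overlap appears as a blend of the two colours.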

Interpretation

  • In July, the distribution of Greinke’s velocities is shifted right, and he had a higher average start_speed compared to other months.

X.1.7 tapply() for velocity changes

See if there are differences in start_speed between July and other months

  • for all pitch types,
  • then for four-seam fastballs only.

tapply() allows you to apply functions to a continuous variable for each group of a factor variable.
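
A minimal sketch of that pattern on five made-up pitches (the speed and type vectors are illustrative, not taken from greinke):

```r
# Toy data: five pitch speeds and their pitch types
speed <- c(92, 94, 86, 88, 93)
type  <- factor(c("FF", "FF", "SL", "SL", "FF"))

# Mean speed within each pitch type; the result is a named one-dimensional array
avg <- tapply(speed, type, mean)
avg

# Any summary function can be supplied, e.g. the group sizes
tapply(speed, type, length)
```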

# Summarize velocity in July and other months
tapply(greinke$start_speed, greinke$july, mean)
##     july    other 
## 88.86489 88.35601
# Create greinke_ff
greinke_ff <- subset(greinke, pitch_type == "FF")

# Calculate mean fastball velocities: ff_velo_month
ff_velo_month <- tapply(greinke_ff$start_speed, greinke_ff$july, mean)

# Print ff_velo_month
ff_velo_month
##     july    other 
## 92.42077 91.66474

Interpretation

  • This is consistent with the histogram we created earlier.

X.1.8 Game-by-game velocity changes

Create a new data frame with the average velocity per game_date.

# Create ff_dt
ff_dt <- data.frame(tapply(greinke_ff$start_speed, greinke_ff$game_date, mean))

# Print the first 6 rows of ff_dt
head(ff_dt)
##            tapply.greinke_ff.start_speed..greinke_ff.game_date..mean.
## 2015-04-07                                                   90.82632
## 2015-04-12                                                   90.51622
## 2015-04-18                                                   90.28654
## 2015-04-24                                                   90.51277
## 2015-04-29                                                   90.40732
## 2015-05-05                                                   90.33043

X.1.9 Tidying the data frame

Include the row names as a variable in ff_dt, formatted as dates.

# Create game_date in ff_dt
ff_dt$game_date <- as.Date(row.names(ff_dt), format = "%Y-%m-%d")

# Rename the first column
colnames(ff_dt)[1] <- "start_speed"

# Remove row names
row.names(ff_dt) <- NULL

# View head of ff_dt
head(ff_dt)
##   start_speed  game_date
## 1    90.82632 2015-04-07
## 2    90.51622 2015-04-12
## 3    90.28654 2015-04-18
## 4    90.51277 2015-04-24
## 5    90.40732 2015-04-29
## 6    90.33043 2015-05-05
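
As an alternative sketch, aggregate() can produce a similar per-game data frame in one step, with column names and the Date class already in place. The ff data frame below is toy data, and this is not the route taken in the course.

```r
# Toy fastball data: two pitches in each of two games
ff <- data.frame(
  game_date   = as.Date(c("2015-04-07", "2015-04-07", "2015-04-12", "2015-04-12")),
  start_speed = c(90.5, 91.1, 90.2, 90.8)
)

# One row per game with the mean fastball speed
by_game <- aggregate(start_speed ~ game_date, data = ff, FUN = mean)
by_game
```

Note that the column order differs from ff_dt: the grouping column comes first here.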

X.1.10 A game-by-game line plot

Plot Zack Greinke’s fastball velocity on a game-by-game basis using a line plot.

# Plot game-by-game 4-seam fastballs
plot(ff_dt$start_speed ~ ff_dt$game_date,
     lwd = 4, type = "l", ylim = c(88, 95),
     main = "Greinke 4-Seam Fastball Velocity",
     xlab = "Date", ylab = "Velocity (mph)")

Interpretation

  • Velocity seems to be creeping upward throughout the season for Greinke.

X.1.11 Adding jittered points

Plot individual pitches along with the game average line.

# Redraw the game-by-game line plot from the previous section
plot(ff_dt$start_speed ~ ff_dt$game_date,
     lwd = 4, type = "l", ylim = c(88, 95),
     main = "Greinke 4-Seam Fastball Velocity",
     xlab = "Date", ylab = "Velocity (mph)")

# Add jittered points to the plot
points(greinke_ff$start_speed ~ jitter(as.numeric(greinke_ff$game_date)), 
       pch = 16,          #to create solid circular points
       col = "#99004450")
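
jitter() adds a small amount of uniform random noise so that pitches from the same game do not all sit on one vertical line. The plot above relies on the default amount, which jitter() derives from the spacing of the data; the sketch below instead sets amount explicitly (an illustrative choice) so the size of the shift is easy to verify.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Numeric dates for five pitches from the same game (days since 1970-01-01)
x <- as.numeric(as.Date(rep("2015-07-04", 5)))

# Shift each point by up to 0.3 days in either direction
x_jit <- jitter(x, amount = 0.3)

# Every shift stays within [-0.3, 0.3]
range(x_jit - x)
```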

Interpretation

  • There seems to be a pretty wide distribution of fastball velocities within each game.

X.2 Exploring Pitch Types

X.2.1 Pitch mix tables

So far, you have focused mainly on velocity of pitches, but there are other characteristics that make a pitcher successful: pitch mix and location. For example, perhaps the increased velocity led to Greinke taking advantage of his fastball a bit more. Or maybe it made his off-speed pitches more effective, ultimately leading to greater usage of these pitch types.

Greinke’s seven pitch types

  • FF: four-seam fastball
  • FT: two-seam fastball
  • SL: slider
  • CH: changeup
  • CU: curveball
  • EP: eephus pitch; these are misclassified (remove)
  • IN: intentional balls (remove)

table() versus tapply()

  • table(factor variable you want to summarize, factor grouping variable)
  • tapply(numerical variable you want to summarize, grouping variable, function to apply across each group)

# Subset the data to remove pitch types "IN" and "EP"
greinke <- subset(greinke, pitch_type != "IN" & pitch_type != "EP")

# Drop the levels from pitch_type
greinke$pitch_type <- droplevels(greinke$pitch_type)

# Create type_tab
type_tab <- table(greinke$pitch_type, greinke$july) # rows: 1st argument; columns: 2nd argument

# Print type_tab
type_tab
##     
##      july other
##   CH  112   487
##   CU   51   242
##   FF  207  1191
##   FT   66   255
##   SL   86   535

Interpretation

  • Greinke used his fastball more than any other pitch in both July and other months of 2015.

X.2.2 Pitch mix table using prop.table()

It could be more informative and interpretable if you present this information as proportions.

# Create type_prop table
type_prop <- round(prop.table(type_tab, margin = 2), 3) #margin=2 tells R to compute prop within each column

# Print type_prop
type_prop
##     
##       july other
##   CH 0.215 0.180
##   CU 0.098 0.089
##   FF 0.397 0.439
##   FT 0.126 0.094
##   SL 0.165 0.197
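
The margin argument decides which totals the proportions are computed against. A small sketch with made-up counts (the counts matrix is illustrative only) shows the difference between the two margins:

```r
# Toy 2 x 2 table of pitch counts (made-up numbers)
counts <- matrix(c(10, 30, 20, 40), nrow = 2,
                 dimnames = list(pitch = c("FF", "SL"),
                                 month = c("july", "other")))

# margin = 2: each column sums to 1 (pitch mix within a month)
by_month <- prop.table(counts, margin = 2)
colSums(by_month)

# margin = 1: each row sums to 1 (how a pitch's usage splits across months)
by_pitch <- prop.table(counts, margin = 1)
rowSums(by_pitch)
```

For comparing pitch mix between July and other months, margin = 2 is the relevant choice, because the two groups contain very different numbers of pitches.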

X.2.3 Pitch mix tables - July vs. other

Next, compare four-seam fastball velocity with four-seam fastball usage in July and in other months.

# Create ff_prop
ff_prop <- type_prop[3, ]

# Print ff_prop
ff_prop
##  july other 
## 0.397 0.439
# Print ff_velo_month
ff_velo_month
##     july    other 
## 92.42077 91.66474

Interpretation

  • Greinke threw his 4-seam fastball harder in July than in other months, on average, and used it less often.
  • This could indicate that Greinke sees the non-fastball pitches as being more effective when there’s a larger difference in velocity between them and his fastball.

X.2.4 Pitch mix tables - changes in pitch type rates

Now you will directly calculate the difference in pitch type rates for July and the other months.

Prepare type_prop. This was done behind the scenes in DataCamp.

  • Convert type_prop, the table object, to a data frame object.
  • Convert row names to a new column, Pitch.
  • Change column names july to July and other to Other.

type_prop
##     
##       july other
##   CH 0.215 0.180
##   CU 0.098 0.089
##   FF 0.397 0.439
##   FT 0.126 0.094
##   SL 0.165 0.197
str(type_prop)
##  table [1:5, 1:2] 0.215 0.098 0.397 0.126 0.165 0.18 0.089 0.439 0.094 0.197
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:5] "CH" "CU" "FF" "FT" ...
##   ..$ : chr [1:2] "july" "other"
type_prop <- as.data.frame.matrix(type_prop)
type_prop
##     july other
## CH 0.215 0.180
## CU 0.098 0.089
## FF 0.397 0.439
## FT 0.126 0.094
## SL 0.165 0.197
Pitch <- rownames(type_prop)
Pitch
## [1] "CH" "CU" "FF" "FT" "SL"
type_prop <- cbind(Pitch, type_prop)
colnames(type_prop) <- c("Pitch", "July", "Other")
type_prop
##    Pitch  July Other
## CH    CH 0.215 0.180
## CU    CU 0.098 0.089
## FF    FF 0.397 0.439
## FT    FT 0.126 0.094
## SL    SL 0.165 0.197
str(type_prop)
## 'data.frame':    5 obs. of  3 variables:
##  $ Pitch: Factor w/ 5 levels "CH","CU","FF",..: 1 2 3 4 5
##  $ July : num  0.215 0.098 0.397 0.126 0.165
##  $ Other: num  0.18 0.089 0.439 0.094 0.197
# Create the Difference column
type_prop$Difference <- (type_prop$July - type_prop$Other) / type_prop$Other

# Print type_prop
type_prop
##    Pitch  July Other  Difference
## CH    CH 0.215 0.180  0.19444444
## CU    CU 0.098 0.089  0.10112360
## FF    FF 0.397 0.439 -0.09567198
## FT    FT 0.126 0.094  0.34042553
## SL    SL 0.165 0.197 -0.16243655
# Plot a barplot
barplot(type_prop$Difference, names.arg = type_prop$Pitch, 
        main = "Pitch Usage in July vs. Other Months", 
        ylab = "Percentage Change in July", 
        ylim = c(-0.3, 0.3))
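
The Difference column is a relative change, so a small absolute gap can look large when the baseline rate is small. The worked sketch below reuses the rounded FT and SL proportions from the table above; rel_change() is a hypothetical helper introduced only for this illustration.

```r
# Relative change of a rate: positive means higher in July than in other months
# (hypothetical helper, not part of the course code)
rel_change <- function(july, other) (july - other) / other

# Rounded proportions taken from the type_prop table above
rel_change(july = 0.126, other = 0.094)   # FT: about +34%
rel_change(july = 0.165, other = 0.197)   # SL: about -16%

# The absolute gap for FT is only about 3 percentage points
0.126 - 0.094
```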

Interpretation

  • Greinke decreased his slider (SL) use more in July than that of any other pitch.
  • Greinke decreased his four-seam fastball (FF) use as well.
  • Greinke increased his two-seam fastball (FT) use by a large margin in July relative to other months.

X.2.5 Ball-strike count frequency

While pitch types are interesting in their own right, it might be more useful to think about what types of pitches Greinke uses in different ball-strike counts.

Make a table to examine the rate at which Greinke throws pitches in each of the ball-strike counts.

# Create bs_table
bs_table <- table(factor(greinke$balls), factor(greinke$strikes))

# Create bs_prop_table
bs_prop_table <- round(prop.table(bs_table), 3)

# Print bs_prop_table
bs_prop_table
##    
##         0     1     2
##   0 0.261 0.135 0.062
##   1 0.095 0.115 0.096
##   2 0.026 0.053 0.093
##   3 0.006 0.015 0.043
# Print row sums
rowSums(bs_prop_table)
##     0     1     2     3 
## 0.458 0.306 0.172 0.064
# Print column sums
colSums(bs_prop_table)
##     0     1     2 
## 0.388 0.318 0.294

Interpretation

  • The proportion of pitches thrown decreases as the number of balls or strikes increases.
  • This makes sense, since the lower order counts have to be passed through in order to get to more balls or more strikes.

X.2.6 Make a new variable

To simplify future exercises, you’ll now create a new variable called bs_count that combines the balls and strikes variables into a single ball-strike count.

# Create bs_count
greinke$bs_count <- paste(greinke$balls, greinke$strikes, sep = "-")

# Print the first 6 rows of greinke
head(greinke)
##         p_name pitcher_id batter_stand pitch_type    pitch_result
## 1 Zack Greinke     425844            R         FF            Ball
## 2 Zack Greinke     425844            R         FF Swinging Strike
## 3 Zack Greinke     425844            R         FF   Called Strike
## 4 Zack Greinke     425844            R         SL Swinging Strike
## 5 Zack Greinke     425844            R         FF Swinging Strike
## 6 Zack Greinke     425844            R         SL Swinging Strike
##   atbat_result start_speed    z0     x0  pfx_x  pfx_z    px    pz
## 1         Walk        94.2 5.997 -0.675 -4.457  9.760 1.714 1.925
## 2       Single        92.4 6.281 -0.760 -1.590 11.400 0.589 3.271
## 3     Home Run        92.7 6.168 -0.958 -1.884  9.245 0.399 2.918
## 4    Strikeout        86.9 6.077 -0.939  3.594  0.762 0.764 1.306
## 5    Strikeout        92.8 6.107 -0.524 -0.558 11.134 1.517 2.193
## 6    Strikeout        87.8 6.321 -0.948  4.313  0.132 0.695 3.431
##   break_angle break_length spin_rate spin_dir balls strikes outs
## 1        24.8          3.5  2188.802  204.457     2       2    2
## 2        10.1          2.7  2312.202  187.913     1       1    0
## 3         9.2          3.5  1889.841  191.468     0       0    1
## 4       -11.4          8.0   693.649  102.648     1       2    0
## 5        -0.4          2.8  2242.916  182.859     1       2    0
## 6       -13.6          7.8   828.693   92.330     2       2    1
##    game_date year month day inning inning_topbot batted_ball_type
## 1 2015-10-03 2015    10  03      4           top                 
## 2 2015-10-03 2015    10  03      3           top                 
## 3 2015-10-03 2015    10  03      5           top                 
## 4 2015-10-03 2015    10  03      6           top                 
## 5 2015-10-03 2015    10  03      8           top                 
## 6 2015-10-03 2015    10  03      1           top                 
##   batted_ball_velocity   hc_x  hc_y pitch_id distance_feet  july bs_count
## 1                   NA   0.00  0.00      160            NA other      2-2
## 2                  104 123.56 97.26       95             0 other      1-1
## 3                  103  50.88 31.17      218           425 other      0-0
## 4                   NA   0.00  0.00      265            NA other      1-2
## 5                   NA   0.00  0.00      374            NA other      1-2
## 6                   NA   0.00  0.00       14            NA other      2-2

X.2.7 Ball-strike count in July vs. other months

Identify the percentage change in the rate at which Greinke put himself in each of the ball-strike counts.

# Create bs_count_tab
bs_count_tab <- table(greinke$bs_count, greinke$july)

# Create bs_month
bs_month <- round(prop.table(bs_count_tab, margin = 2), 3) ##margin=2 tells R to compute prop within each column

# Print bs_month
bs_month
##      
##        july other
##   0-0 0.261 0.262
##   0-1 0.134 0.135
##   0-2 0.056 0.063
##   1-0 0.105 0.093
##   1-1 0.123 0.113
##   1-2 0.092 0.097
##   2-0 0.029 0.025
##   2-1 0.052 0.053
##   2-2 0.086 0.094
##   3-0 0.006 0.006
##   3-1 0.015 0.015
##   3-2 0.042 0.043

Interpretation

  • It looks like Greinke got into more 1-0 and 2-0 counts, but fewer two-strike counts.
  • That’s a bit unexpected, since he had so much July success.

X.2.8 Visualizing ball-strike count in July vs. other months

Create a bar plot to visualize how common each ball-strike count was in July vs. other months.

# Create diff_bs
diff_bs <- round((bs_month[, 1]- bs_month[, 2]) / bs_month[, 2], 3)

# Print diff_bs
diff_bs
##    0-0    0-1    0-2    1-0    1-1    1-2    2-0    2-1    2-2    3-0 
## -0.004 -0.007 -0.111  0.129  0.088 -0.052  0.160 -0.019 -0.085  0.000 
##    3-1    3-2 
##  0.000 -0.023
# Create a bar plot of the changes
barplot(diff_bs, main = "Ball-Strike Count Rate in July vs. Other Months", 
        ylab = "Percentage Change in July", ylim = c(-0.15, 0.15), las = 2)

X.2.9 Cross-tabulate pitch use in ball-strike counts

See if Greinke used certain pitches more or less often in specific counts overall. In particular, tabulate the proportion of times he throws each pitch for each count.

# Create type_bs
type_bs <- table(greinke$pitch_type, greinke$bs_count)

# Print type_bs
type_bs
##     
##      0-0 0-1 0-2 1-0 1-1 1-2 2-0 2-1 2-2 3-0 3-1 3-2
##   CH  92  93  36  70  79  62  27  46  52   0  18  24
##   CU 124  49  10  34  38   9   4  12   9   0   0   4
##   FF 482 167  61 136 136  89  37  71 109  17  24  69
##   FT  54  55  19  32  50  31  11  18  34   2   3  12
##   SL  93  71  75  35  68 119   5  24  96   0   5  30
# Create type_bs_prop
type_bs_prop <- round(prop.table(type_bs, margin = 2), 3)

# Print type_bs_prop
type_bs_prop
##     
##        0-0   0-1   0-2   1-0   1-1   1-2   2-0   2-1   2-2   3-0   3-1
##   CH 0.109 0.214 0.179 0.228 0.213 0.200 0.321 0.269 0.173 0.000 0.360
##   CU 0.147 0.113 0.050 0.111 0.102 0.029 0.048 0.070 0.030 0.000 0.000
##   FF 0.570 0.384 0.303 0.443 0.367 0.287 0.440 0.415 0.363 0.895 0.480
##   FT 0.064 0.126 0.095 0.104 0.135 0.100 0.131 0.105 0.113 0.105 0.060
##   SL 0.110 0.163 0.373 0.114 0.183 0.384 0.060 0.140 0.320 0.000 0.100
##     
##        3-2
##   CH 0.173
##   CU 0.029
##   FF 0.496
##   FT 0.086
##   SL 0.216

Interpretation

  • He used his 4-seam fastball more than any other pitch in 3-0 counts.
  • His slider usage was highest in two-strike counts, and it was his most-used pitch in 0-2 and 1-2 counts.
  • He never threw his curveball in a 3-0 or a 3-1 count.
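
Reading the most-used pitch off a wide table by eye is error-prone. One way to check it programmatically is sketched below, using counts copied from four columns of the type_bs table above; the type_bs_sub and top_pitch names are illustrative.

```r
# Counts copied from four columns of the type_bs table above
type_bs_sub <- matrix(
  c(36, 10,  61, 19,  75,    # 0-2
    62,  9,  89, 31, 119,    # 1-2
    52,  9, 109, 34,  96,    # 2-2
     0,  0,  17,  2,   0),   # 3-0
  nrow = 5,
  dimnames = list(c("CH", "CU", "FF", "FT", "SL"),
                  c("0-2", "1-2", "2-2", "3-0"))
)

# For each count (column), report the pitch with the largest count
top_pitch <- apply(type_bs_sub, 2, function(col) names(which.max(col)))
top_pitch
```

On these columns the slider leads in 0-2 and 1-2 counts, while the four-seam fastball leads in 2-2 and 3-0 counts.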

X.2.10 Pitch mix late in games

See if Greinke resorts more to his off-speed pitches later in games. There is often talk about pitchers having more trouble late in games. There are a number of reasons for this. They could be getting tired and losing velocity, or batters may have already seen pitches they throw.

Create a variable indicating that a pitch was thrown late in a game, defined as any pitch past the 5th inning. Then, you can make a table of pitch selection for late-game pitches.

# Create the late_in_game column
greinke$late_in_game <- ifelse(greinke$inning > 5, 1, 0)

# Convert late_in_game
greinke$late_in_game <- factor(greinke$late_in_game)

# Create type_late
type_late <- table(greinke$pitch_type, greinke$late_in_game)

# Create type_late_prop
type_late_prop <- round(prop.table(type_late, margin = 2), 3)

# Print type_late_prop
type_late_prop
##     
##          0     1
##   CH 0.178 0.204
##   CU 0.086 0.102
##   FF 0.444 0.403
##   FT 0.107 0.080
##   SL 0.185 0.211
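
The 0/1 column headers above are not self-explanatory. As an optional variation, the indicator can be created with descriptive labels from the start; the inning vector below is toy data, and the "Early"/"Late" labels are a choice made for this sketch.

```r
# Toy innings for five pitches
inning <- c(1, 5, 6, 8, 3)

# Label the two groups directly instead of coding them 0 / 1
late_in_game <- factor(ifelse(inning > 5, "Late", "Early"),
                       levels = c("Early", "Late"))

# The table headers now read Early and Late
table(late_in_game)
```

Setting levels explicitly keeps "Early" first regardless of which value happens to appear first in the data.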

X.2.11 Late game pitch mix - grouped barplots

Make use of a grouped barplot, so that you can assess whether there are changes in pitch selection for specific pitches early vs. late in the game.

# Create t_type_late_prop
# (DataCamp reuses the name type_late for the proportion table; type_late_prop is used here to avoid confusion)
t_type_late_prop <- t(type_late_prop)

# Print dimensions of t_type_late_prop
dim(t_type_late_prop)
## [1] 2 5
# Print dimensions of type_late_prop
dim(type_late_prop)
## [1] 5 2
# Change row names
rownames(t_type_late_prop) <- c("Early", "Late")

# Make barplot using t_type_late_prop
barplot(t_type_late_prop, 
        beside = TRUE,    #so that the bars are grouped by pitch type
        col = c("red", "blue"), 
        main = "Early vs. Late In Game Pitch Selection", 
        ylab = "Pitch Selection Proportion", 
        legend = rownames(t_type_late_prop))

Interpretation

  • He uses his fastballs less often late in games.

X.3 Exploring Pitch Locations

X.3.1 Locational changes - summary

Location variables:

  • px horizontal location
    • a positive value is outside against righties and inside against lefties; and
    • a negative value is outside against lefties and inside against righties.
  • pz vertical location
  • Both variables are recorded in feet, so multiply by 12 to express a result in inches.

Calculate the average pitch height pz for Greinke in July relative to other months using the code provided.

# Calculate average pitch height in inches in July vs. other months
pitch_heights <- tapply(greinke$pz, greinke$july, mean) * 12 # convert feet to inches
pitch_heights
##     july    other 
## 26.26002 26.39904

Find the average horizontal location px to left-handed batters (LHB) and right-handed batters (RHB), respectively.

# Create greinke_lhb
greinke_lhb <- subset(greinke, batter_stand == "L")

# Create greinke_rhb
greinke_rhb <- subset(greinke, batter_stand == "R")

# Compute average px location for LHB
pitch_width_lhb <- tapply(greinke_lhb$px, greinke_lhb$july, mean) * 12
pitch_width_lhb
##      july     other 
## -4.627355 -6.320144
# Compute average px location for RHB
pitch_width_rhb <- tapply(greinke_rhb$px, greinke_rhb$july, mean) * 12
pitch_width_rhb

Interpretation

  • He threw his pitches slightly lower in July relative to other months.
  • He threw his pitches less outside in July relative to other months, to both LHB and RHB.

X.3.2 Locational changes - visualization

It’s often more helpful to visualize the pitch location, rather than guess based on averages of horizontal location numbers.

# Plot location of all pitches
plot(greinke$pz ~ greinke$px,
     col = factor(greinke$july),
     xlim = c(-3, 3))

# Set up a 1 x 2 plot layout
par(mfrow = c(1, 2))

# Plot the pitch locations for July
plot(pz ~ px, data = greinke_july,
     col = "red", pch = 16,
     xlim = c(-3, 3), ylim = c(-1, 6),
     main = "July")

# Plot the pitch locations for other months
plot(pz ~ px, data = greinke_other,
     col = "black", pch = 16,
     xlim = c(-3, 3), ylim = c(-1, 6),
     main = "Other months")

Interpretation

  • Even split into separate panels, the scatter plots are difficult to interpret.

X.3.3 Locational changes - plotting a grid

Plotting each group in a separate panel didn’t seem to help much. One way to get around the lack of useful interpretation from a scatter plot is to bin the data. Binning the data into groups and plotting them as a grid summarizes the location of pitches.

# Create greinke_sub
greinke_sub <- subset(greinke, px > -2 & px < 2 & pz > 0 & pz < 5)
# Plot pitch location window
plot(x = c(-2, 2), y = c(0, 5), type = "n",
     main = "Greinke Locational Zone Proportions",
     xlab = "Horizontal Location (ft.; Catcher's View)",
     ylab = "Vertical Location (ft.)")

# Add the grid lines
grid(lty = "solid", col = "black")
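
The plotted window spans 4 ft horizontally and 5 ft vertically, so one-foot cells give a 4 x 5 grid of 20 bins. The sketch below shows one way such bins could be computed with cut(). The zone_id() helper and its numbering (left to right, top to bottom) are assumptions made for illustration and may not match the numbering used in the course data.

```r
# Hypothetical helper: assign a zone number on a 4 x 5 grid of 1-ft cells.
# The numbering (left to right, top to bottom) is an assumption for illustration.
zone_id <- function(px, pz) {
  col_idx <- cut(px, breaks = -2:2, labels = FALSE)  # 1 (left) to 4 (right)
  row_idx <- cut(pz, breaks = 0:5,  labels = FALSE)  # 1 (bottom) to 5 (top)
  (5 - row_idx) * 4 + col_idx
}

# Three toy pitch locations in feet
px <- c(-1.5, 0.589, 1.5)
pz <- c( 4.5, 3.271, 0.5)
zones <- zone_id(px, pz)
zones

# Share of pitches per zone, laid out like the plotted grid (top row first)
zone_prop <- table(factor(zones, levels = 1:20)) / length(zones)
matrix(zone_prop, nrow = 5, byrow = TRUE)
```

Under this numbering the top-left cell is zone 1 and the bottom-right cell is zone 20.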

X.3.4 Binning locational data

DataCamp supplies three new variables behind the scenes.

  1. zone
    • There are 20 possibilities for the zone variable, numbered 1 through 20.
    • Each classification tells us about the location of the given pitch, binned as a grid across the strike zone and just outside the strike zone.
  2. zone_px
  3. zone_pz
greinke_sub <- cbind(greinke_sub, read.csv("data/greinke_sub_zone.csv"))
str(greinke_sub)
## 'data.frame':    3128 obs. of  38 variables:
##  $ p_name              : Factor w/ 1 level "Zack Greinke": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pitcher_id          : int  425844 425844 425844 425844 425844 425844 425844 425844 425844 425844 ...
##  $ batter_stand        : Factor w/ 2 levels "L","R": 2 2 2 2 2 2 2 2 2 2 ...
##  $ pitch_type          : Factor w/ 5 levels "CH","CU","FF",..: 3 3 3 5 3 5 1 3 5 5 ...
##  $ pitch_result        : Factor w/ 15 levels "Ball","Ball In Dirt",..: 1 14 3 14 14 14 15 3 4 14 ...
##  $ atbat_result        : Factor w/ 24 levels "Bunt Groundout",..: 24 20 12 21 21 21 21 21 10 24 ...
##  $ start_speed         : num  94.2 92.4 92.7 86.9 92.8 87.8 90.3 92.7 85.5 87.3 ...
##  $ z0                  : num  6 6.28 6.17 6.08 6.11 ...
##  $ x0                  : num  -0.675 -0.76 -0.958 -0.939 -0.524 ...
##  $ pfx_x               : num  -4.457 -1.59 -1.884 3.594 -0.558 ...
##  $ pfx_z               : num  9.76 11.4 9.245 0.762 11.134 ...
##  $ px                  : num  1.714 0.589 0.399 0.764 1.517 ...
##  $ pz                  : num  1.93 3.27 2.92 1.31 2.19 ...
##  $ break_angle         : num  24.8 10.1 9.2 -11.4 -0.4 -13.6 22.5 25.1 -8.4 -11.3 ...
##  $ break_length        : num  3.5 2.7 3.5 8 2.8 7.8 7.4 3.8 7.5 7.4 ...
##  $ spin_rate           : num  2189 2312 1890 694 2243 ...
##  $ spin_dir            : num  204 188 191 103 183 ...
##  $ balls               : int  2 1 0 1 1 2 1 0 0 0 ...
##  $ strikes             : int  2 1 0 2 2 2 2 2 0 1 ...
##  $ outs                : int  2 0 1 0 0 1 1 2 2 2 ...
##  $ game_date           : Date, format: "2015-10-03" "2015-10-03" ...
##  $ year                : chr  "2015" "2015" "2015" "2015" ...
##  $ month               : num  10 10 10 10 10 10 10 10 10 10 ...
##  $ day                 : chr  "03" "03" "03" "03" ...
##  $ inning              : int  4 3 5 6 8 1 6 5 8 4 ...
##  $ inning_topbot       : Factor w/ 2 levels "bot","top": 2 2 2 2 2 2 2 2 2 2 ...
##  $ batted_ball_type    : Factor w/ 5 levels "","FB","GB","LD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ batted_ball_velocity: int  NA 104 103 NA NA NA NA NA NA NA ...
##  $ hc_x                : num  0 123.6 50.9 0 0 ...
##  $ hc_y                : num  0 97.3 31.2 0 0 ...
##  $ pitch_id            : int  160 95 218 265 374 14 279 231 386 156 ...
##  $ distance_feet       : int  NA 0 425 NA NA NA NA NA NA NA ...
##  $ july                : chr  "other" "other" "other" "other" ...
##  $ bs_count            : chr  "2-2" "1-1" "0-0" "1-2" ...
##  $ late_in_game        : Factor w/ 2 levels "0","1": 1 1 1 2 2 1 2 1 2 1 ...
##  $ zone                : int  16 7 11 15 12 7 18 11 11 16 ...
##  $ zone_px             : num  1.5 0.5 0.5 0.5 1.5 0.5 -0.5 0.5 0.5 1.5 ...
##  $ zone_pz             : num  1.5 3.5 2.5 1.5 2.5 3.5 0.5 2.5 2.5 1.5 ...
table(greinke_sub$zone)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##  22  23  28   6  92 144 213  56 198 400 509 144 123 364 377 171  18  83 
##  19  20 
##  99  58
# Create greinke_table
greinke_table <- table(greinke_sub$zone)

# Create zone_prop
zone_prop <- round(prop.table(greinke_table), 3)

# Plot strike zone grid, don't change this
#plot_grid() #equivalent to the chunk below.

#copied from X.3.3 Locational changes - plotting a grid
##########################################################
# Plot pitch location window
plot(x = c(-2, 2), y = c(0, 5), type = "n",
     main = "Greinke Locational Zone Proportions",
     xlab = "Horizontal Location (ft.; Catcher's View)",
     ylab = "Vertical Location (ft.)")

# Add the grid lines
grid(lty = "solid", col = "black")
###########################################################

# Add text from zone_prop[1]
text(zone_prop[1], x = -1.5, y = 4.5, cex = 1.5)

Interpretation

  • In 2015, Greinke threw to the top left grid panel approximately 0.7% of the time.
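
The zone, zone_px, and zone_pz columns are supplied by DataCamp, so the code that creates them is not shown here. The rows printed by str() are consistent with a 4-by-5 grid of 1-foot cells numbered left to right, then top to bottom. The sketch below reconstructs the three variables under that assumption; assign_zone() is a hypothetical helper, not DataCamp's function.

```r
# Hypothetical helper: bin pitch locations into a 4 x 5 grid of 1-ft cells.
# Assumes -2 < px < 2 and 0 < pz < 5 (as in greinke_sub) and zones numbered
# 1-20 starting at the top-left cell, left to right, then top to bottom.
assign_zone <- function(px, pz) {
  col <- floor(px) + 3        # column 1 (left) to 4 (right)
  row <- 5 - floor(pz)        # row 1 (top) to 5 (bottom)
  data.frame(zone    = (row - 1) * 4 + col,
             zone_px = floor(px) + 0.5,   # horizontal center of the cell
             zone_pz = floor(pz) + 0.5)   # vertical center of the cell
}

# px and pz for the first six pitches shown by head(greinke)
px <- c(1.714, 0.589, 0.399, 0.764, 1.517, 0.695)
pz <- c(1.925, 3.271, 2.918, 1.306, 2.193, 3.431)
zones <- assign_zone(px, pz)
zones
```

For these six pitches the sketch returns zones 16, 7, 11, 15, 12, and 7, which agrees with the first six zone values in the str() output.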

X.3.5 For loops and plotting locational grid proportions

In this exercise, you will use a for loop to plot the proportions for each zone in the grid. This prevents you from having to individually plot the text() for each zone onto the grid with its own line of code.

Note that in the data, each zone is associated with a given zone_px and zone_pz coordinate for plotting the text. Additionally, each zone proportion in the zone_prop table is associated with a given zone number. Now, it’s up to you to create a for loop that plots text from zone_prop onto the grid for all zones (1 through 20).

# Plot strike zone grid, don't change this
#plot_grid() #equivalent to the chunk below.

#copied from X.3.3 Locational changes - plotting a grid
##########################################################
# Plot pitch location window
plot(x = c(-2, 2), y = c(0, 5), type = "n",
     main = "Greinke Locational Zone Proportions",
     xlab = "Horizontal Location (ft.; Catcher's View)",
     ylab = "Vertical Location (ft.)")

# Add the grid lines
grid(lty = "solid", col = "black")
###########################################################


# Plot text using for loop
for(i in 1:20) {
  text(mean(greinke_sub$zone_px[greinke_sub$zone == i]),  #x-coordinate
       mean(greinke_sub$zone_pz[greinke_sub$zone == i]),  #y-coordinate
       zone_prop[i], cex = 1.5)
}
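
The loop recomputes the mean of zone_px and zone_pz for every zone, even though each zone has a single plotting coordinate. As an alternative, the sketch below builds a lookup table of zone centers once. It uses a small toy data frame rather than greinke_sub, and the names toy and centers are made up for illustration.

```r
# Toy stand-in for greinke_sub (illustrative values, not real pitches)
toy <- data.frame(zone    = c(1, 1, 2, 5, 5, 6),
                  zone_px = c(-1.5, -1.5, -0.5, -1.5, -1.5, -0.5),
                  zone_pz = c(4.5, 4.5, 4.5, 3.5, 3.5, 3.5))

# One row per zone, sorted by zone number, with its plotting coordinates
centers <- unique(toy[order(toy$zone), c("zone", "zone_px", "zone_pz")])

# Share of pitches in each zone; table() sorts by zone, so the order matches
centers$prop <- round(as.vector(prop.table(table(toy$zone))), 3)
centers
```

Because text() is vectorized over its x, y, and labels arguments, a single call such as text(centers$zone_px, centers$zone_pz, labels = centers$prop, cex = 1.5) would label every cell once the grid has been drawn.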

X.3.6 Binned locational differences

Look at the differences in Greinke’s zone proportions between July and other months.

# Create zone_prop_july
zone_prop_july <- round(
  table(greinke_sub$zone[greinke_sub$july == "july"]) /
    nrow(subset(greinke_sub, july == "july")), 3)

# Create zone_prop_other
zone_prop_other <- round(
  table(greinke_sub$zone[greinke_sub$july == "other"]) /
    nrow(subset(greinke_sub, july == "other")), 3)

# Print zone_prop_july
zone_prop_july
## 
##     1     2     3     5     6     7     8     9    10    11    12    13 
## 0.004 0.002 0.006 0.036 0.058 0.060 0.020 0.090 0.126 0.160 0.030 0.040 
##    14    15    16    17    18    19    20 
## 0.128 0.110 0.050 0.002 0.036 0.028 0.016
# Print zone_prop_other
zone_prop_other
## 
##     1     2     3     4     5     6     7     8     9    10    11    12 
## 0.008 0.008 0.010 0.002 0.028 0.044 0.070 0.018 0.058 0.128 0.163 0.049 
##    13    14    15    16    17    18    19    20 
## 0.039 0.114 0.123 0.056 0.006 0.025 0.032 0.019
# Fix zone_prop_july vector, don't change this
# This line is necessary b/c Greinke didn't pitch to zone 4 at all in July. As a result, zone_prop_july is one observation shorter than zone_prop_other.
zone_prop_july2 <- c(zone_prop_july[1:3], 0.00, zone_prop_july[4:19])
names(zone_prop_july2) <- c(1:20)

# Create zone_prop_diff
zone_prop_diff <- zone_prop_july2 - zone_prop_other

# Print zone_prop_diff
zone_prop_diff
## 
##      1      2      3      4      5      6      7      8      9     10 
## -0.004 -0.006 -0.004 -0.002  0.008  0.014 -0.010  0.002  0.032 -0.002 
##     11     12     13     14     15     16     17     18     19     20 
## -0.003 -0.019  0.001  0.014 -0.013 -0.006 -0.004  0.011 -0.004 -0.003
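
The manual patch works here because zone 4 is the only zone missing from the July table. A more general option, sketched below on toy data, is to convert zone to a factor with all 20 levels before tabulating, so that empty zones appear with a count of zero and the two tables always line up. The vectors july_zone and other_zone are illustrative and are not taken from the greinke data.

```r
# Toy zone vectors (illustrative values only); "july" has no pitches in zone 4
july_zone  <- c(1, 2, 3, 5, 5, 6)
other_zone <- c(1, 2, 3, 4, 5, 6, 6, 7)

# Declaring all 20 levels keeps empty zones in the table as zeros
prop_july  <- prop.table(table(factor(july_zone,  levels = 1:20)))
prop_other <- prop.table(table(factor(other_zone, levels = 1:20)))

# Both tables now have 20 entries in the same order, so they subtract cleanly
prop_diff <- round(prop_july - prop_other, 3)
prop_diff[1:7]
```

With this approach no zone has to be inserted by hand, and the same code would still work if a different zone were empty.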

X.3.7 Plotting zone proportion differences

Add to each zone the corresponding proportion difference you found in the last exercise.

# Plot strike zone grid, don't change this
#plot_grid() #equivalent to the chunk below.

#copied from X.3.3 Locational changes - plotting a grid
##########################################################
# Plot pitch location window
plot(x = c(-2, 2), y = c(0, 5), type = "n",
     main = "Greinke Locational Zone Proportions",
     xlab = "Horizontal Location (ft.; Catcher's View)",
     ylab = "Vertical Location (ft.)")

# Add the grid lines
grid(lty = "solid", col = "black")
###########################################################


# Create for loop
# Note: DataCamp uses zone_prop_diff[i, ] because zone_prop_diff is a data frame there;
# here it is a one-dimensional table, so [i] is used.
for(i in 1:20) {
  text(mean(greinke_sub$zone_px[greinke_sub$zone == i]),
       mean(greinke_sub$zone_pz[greinke_sub$zone == i]),
       zone_prop_diff[i], cex = 1.5)
}

str(zone_prop_diff)
##  table [1:20(1d)] -0.004 -0.006 -0.004 -0.002 0.008 0.014 -0.01 0.002 0.032 -0.002 ...
##  - attr(*, "dimnames")=List of 1
##   ..$ : chr [1:20] "1" "2" "3" "4" ...

Interpretation

  • A positive number indicates pitches were thrown to that part of the grid more often in July than other months.
  • A negative number indicates pitches were thrown to that portion of the zone less often.
  • Greinke pitched to zone 9 about 3.2 percentage points more often in July than in the other months.
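
The zone 9 value of 0.032 is a difference between two proportions, so it reads most naturally as percentage points rather than as a relative change. The short calculation below uses the rounded zone 9 values printed in zone_prop_july and zone_prop_other to show both quantities; the variable names are illustrative.

```r
# Zone 9 proportions as printed in zone_prop_july and zone_prop_other
p_july  <- 0.090
p_other <- 0.058

abs_diff <- p_july - p_other     # difference in proportions
rel_diff <- abs_diff / p_other   # change relative to the other-months rate

round(100 * abs_diff, 1)   # percentage points
round(100 * rel_diff, 1)   # percent, relative to other months
```

With these rounded inputs, the July share is 3.2 percentage points higher, which corresponds to a relative increase of roughly 55% over the other-months share.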

X.3.8 Location and ball-strike count

Evaluate the difference in Greinke’s propensity to throw to each zone location depending on the count (i.e. number of balls and strikes).

# Create greinke_zone_tab
greinke_zone_tab <- table(greinke_sub$zone, greinke_sub$bs_count)

# Create zone_count_prop
zone_count_prop <- round(prop.table(greinke_zone_tab, margin = 2), 3)

# Print zone_count_prop
zone_count_prop
##     
##        0-0   0-1   0-2   1-0   1-1   1-2   2-0   2-1   2-2   3-0   3-1
##   1  0.007 0.002 0.006 0.010 0.005 0.025 0.000 0.000 0.000 0.000 0.000
##   2  0.005 0.007 0.018 0.007 0.000 0.014 0.000 0.012 0.007 0.000 0.000
##   3  0.007 0.012 0.012 0.003 0.008 0.025 0.000 0.006 0.010 0.000 0.000
##   4  0.001 0.009 0.006 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
##   5  0.039 0.035 0.012 0.026 0.027 0.018 0.036 0.065 0.007 0.053 0.020
##   6  0.065 0.024 0.006 0.056 0.030 0.021 0.107 0.053 0.041 0.000 0.100
##   7  0.072 0.078 0.054 0.056 0.077 0.050 0.024 0.059 0.092 0.105 0.060
##   8  0.018 0.033 0.048 0.010 0.019 0.028 0.000 0.000 0.003 0.000 0.000
##   9  0.065 0.094 0.030 0.066 0.066 0.050 0.083 0.041 0.054 0.158 0.020
##   10 0.131 0.073 0.054 0.171 0.148 0.050 0.167 0.183 0.126 0.105 0.200
##   11 0.191 0.130 0.078 0.197 0.156 0.121 0.179 0.189 0.153 0.263 0.080
##   12 0.049 0.071 0.078 0.036 0.060 0.032 0.012 0.018 0.044 0.000 0.000
##   13 0.026 0.068 0.072 0.026 0.052 0.074 0.024 0.018 0.010 0.053 0.020
##   14 0.096 0.111 0.114 0.138 0.101 0.138 0.167 0.142 0.129 0.105 0.160
##   15 0.132 0.083 0.120 0.115 0.101 0.113 0.131 0.154 0.139 0.105 0.260
##   16 0.058 0.068 0.096 0.023 0.082 0.057 0.000 0.018 0.061 0.053 0.040
##   17 0.001 0.009 0.018 0.003 0.008 0.014 0.000 0.000 0.007 0.000 0.000
##   18 0.012 0.035 0.048 0.016 0.030 0.060 0.000 0.018 0.041 0.000 0.020
##   19 0.012 0.033 0.072 0.033 0.019 0.060 0.060 0.018 0.054 0.000 0.020
##   20 0.011 0.024 0.054 0.007 0.011 0.050 0.012 0.006 0.020 0.000 0.000
##     
##        3-2
##   1  0.014
##   2  0.022
##   3  0.000
##   4  0.000
##   5  0.014
##   6  0.072
##   7  0.058
##   8  0.000
##   9  0.050
##   10 0.266
##   11 0.216
##   12 0.007
##   13 0.014
##   14 0.101
##   15 0.108
##   16 0.007
##   17 0.000
##   18 0.007
##   19 0.029
##   20 0.014

Interpretation

  • With 20 zones and 12 counts, this table is too unwieldy to interpret at a glance.
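
It may help to see what margin = 2 does in prop.table(): each column is divided by its own total, so every entry is the share of pitches in that count that went to a given zone. The sketch below uses a tiny made-up data set, not the greinke data.

```r
# Toy data (illustrative only): zone and ball-strike count for eight pitches
zone     <- c(10, 11, 11, 14, 10, 11, 14, 14)
bs_count <- c("0-0", "0-0", "0-0", "0-0", "0-2", "0-2", "0-2", "0-2")

tab <- table(zone, bs_count)

# margin = 2 divides each column by its column total
by_count <- prop.table(tab, margin = 2)
by_count

# Every column sums to 1
colSums(by_count)
```

In the full table above the proportions are rounded to three decimals, so a column there may sum to slightly more or less than 1.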

X.3.9 0-2 vs. 3-0 locational changes

Create a table of differences for just the 0-2 and 3-0 counts, just like you did with the July and other months table.

# Create zone_count_diff
# Select the columns by name so the code does not depend on column order
zone_count_diff <- zone_count_prop[, "0-2"] - zone_count_prop[, "3-0"]

# Print the table
zone_count_diff
##      1      2      3      4      5      6      7      8      9     10 
##  0.006  0.018  0.012  0.006 -0.041  0.006 -0.051  0.048 -0.128 -0.051 
##     11     12     13     14     15     16     17     18     19     20 
## -0.185  0.078  0.019  0.009  0.015  0.043  0.018  0.048  0.072  0.054

Interpretation

  • The differences for zones 1 through 20 are printed in order in the console.

X.3.10 Plotting count-based locational differences

Plot this on the same zone grid that you used before.

# Plot grid, don't change this
plot(x = c(-2, 2), y = c(0, 5), type = "n",
     main = "Greinke Locational Zone (0-2 vs. 3-0 Counts)",
     xlab = "Horizontal Location (ft.; Catcher's View)",
     ylab = "Vertical Location (ft.)")
grid(lty = "solid", col = "black")

# Add text to the figure for location differences
# Note: DataCamp uses [i, ] because the differences are stored in a data frame there;
# here zone_count_diff is a named vector, so [i] is used.
# round() trims floating-point noise left by the subtraction (e.g., zone 14).
for(i in 1:20) {
  text(mean(greinke_sub$zone_px[greinke_sub$zone == i]),
       mean(greinke_sub$zone_pz[greinke_sub$zone == i]),
       round(zone_count_diff[i], 3), cex = 1.5)
}

Interpretation

  • Greinke throws to the middle of the strike zone less often in 0-2 counts than in 3-0 counts (for example, the difference for zone 11 is -0.185).

X.4 Exploring Batted Ball Outcomes

Minimizing damage on each pitch is the key to run prevention by the pitcher. Therefore, you will look closely at outcomes from pitches thrown by Greinke in different months.

X.4.1 Velocity impact on contact rate

Is increased velocity even associated with better outcomes for Greinke?

Analyze the impact of velocity on the likelihood that a pitch is missed by the batter.

# Create batter_swing
no_swing <- c("Ball", "Called Strike", "Ball in Dirt", "Hit By Pitch")
greinke_ff$batter_swing <- ifelse(greinke_ff$pitch_result %in% no_swing, 0, 1)

# Create swing_ff
swing_ff <- subset(greinke_ff, greinke_ff$batter_swing == 1)

# Create the contact variable
no_contact <- c("Swinging Strike", "Missed Bunt")
swing_ff$contact <- ifelse(swing_ff$pitch_result %in% no_contact, 0, 1)

# Create velo_bin: add one line for "Fast"
swing_ff$velo_bin <- ifelse(swing_ff$start_speed < 90.5, "Slow", NA)

swing_ff$velo_bin <- ifelse(swing_ff$start_speed >= 90.5 & swing_ff$start_speed < 92.5, 
  "Medium", swing_ff$velo_bin)

swing_ff$velo_bin <- ifelse(swing_ff$start_speed >= 92.5, 
  "Fast", swing_ff$velo_bin)

# Aggregate contact rate by velocity bin
tapply(swing_ff$contact, swing_ff$velo_bin, mean)
##      Fast    Medium      Slow 
## 0.7938596 0.8328076 0.8433735

Interpretation

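  • Batters make less contact on higher-velocity fastballs: the contact rate is about 79% in the Fast bin, compared with about 83% and 84% in the Medium and Slow bins.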
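
The three chained ifelse() calls can also be written as a single cut() call. The sketch below assumes the same thresholds (90.5 and 92.5 mph) and uses right = FALSE so that each interval is closed on the left, which matches the >= and < comparisons above. The speeds are made up for illustration.

```r
# Illustrative speeds (mph), including values exactly on the thresholds
start_speed <- c(89.9, 90.5, 91.7, 92.5, 94.2)

# right = FALSE gives intervals [a, b):
# Slow: < 90.5, Medium: >= 90.5 and < 92.5, Fast: >= 92.5
velo_bin <- cut(start_speed,
                breaks = c(-Inf, 90.5, 92.5, Inf),
                labels = c("Slow", "Medium", "Fast"),
                right  = FALSE)

data.frame(start_speed, velo_bin)
```

cut() returns a factor whose levels follow the order of labels, so a later tapply() would report the bins as Slow, Medium, Fast rather than alphabetically.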

X.4.2 Pitch type impact on contact rate

Now that you have seen the relationship between fastball start_speed and contact rate, let’s explore the relationship between pitch_type and contact rate. As you’re going through this exercise, think about whether you expect the same start_speed relationship to hold for other non-fastball pitches.

This time, because average start_speed varies by pitch_type, you will need to reconfigure your velo_bin variable. Specifically, you’ll split each pitch type into 3 groups based on the terciles of start_speed within that pitch type.
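
Within-pitch terciles can be sketched with quantile() and cut(). One detail worth knowing: by default cut() uses intervals that are open on the left, so a value equal to the lowest break (here, the minimum speed) is assigned NA unless include.lowest = TRUE is set. The helper bin_by_tercile() and the speeds below are hypothetical and are not DataCamp's own function or data.

```r
# Hypothetical helper: label each value 1, 2, or 3 by tercile of its own vector
bin_by_tercile <- function(x, include_lowest = TRUE) {
  breaks <- quantile(x, probs = c(0, 1/3, 2/3, 1))
  cut(x, breaks = breaks, labels = FALSE, include.lowest = include_lowest)
}

# Illustrative speeds (mph) for a single pitch type
speeds <- c(88.2, 90.1, 91.3, 91.9, 92.5, 93.0, 94.9)

bin_by_tercile(speeds)                          # minimum falls in bin 1
bin_by_tercile(speeds, include_lowest = FALSE)  # minimum becomes NA
```

Whether the slowest pitch is kept or dropped changes the contact rate of the lowest bin only slightly, but it is one place where two implementations of the same binning can disagree.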

# Datacamp does this in the background.
##########################################################
# Create the swings dataset, which includes only pitches at which a batter has swung
no_swing <- c("Ball", "Called Strike", "Ball in Dirt", "Hit By Pitch")
greinke$batter_swing <- ifelse(greinke$pitch_result %in% no_swing, 0, 1)
swings <- subset(greinke, greinke$batter_swing == 1)

# Create a contact variable
no_contact <- c("Swinging Strike", "Missed Bunt")
swings$contact <- ifelse(swings$pitch_result %in% no_contact, 0, 1)

# Create a new function called bin_pitch_speed() for use in calculating velo_bin.
bin_pitch_speed <- function(x) {
  cut(x, breaks = quantile(x, probs = c(0,1/3,2/3,1)), labels = FALSE)
  }
###########################################################

# Create the subsets for each pitch type
swing_ff <- subset(swings, pitch_type == "FF")
swing_ch <- subset(swings, pitch_type == "CH")
swing_cu <- subset(swings, pitch_type == "CU")
swing_ft <- subset(swings, pitch_type == "FT")
swing_sl <- subset(swings, pitch_type == "SL")

# Make velo_bin variable for each subset
swing_ff$velo_bin <- bin_pitch_speed(swing_ff$start_speed)
swing_ch$velo_bin <- bin_pitch_speed(swing_ch$start_speed)
swing_cu$velo_bin <- bin_pitch_speed(swing_cu$start_speed)
swing_ft$velo_bin <- bin_pitch_speed(swing_ft$start_speed)
swing_sl$velo_bin <- bin_pitch_speed(swing_sl$start_speed)

# Print quantile levels for each pitch
thirds <- c(0, 1/3, 2/3, 1)
quantile(swing_ff$start_speed, probs = thirds)
##        0% 33.33333% 66.66667%      100% 
##      88.2      91.3      92.5      94.9
quantile(swing_ch$start_speed, probs = thirds)
##        0% 33.33333% 66.66667%      100% 
##      80.4      87.8      88.9      91.5
quantile(swing_cu$start_speed, probs = thirds)
##        0% 33.33333% 66.66667%      100% 
##  63.70000  73.30000  75.43333  79.40000
quantile(swing_ft$start_speed, probs = thirds)
##        0% 33.33333% 66.66667%      100% 
##  87.90000  90.50000  91.86667  95.40000
quantile(swing_sl$start_speed, probs = thirds)
##        0% 33.33333% 66.66667%      100% 
##      79.8      86.5      87.6      91.4
head(swing_ff)
##          p_name pitcher_id batter_stand pitch_type    pitch_result
## 2  Zack Greinke     425844            R         FF Swinging Strike
## 5  Zack Greinke     425844            R         FF Swinging Strike
## 12 Zack Greinke     425844            R         FF In play, run(s)
## 14 Zack Greinke     425844            R         FF In play, out(s)
## 15 Zack Greinke     425844            R         FF Swinging Strike
## 22 Zack Greinke     425844            R         FF            Foul
##    atbat_result start_speed    z0     x0  pfx_x  pfx_z     px    pz
## 2        Single        92.4 6.281 -0.760 -1.590 11.400  0.589 3.271
## 5     Strikeout        92.8 6.107 -0.524 -0.558 11.134  1.517 2.193
## 12     Home Run        92.3 6.180 -1.206 -2.761  9.134 -0.165 3.503
## 14     Sac Bunt        90.0 5.987 -1.185 -3.643 10.472 -0.453 2.730
## 15    Strikeout        93.0 5.972 -0.858 -2.308 10.355  0.587 2.567
## 22      Pop Out        92.4 6.273 -0.789 -1.816 11.093 -0.285 3.189
##    break_angle break_length spin_rate spin_dir balls strikes outs
## 2         10.1          2.7  2312.202  187.913     1       1    0
## 5         -0.4          2.8  2242.916  182.859     1       2    0
## 12        15.8          3.5  1928.096  196.749     0       1    1
## 14        21.5          3.6  2164.956  199.108     0       0    1
## 15        13.5          3.1  2136.568  192.519     3       2    0
## 22        12.8          2.9  2236.084  189.261     0       1    1
##     game_date year month day inning inning_topbot batted_ball_type
## 2  2015-10-03 2015    10  03      3           top                 
## 5  2015-10-03 2015    10  03      8           top                 
## 12 2015-10-03 2015    10  03      5           top               FB
## 14 2015-10-03 2015    10  03      3           top               GB
## 15 2015-10-03 2015    10  03      4           top                 
## 22 2015-10-03 2015    10  03      7           top                 
##    batted_ball_velocity   hc_x   hc_y pitch_id distance_feet  july
## 2                   104 123.56  97.26       95             0 other
## 5                    NA   0.00   0.00      374            NA other
## 12                  103  50.88  31.17      219           425 other
## 14                   NA 117.51 192.03      108            NA other
## 15                   NA   0.00   0.00      145            NA other
## 22                   73 175.05 132.80      319             0 other
##    bs_count late_in_game batter_swing contact velo_bin
## 2       1-1            0            1       0        2
## 5       1-2            1            1       0        3
## 12      0-1            0            1       1        2
## 14      0-0            0            1       1        1
## 15      3-2            0            1       0        3
## 22      0-1            1            1       1        2

X.4.3 Velocity impact on contact by pitch type

Now that you’ve created your new velo_bin variable, in this exercise you will evaluate whether there is any relationship between start_speed and contact rate by pitch type.

# Calculate contact rate by velocity for swing_ff
tapply(swing_ff$contact, swing_ff$velo_bin, mean)
##         1         2         3 
## 0.8356808 0.8413462 0.7815534
# Calculate contact rate by velocity for swing_ft
tapply(swing_ft$contact, swing_ft$velo_bin, mean)
##         1         2         3 
## 0.8070175 0.8571429 0.8703704
# Calculate contact rate by velocity for swing_ch
tapply(swing_ch$contact, swing_ch$velo_bin, mean)
##         1         2         3 
## 0.7603306 0.6666667 0.7672414
# Calculate contact rate by velocity for swing_cu
tapply(swing_cu$contact, swing_cu$velo_bin, mean)
##         1         2         3 
## 0.8666667 0.8333333 0.8928571
# Calculate contact rate by velocity for swing_sl
tapply(swing_sl$contact, swing_sl$velo_bin, mean)
##         1         2         3 
## 0.7164179 0.7293233 0.6953125
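# Note: these rates do not match the DataCamp results. Two details of this reconstruction may contribute:
# (1) no_swing lists "Ball in Dirt", but the pitch_result level shown by str() is "Ball In Dirt",
#     so any pitches with that result are counted as swings with contact;
# (2) cut() without include.lowest = TRUE assigns NA to the slowest pitch of each type.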

Interpretation

  • The relationship between contact rate and velocity for other pitches is less clear than it was for four-seam fastballs.
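
When results differ from a reference, one quick check is whether every label in a lookup vector such as no_swing actually occurs in the data. %in% is case-sensitive, so a label that differs only in capitalization never matches and produces no warning. The sketch below uses a toy pitch_result factor whose "Ball In Dirt" spelling follows the levels printed by str(); everything else is illustrative.

```r
# Toy pitch results; "Ball In Dirt" is spelled as in the str() output
pitch_result <- factor(c("Ball", "Ball In Dirt", "Called Strike",
                         "Swinging Strike", "Foul"))

no_swing <- c("Ball", "Called Strike", "Ball in Dirt", "Hit By Pitch")

# Labels in no_swing that never occur in the data: typos, case differences,
# or results that simply did not happen
unmatched <- setdiff(no_swing, levels(pitch_result))
unmatched

# A case-insensitive comparison is one way to make the match more forgiving
batter_swing <- ifelse(tolower(as.character(pitch_result)) %in% tolower(no_swing), 0, 1)
batter_swing
```

An unmatched label is not always an error (no batter may have been hit by a pitch, for example), so the output of setdiff() is a list of candidates to inspect rather than a list of bugs.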