Amateur Scouting Questionnaire

Question 1

To determine the most efficient scenario for draft day, I compiled several years of MLB draft data (from the baseballr package) and examined both the signing bonuses paid at each pick position and the number of players drafted there who had reached the big leagues as of October 2022. For the 2017 through 2022 drafts, the average signing bonus at each pick position is as follows:

library(tidyverse)
library(baseballr)
# Pull the 2017-2022 drafts and stack them into one table
draft1 <- get_draft_mlb(2017)
draft2 <- get_draft_mlb(2018)
draft3 <- get_draft_mlb(2019)
draft4 <- get_draft_mlb(2020)
draft5 <- get_draft_mlb(2021)
draft6 <- get_draft_mlb(2022)
overall_draft <- rbind(draft1, draft2, draft3, draft4, draft5, draft6, fill = TRUE)
# Keep the nine pick slots that appear in the two scenarios
draft_analysis <- overall_draft %>%
  filter(pick_number %in% c(4, 15, 25, 35, 75, 90, 120, 180, 200)) %>%
  select(pick_number, signing_bonus)
# Convert to numeric before filtering so that > 0 is a numeric comparison
signing <- draft_analysis %>%
  mutate(across(everything(), as.numeric)) %>%
  filter(!is.na(signing_bonus), signing_bonus > 0)
# Average bonus at each pick slot
signing_avgs <- signing %>%
  group_by(pick_number) %>%
  summarize(signing_bonus = mean(signing_bonus))
signing_avgs

## # A tibble: 9 × 2
##   pick_number signing_bonus
##         <dbl>         <dbl>
## 1           4      6773233.
## 2          15      3687083.
## 3          25      2693720 
## 4          35      2147987 
## 5          75       800600 
## 6          90       677300 
## 7         120       418550 
## 8         180       190000 
## 9         200       122880

scenario_A <- signing_avgs[c(1, 5, 6, 8, 9), ]
scenario_B <- signing_avgs[c(2, 3, 4, 5, 7), ]
sum(scenario_A$signing_bonus)

## [1] 8564013

sum(scenario_B$signing_bonus)

## [1] 9747940

Scenario A’s picks would cost roughly 1.2 million dollars less in signing bonuses than scenario B’s. However, money isn’t the only factor in determining our draft class. Looking at 19 years’ worth of MLB drafts (2003 through 2021), the share of players who reached the majors falls as the draft goes on.

# 2002 is fetched but left out of the rbind() below, so all_picks covers 2003-2021 (19 drafts)
drafts <- get_draft_mlb(2002)
drafts1 <- get_draft_mlb(2003)
drafts2 <- get_draft_mlb(2004)
drafts3 <- get_draft_mlb(2005)
drafts4 <- get_draft_mlb(2006)
drafts5 <- get_draft_mlb(2007)
drafts6 <- get_draft_mlb(2008)
drafts7 <- get_draft_mlb(2009)
drafts8 <- get_draft_mlb(2010)
drafts9 <- get_draft_mlb(2011)
drafts10 <- get_draft_mlb(2012)
drafts11 <- get_draft_mlb(2013)
drafts12 <- get_draft_mlb(2014)
drafts13 <- get_draft_mlb(2015)
drafts14 <- get_draft_mlb(2016)
drafts15 <- get_draft_mlb(2017)
drafts16 <- get_draft_mlb(2018)
drafts17 <- get_draft_mlb(2019)
drafts18 <- get_draft_mlb(2020)
drafts19 <- get_draft_mlb(2021)
all_picks<- rbind(drafts1, drafts2, drafts3, drafts4, drafts5, drafts6, drafts7,
               drafts8, drafts9, drafts10, drafts11, drafts12, drafts13, drafts14,
               drafts15, drafts16, drafts17, drafts18, drafts19, fill = TRUE)

# Keep only the nine pick slots under consideration and the columns needed to check for a debut
rank_picks <- all_picks %>%
  filter(pick_number %in% c(4, 15, 25, 35, 75, 90, 120, 180, 200)) %>%
  select(pick_number, person_first_last_name, person_active, person_mlb_debut_date)
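# Flag a pick as having played if an MLB debut date is on record
# (a missing date may be a true NA or the literal string "NA", so check both)
played <- rank_picks %>%
  mutate(Played = ifelse(is.na(person_mlb_debut_date) | person_mlb_debut_date == "NA", 0, 1))
# Count debuts per pick slot, then divide by the number of players drafted there
played %>%
  group_by(pick_number) %>%
  summarize(Played = sum(Played),
            Pct_played = Played / n())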

## # A tibble: 9 × 3
##   pick_number Played Pct_played
##         <int>  <dbl>      <dbl>
## 1           4     16      0.842
## 2          15     14      0.737
## 3          25     13      0.684
## 4          35      9      0.474
## 5          75      9      0.474
## 6          90      7      0.368
## 7         120      5      0.263
## 8         180      7      0.389
## 9         200      3      0.167

Across these 19 drafts, players drafted later have had a much slimmer chance of making a big league roster. The drop is drastic: picks 35 and later have historically reached the majors less than 50% of the time. Averaging the historical debut rates of the five pick slots in each scenario gives about 44.8% for scenario A and about 52.6% for scenario B. For this reason, I would favor scenario B as the draft choice. Scenario B spreads its value across three picks in the top 35 instead of concentrating it in a single early pick, which gives it less variance and more consistency.
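
As a rough check on the scenario comparison, the sketch below recomputes the average debut rate for each scenario from the table above. The counts in `played` are copied from that table; the `drafted` counts are inferred from Played divided by Pct_played and are an assumption rather than something pulled from the data. The object names are illustrative only.

```r
# Debut counts copied from the table above; draft counts are inferred
# from Played / Pct_played (an assumption, not pulled from the data).
debuts <- data.frame(
  pick_number = c(4, 15, 25, 35, 75, 90, 120, 180, 200),
  played      = c(16, 14, 13, 9, 9, 7, 5, 7, 3),
  drafted     = c(19, 19, 19, 19, 19, 19, 19, 18, 18)
)
debuts$rate <- debuts$played / debuts$drafted

# Pick slots in each scenario, matching the rows used for the bonus totals
scenario_picks <- list(
  A = c(4, 75, 90, 180, 200),
  B = c(15, 25, 35, 75, 120)
)

# Average historical debut rate across the five slots in a scenario
avg_debut_rate <- sapply(scenario_picks, function(picks) {
  mean(debuts$rate[debuts$pick_number %in% picks])
})

# Expected number of eventual big leaguers (sum of the slot rates)
expected_debuts <- sapply(scenario_picks, function(picks) {
  sum(debuts$rate[debuts$pick_number %in% picks])
})

round(avg_debut_rate, 3)
round(expected_debuts, 2)
```

Under these assumptions the averages land at roughly 45% for scenario A and 53% for scenario B. Dividing each scenario’s total bonus cost by its expected number of debuts would give a rough cost per future big leaguer.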

Some additional information would have been beneficial in this choice. Mainly, I would have liked to know the player’s likelihood of signing at each potential draft position. This could be done by collecting data from each draft position and looking at how many players out of the total signed. Knowing the potential salary (or signing bonus) limit for the team would have also been helpful, as well as organizational position depth.
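
The signing-rate idea could be sketched as follows. The data frame here is made up for illustration, and treating a missing or zero bonus as "did not sign" is my assumption; real draft data may record unsigned players differently.

```r
# Hypothetical sample: one row per drafted player. A missing or zero
# bonus is treated as "did not sign" (an assumption for illustration).
picks <- data.frame(
  pick_number   = c(4, 4, 4, 75, 75, 75, 200, 200, 200),
  signing_bonus = c(6500000, 7000000, 6800000, 800000, NA, 750000, 125000, 0, NA)
)
picks$signed <- !is.na(picks$signing_bonus) & picks$signing_bonus > 0

# Share of players at each pick slot who signed
sign_rate <- aggregate(signed ~ pick_number, data = picks, FUN = mean)
names(sign_rate)[2] <- "pct_signed"
sign_rate
```

With real data, the same grouping could be run over every draft year to see how signing likelihood changes by pick position.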

Question 2

When looking at the makeup of a Major League roster, we see that the majority of players come from the United States (~72%, according to Baseball America, 2021). This number seems staggering when considering the popularity of the game in Latin American and Asian countries. I believe that one area where an organization can gain a competitive advantage is extensive scouting and statistical analysis of players from all over the world.

I acknowledge the intensive efforts of all MLB teams to thoroughly identify talent in these countries. However, I would like to apply the same rigor to these lesser-known talents that goes into scouting American NCAA Division I teams and players. With the global popularity of the game rising, this could be a way to invest organizational resources in players who can be game changers in the next 5-10 years.

One strategic note: With the wealth of player tracking and biomechanical technology available to teams, one area where scouting can gain a competitive advantage is in studying opposing hitters’ swing paths and comparing them to the pitch paths of our pitching staff. In other words, if we know the tendencies in a hitter’s swing path, we can build a roster and game strategy around the pitch paths that hitter has the most difficulty succeeding against.

Question 3

Excel (7): I feel very comfortable with this program. It is one that I use daily, although not to its fullest complexity. Of all of the software, this is the one that I have been using the longest. I use it in my current occupation to clean data and run simple exploratory analyses.

SQL (6): In my limited experience with SQL (mainly MySQL), I have been accurate, efficient, and comfortable with it. It is, however, the tool that I currently use the least. During graduate school I used MySQL for several projects to extract data on topics including sports, revenue and profit, social media (SEO), customer surveys, and advertising efficiency. I am confident that with more practice, I can build toward a mastery of SQL.

R (8.5): R is my favorite tool, hands down! Almost all of my analysis and exploration is done through R and its vast array of packages. The ones I use most frequently are tidyverse (especially dplyr and forcats), baseballr (and Lahman) for constantly growing my knowledge and understanding of the numbers within baseball, and ggplot2 for the wide range of visualization possibilities. The nflverse and all of its included capabilities make analyzing football trends a lot of fun as well. I have completed several projects within RMarkdown and I am constantly working to improve my dexterity with the program. I dedicate a lot of time to learning R and growing my proficiency on a daily basis.

Question 4

Pitchers <- c("Ball Tracking Information", "Makeup", "In-Game Performance Results", "Athleticism", "Physical Projection")
Position_Players <- c("Ball Tracking Information", "Athleticism", "Makeup", "In-Game Performance Results", "Physical Projection")
attribute_rankings <- data.frame(Pitchers, Position_Players)
attribute_rankings

##                      Pitchers            Position_Players
## 1   Ball Tracking Information   Ball Tracking Information
## 2                      Makeup                 Athleticism
## 3 In-Game Performance Results                      Makeup
## 4                 Athleticism In-Game Performance Results
## 5         Physical Projection         Physical Projection

For both pitchers and position players, I ranked ball tracking information as the highest evaluation metric. The reasoning behind this is the wealth of unbiased, quantitative data it provides. The fast-growing technology that is available to an organization is opening new paths to player evaluation, and I believe that being on the cutting edge of that technology and data collection is essential to making well-informed decisions. As long as we understand what each metric is devised to quantify, our decision-making process should be sound. Overall, this metric helps evaluators avoid the common biases that plague our predictive processes.

Pitchers: I then chose makeup as the second most important metric for pitchers. The mental tenacity required of a big-league pitcher is difficult to quantify, yet I believe that this asset is something that separates the good pitchers from the great ones. Learning “how to pitch” and being able to mentally self-regulate are pivotal characteristics in a great pitcher. The next most important metric is in-game performance results. While I believe in building and establishing a strong process, results (fielding-independent results specifically) are important for a pitcher. Knowing how to use their stuff within the ebb and flow of a game to yield positive results is a part of the pitching process. Next, I believe that a pitcher’s athleticism is important in developing mechanically sound pitching. This factor matters when considering whether their delivery (or training) will need adjusting during the course of their career. A pitcher with superior athleticism will be able to make these adjustments in order to adapt their game. I ranked physical projection last, as it is less important than the previously mentioned metrics. A player’s physical projection doesn’t tell you the quality of pitcher that they will be once they reach the prime of their development. Additionally, the population of successful pitchers is physically diverse, sometimes to extreme degrees.

Position Players: After ball tracking information, I believe that athleticism is a clear defining qualifier for a position player. The physical nature of playing an entire season of baseball (both defensively and at the plate) demands that position players be well-conditioned and capable of utilizing their athleticism. Simply put, position players are getting bigger, stronger, and faster. A higher level of athleticism leads to a higher chance of success. Next, I believe a player’s makeup helps them take the figurative next step in their development. A player’s baseball IQ, work ethic, tenacity, and willingness to be coached are what ensure a player grows. In order to fully reach his potential, a player cannot simply rely on his natural physical skills. These must work in combination with his makeup. I don’t want to gloss over in-game performance results because, although almost last on my list, these are important. Ultimately, these are the metrics that we are striving to improve in player development. However, I believe in a process of incremental, continual improvement over immediate results. The last metric on my list for position players is physical projection. Much like pitchers, position players come in all shapes and sizes. There have been great players who were undersized, oversized, too skinny, too short, etc. I don’t equate on-field greatness with a particular set of innate physical traits. There is no prototypical position player. I put far more stock in data, athleticism, and “coachability.”

Side note: I believe that all of these categories of predictive metrics work hand in hand, with one determining (and helping) the other in a symbiotic relationship.

Question 5

Using the information provided, I would prefer hitter A. While hitter B has a higher contact percentage on a lower swing rate, hitter A has a quality mix of exit velocity and launch angle that should result in more extra-base hits and more runs produced. That profile could lead to more fly balls, but hitter A’s fly ball rate is half that of hitter B. These numbers are all based on essentially the same number of balls put into play. Although I prefer hitter A, there is a lot of room for improvement in his game. The exit velocity and launch angle combination could become very productive if hitter A improves his contact rate and learns to be more selective with the pitches he swings at. I believe this can be learned.

I would, however, like to have more information on these two players in order to form a more accurate assessment. Some more TrackMan information such as launch direction could be valuable here. This would tell us which areas of the field each player favors, hopefully allowing us better insight into their patterns and potential.

Thank you,

Patrick Milum