Section 1:

In baseball, a batter records a hit by reaching base safely after putting the ball in play without a fielding error. A batter hits a “home run” when they round all four bases on a single batted ball. Usually this is achieved by batting the ball over the fence and out of the reach of all opposing players. But which characteristics of the pitch and of the hit itself allow us to tell how far a player will hit the ball? This is the question we aim to answer throughout this project. To help with this task, we created a data frame from 2021 MLB regular season data gathered by Baseball Savant and selected the variables we thought were of interest. The data frame comprises 5935 observations, each corresponding to a distinct home run. The variables of interest include the type of hit (fly ball or line drive), the type of pitch (cutter, slider, etc.), the speed at which the ball was batted, and the angle at which it was batted.

Section 2:

Outcome and predictor explanation:

We seek to predict the distance traveled by the ball after it has been hit. To explain this outcome we will use several predictors. We will first look at the angle, in degrees, at which the ball is hit. This angle, more commonly known as the launch angle, is formed between a line parallel to the ground and the initial trajectory of the ball above it. We will then look at the speed, in mph, at which the ball comes off the bat, more commonly known as exit velocity or launch speed, and its impact on the distance traveled. We will also look at two categorical variables. The first is whether the hit is a fly ball or a line drive. A fly ball is characterized by a launch angle between 25 and 50 degrees, and a line drive by a launch angle between 10 and 25 degrees. The second is the type of pitch thrown by the pitcher to the batter, which can be one of 11 different types. Pitch types are differentiated in part by their velocity and trajectory, but a more defining characteristic is the way the pitcher grips the ball while throwing the pitch.
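As a small illustration of the fly ball and line drive definitions above, the thresholds can be written as a function. This is only a sketch in Python: the function name and the "other" label are hypothetical, and assigning an angle of exactly 25 degrees to fly balls is an assumption, since the two ranges in the text share that endpoint.

```python
def classify_batted_ball(launch_angle_deg):
    """Label a batted ball from its launch angle, using the ranges stated in the text."""
    if 25 <= launch_angle_deg <= 50:
        return "fly_ball"    # assumption: exactly 25 degrees counts as a fly ball
    if 10 <= launch_angle_deg < 25:
        return "line_drive"
    return "other"           # hypothetical label for angles outside both ranges


for angle in (20, 25, 31):
    print(angle, classify_batted_ball(angle))
```

In the data set itself the label is already provided in the bb_type column, so a function like this would only serve as a consistency check.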

Challenges we foresee working with the data:

One of the most significant challenges that may arise while working with this data set is the possibility of our variables interacting with each other. This could make it hard to single out which characteristics matter most in determining how far a home run travels. It is possible that multiple variables, such as batted ball type and launch angle, are equally important and directly correlated with hit distance, making modeling difficult. The other main challenge that we foresee has nothing to do with the data itself. Of our group, only one member is well versed in baseball. This makes it harder to interpret the meaning and significance of the variables and terms in the data. However, while background knowledge in the domain of interest is useful for a statistical analysis, it is not impossible to perform an effective analysis without it. In addition, our analysis may carry less inherent bias as a result of our limited background knowledge. With fewer preconceived associations between terms and variables, our findings may end up being more organic rather than forced to fit our own beliefs.

Preliminary exploratory data analysis:

From this facet wrap, it appears that home runs hit off Eephus and knuckleball pitches travel farther than those hit off the other pitch types. However, very few of the pitches were Eephus pitches or knuckleballs (the missing standard deviation for the Eephus indicates a single observation), so these higher mean distances may simply reflect a small sample.

## # A tibble: 11 × 3
##    pitch_name      mean_hit sd_hit
##    <chr>              <dbl>  <dbl>
##  1 Eephus              429    NA  
##  2 Knuckleball         416.   23.3
##  3 4-Seam Fastball     402.   25.9
##  4 Fastball            401.   26.7
##  5 Sinker              401.   26.0
##  6 Cutter              401.   25.4
##  7 Slider              400.   26.0
##  8 Changeup            400.   26.3
##  9 Curveball           399.   26.0
## 10 Knuckle Curve       399.   25.8
## 11 Split-Finger        397.   25.5
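A grouped summary of this kind can be sketched as follows. The rows are hypothetical toy values, not the project data, and the sketch is in Python with only the standard library rather than the tooling that produced the table above. It also shows why a pitch type with a single observation has no standard deviation.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical toy rows of (pitch_name, hit_distance); not the project data.
rows = [
    ("Eephus", 429),
    ("Slider", 380), ("Slider", 400), ("Slider", 420),
    ("Cutter", 390), ("Cutter", 410),
]

# Collect the hit distances for each pitch type.
by_pitch = defaultdict(list)
for pitch_name, hit_distance in rows:
    by_pitch[pitch_name].append(hit_distance)

summary = {}
for pitch_name, distances in by_pitch.items():
    # The sample standard deviation needs at least two observations,
    # so a group of size one gets None (shown as NA in the table above).
    sd_hit = stdev(distances) if len(distances) > 1 else None
    summary[pitch_name] = (len(distances), mean(distances), sd_hit)

# Print groups from the highest mean distance to the lowest.
for pitch_name, (n, mean_hit, sd_hit) in sorted(
    summary.items(), key=lambda item: -item[1][1]
):
    print(pitch_name, n, mean_hit, sd_hit)
```

Reporting the group size next to the mean makes it easier to see when a high average rests on very few home runs.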

From these pairs plots and the correlation data, we can see that hit distance has a fairly strong positive correlation with launch speed (about 0.63). If a ball is hit faster, it is likely to go farther. We can also see slight negative correlations between hit distance and launch angle, and between launch angle and launch speed. This means that balls hit farther tend to have slightly shallower launch angles, and that shallower launch angles tend to come with higher launch speeds. This may pose an issue during data analysis if these variables affect each other, since our regression models assume that the predictors do not strongly depend on one another.

##   cor(hit_distance, launch_speed) cor(hit_distance, launch_angle)
## 1                       0.6308646                      -0.1674865
##   cor(launch_angle, launch_speed)
## 1                      -0.3302736
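For reference, a Pearson correlation like the ones above can be computed by hand. The sketch below is in Python with only the standard library, and `pearson_r` is a hypothetical helper name. The inputs are the first six hit_distance and launch_speed values listed in the data preview in Section 3. Six rows are far too few to reproduce the full-sample value, so the result only illustrates the calculation.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# First six values of each column as listed in the data preview.
hit_distance = [390, 418, 401, 393, 378, 402]
launch_speed = [100.5, 103.8, 108.9, 103.9, 95.3, 107.4]

print(round(pearson_r(hit_distance, launch_speed), 3))
```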

Comparing these two facet grids for fly balls and line drives, we can see that a considerably higher proportion of home runs are fly balls and that they have a higher mean hit distance than line drives. This naturally raises the question of whether line drives are simply underrepresented and could in fact be a better hit than fly balls. But considering that this data comes from an actual baseball season, it may also show that line drives are more difficult to hit and are not optimal for long-distance home runs.

## # A tibble: 2 × 3
##   bb_type    mean_hit_b sd_hit_b
##   <chr>           <dbl>    <dbl>
## 1 fly_ball         402.     26.0
## 2 line_drive       393.     24.6
## # A tibble: 3 × 5
##   term         estimate std.error statistic   p.value
##   <chr>           <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)    -0.139    7.16     -0.0194 0.984    
## 2 launch_speed    3.78     0.0623   60.6    0        
## 3 launch_angle    0.227    0.0528    4.31   0.0000170
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
## 1     0.400         0.400  20.1     1976.       0     2 -26237. 52481. 52508.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
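To make the regression output above concrete, the fitted equation can be applied to one observation. The sketch below is in Python and uses the coefficients as printed (rounded to the displayed digits) together with the first row of the data preview in Section 3 (launch_speed 100.5, launch_angle 31, hit_distance 390). The function and constant names are hypothetical.

```python
# Coefficients copied from the regression table, rounded as displayed there.
INTERCEPT = -0.139
B_LAUNCH_SPEED = 3.78
B_LAUNCH_ANGLE = 0.227


def predict_hit_distance(launch_speed, launch_angle):
    """Predicted hit distance from the two-predictor linear model."""
    return INTERCEPT + B_LAUNCH_SPEED * launch_speed + B_LAUNCH_ANGLE * launch_angle


# First row of the data preview: launch_speed 100.5, launch_angle 31, hit_distance 390.
predicted = predict_hit_distance(100.5, 31)
residual = 390 - predicted
print(round(predicted, 1), round(residual, 1))
```

With these rounded coefficients the prediction is about 386.8 against a recorded 390, a residual of roughly 3.2, which is well inside the residual standard error (sigma) of 20.1 reported above.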

The statistical method we believe will be useful to answer our thesis:

A linear regression model fit on a portion of our data set, combined with bootstrapping, would give us a model for predicting how far a home run will travel based on the variables we decide to include. Fitting the model on one portion of the data would let us test it on the remaining portion and see how well it predicts the distance of a hit from the characteristics of the hit and pitch that appear most strongly correlated with how far the ball flies.
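The idea of fitting on one portion of the data and testing on the rest can be sketched as below. This is an illustration only: the data are synthetic, the 80/20 split is an assumption, a single predictor is used to keep the code short, and the helper names are hypothetical. It is written in Python with the standard library rather than the project's own tooling.

```python
import random
from math import sqrt

random.seed(1)

# Synthetic (launch_speed, hit_distance) pairs; not the project data.
data = []
for _ in range(200):
    speed = random.uniform(90, 115)
    distance = 3.8 * speed + random.gauss(0, 20)
    data.append((speed, distance))

# Assumed 80/20 split into a training portion and a held-out test portion.
random.shuffle(data)
split = int(0.8 * len(data))
train, test = data[:split], data[split:]


def fit_line(pairs):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(pairs, intercept, slope):
    """Root mean squared prediction error on the given pairs."""
    squared_errors = [(y - (intercept + slope * x)) ** 2 for x, y in pairs]
    return sqrt(sum(squared_errors) / len(pairs))


intercept, slope = fit_line(train)
test_rmse = rmse(test, intercept, slope)
print("slope:", round(slope, 2), "test RMSE:", round(test_rmse, 1))
```

Because the test rows play no part in the fit, the test RMSE gives an estimate of how the model would perform on home runs it has not seen.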

What results from the above methods are needed to support our hypothesis:

A small p-value from the bootstrapped model, fit on a sample of the data, would support our hypothesis by indicating that the relationship between our predictors and hit distance is unlikely to be due to chance, while also allowing us to test our model for statistical errors. A low RMSE from the regression model would also support our hypothesis: if the model predicts accurately on cases it was not fit on, then it can predict the distance of a home run from its pitch and hit characteristics with high accuracy. A high R squared or adjusted R squared from the exploratory analysis would indirectly support our hypothesis by showing that there is indeed a relationship between the variables we have chosen and the distance a home run travels. Finally, the correlations among our variables will show whether the variables are related to each other.
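One way bootstrapping could be used to judge a regression slope is a percentile interval, sketched below. The data are synthetic, the 1000 resamples and the 95% level are assumptions, and `slope_of` is a hypothetical helper. This is one common bootstrap procedure and not necessarily the one the final analysis will use.

```python
import random

random.seed(2)

# Synthetic (launch_speed, hit_distance) pairs; not the project data.
pairs = []
for _ in range(150):
    speed = random.uniform(90, 115)
    pairs.append((speed, 3.8 * speed + random.gauss(0, 20)))


def slope_of(sample):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(sample)
    mean_x = sum(x for x, _ in sample) / n
    mean_y = sum(y for _, y in sample) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in sample)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in sample)
    return sxy / sxx


# Resample rows with replacement and refit the slope each time.
boot_slopes = []
for _ in range(1000):
    resample = random.choices(pairs, k=len(pairs))
    boot_slopes.append(slope_of(resample))

# Approximate 95% percentile interval from the sorted bootstrap slopes.
boot_slopes.sort()
lower = boot_slopes[int(0.025 * len(boot_slopes))]
upper = boot_slopes[int(0.975 * len(boot_slopes)) - 1]
print("slope:", round(slope_of(pairs), 2), "interval:", round(lower, 2), round(upper, 2))
```

An interval that excludes zero points the same way as a small p-value: the slope is unlikely to be an artifact of sampling variation.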

Section 3:

Codebook:

Variable name               Description
pitch_type                  type of pitch, abbreviated
game_date                   date of the game in which the home run occurred
release_speed               out-of-hand pitch velocity (mph)
player_name                 name of the home run hitter
batter                      unique number ID associated with the home run hitter
home_run_location           location of home run in terms of outfield positions
batter_hand                 handedness of the batter at time of home run (R=Right, L=Left)
pitcher_throws              handedness of pitcher at time of home run (R=Right, L=Left)
home_team                   abbreviation of home team
away_team                   abbreviation of away team
bb_type                     type of hit the home run was (fly_ball or line_drive)
pitch_pos_hoz               horizontal position of ball when crossing the plate, in feet, relative to the catcher’s view and the center of the plate
pitch_pos_ver               vertical position of ball when crossing the plate, in feet, relative to the catcher’s view and the ground
hit_distance                projected distance of the hit (feet)
launch_speed                velocity of ball off the bat (mph)
launch_angle                angle at which the ball came off the bat (degrees)
effective_speed             derived speed based on the extension of the pitcher’s release
pitch_release_spin_rate     spin rate of pitch (rpm)
pitch_release_extension     release extension of pitch in feet
pitch_name                  full name of pitch
change_home_win_expectancy  the change in home team win expectancy from before the plate appearance to after it
league                      league of the home team

Dimensions:

Rows: 5935

Columns: 22

## Rows: 5,935
## Columns: 22
## $ pitch_type                 <chr> "SI", "SI", "FF", "FC", "SI", "FF", "FF", "…
## $ game_date                  <chr> "2021-09-01", "2021-06-29", "2021-09-15", "…
## $ release_speed              <dbl> 101.2, 100.9, 100.8, 100.5, 100.5, 100.4, 1…
## $ player_name                <chr> "Rosario, Eddie", "Duvall, Adam", "Marsh, B…
## $ batter                     <int> 592696, 594807, 669016, 476704, 665862, 665…
## $ home_run_location          <chr> "center field", "center field", "left cente…
## $ batter_hand                <chr> "L", "R", "L", "L", "L", "L", "L", "R", "R"…
## $ pitcher_throws             <chr> "R", "L", "R", "R", "L", "R", "R", "R", "R"…
## $ home_team                  <chr> "LAD", "PHI", "CWS", "OAK", "PHI", "NYM", "…
## $ away_team                  <chr> "ATL", "MIA", "LAA", "CLE", "MIA", "MIA", "…
## $ bb_type                    <chr> "fly_ball", "fly_ball", "line_drive", "fly_…
## $ pitch_pos_hoz              <dbl> -0.19, -0.29, -0.11, 0.53, 0.82, 0.06, -0.0…
## $ pitch_pos_ver              <dbl> 2.25, 1.96, 2.36, 2.08, 3.76, 3.61, 2.56, 2…
## $ hit_distance               <int> 390, 418, 401, 393, 378, 402, 451, 427, 368…
## $ launch_speed               <dbl> 100.5, 103.8, 108.9, 103.9, 95.3, 107.4, 11…
## $ launch_angle               <int> 31, 25, 20, 31, 29, 31, 26, 22, 25, 25, 22,…
## $ effective_speed            <dbl> 100.3, 102.0, 101.9, 101.1, 101.6, 101.0, 9…
## $ pitch_release_spin_rate    <int> 2170, 2107, 2624, 2709, 2097, 2376, 2505, 2…
## $ pitch_release_extension    <dbl> 5.6, 6.9, 6.8, 6.5, 6.7, 7.0, 6.5, 6.5, 7.2…
## $ pitch_name                 <chr> "Sinker", "Sinker", "4-Seam Fastball", "Cut…
## $ change_home_win_expectancy <dbl> -0.279, -0.091, -0.295, 0.794, -0.236, -0.1…
## $ league                     <chr> "NL", "NL", "AL", "AL", "NL", "NL", "AL", "…