baseballr: Exploring Baseball Data with R

A Deep Dive Using the Chicago Cubs

Zoe Doherty

What is baseballr?

  • Created by Bill Petti in 2016
  • Pulls real baseball data from Baseball Reference, FanGraphs, and MLB Statcast
  • Goes beyond data access: it also includes functions to calculate advanced metrics
  • Used by MLB analysts, fantasy sports players, and sports journalists

Why Baseball Data?

Traditional stats like ERA and batting average tell only part of the story.

The package gives access to Statcast, the MLB tracking system that measures:

  • Player sprint speed
  • Pitch velocity and spin rate
  • Exit velocity and launch angle

Main Functions Overview

Category            Function                            What it does
Data Extraction     statcast_search()                   Pull pitch-by-pitch Statcast data
Data Extraction     bref_team_results()                 Pull team game results
Metric Calculation  group_by() and summarise() (dplyr)  Calculate advanced metrics
Visualization       ggplot2 plus Statcast               Spray charts, pitch plots

Data Extraction - Cubs 2024 Results

Code
library(baseballr)
library(dplyr)
library(ggplot2)

cubs_results <- bref_team_results("CHC", 2024)

cubs_results |>
  select(Date, H_A, Opp, R, RA, Win, Loss) |>
  head(10)
# A tibble: 10 × 7
   Date             H_A   Opp       R    RA Win       Loss     
   <chr>            <chr> <chr> <dbl> <dbl> <chr>     <chr>    
 1 Thursday, Mar 28 A     TEX       3     4 Robertson Smyly    
 2 Saturday, Mar 30 A     TEX       2    11 Bradford  Hendricks
 3 Sunday, Mar 31   A     TEX       9     5 Neris     Leclerc  
 4 Monday, Apr 1    H     COL       5     0 Imanaga   Hudson   
 5 Tuesday, Apr 2   H     COL      12     2 Assad     Freeland 
 6 Wednesday, Apr 3 H     COL       9     8 Alzolay   Mears    
 7 Friday, Apr 5    H     LAD       9     7 Smyly     Miller   
 8 Saturday, Apr 6  H     LAD       1     4 Yamamoto  Wicks    
 9 Sunday, Apr 7    H     LAD       8     1 Almonte   Stone    
10 Monday, Apr 8    A     SDP       8     9 Peralta   Alzolay  
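
The Date column above comes back as text ("Thursday, Mar 28") with no year, so it cannot be sorted or plotted as a date directly. A minimal base R sketch of one way to convert it follows. The helper name parse_bref_date is hypothetical (it is not part of baseballr), the " (1)" doubleheader suffix is an assumption about how Baseball Reference marks those games, and %b month parsing assumes an English locale.

```r
# Hypothetical helper (not a baseballr function): turn strings such as
# "Thursday, Mar 28" into Date objects. The season is passed in by the
# caller because the strings themselves carry no year.
parse_bref_date <- function(x, season) {
  x <- sub(" \\([0-9]+\\)$", "", x)   # assumed doubleheader suffix, e.g. " (1)"
  x <- sub("^[A-Za-z]+, ", "", x)     # drop the weekday, leaving "Mar 28"
  # %b matches abbreviated month names in the current locale (English assumed)
  as.Date(paste(x, season), format = "%b %d %Y")
}

dates <- parse_bref_date(
  c("Thursday, Mar 28", "Monday, Apr 1", "Saturday, Jul 13 (1)"),
  season = 2024
)
print(dates)
```

With the results table loaded, the same helper could be applied as mutate(game_date = parse_bref_date(Date, 2024)).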

Data Extraction - Statcast Data

Code
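# playerid is an MLBAM player ID. 669373 returns Tarik Skubal (DET), as the
# player_name column in the output below shows. To pull a Cubs pitcher such as
# Shota Imanaga, pass that pitcher's ID instead (playerid_lookup() can find it).
cubs_statcast <- statcast_search(
  start_date  = "2024-04-01",
  end_date    = "2024-04-30",
  playerid    = 669373,
  player_type = "pitcher"
)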

glimpse(cubs_statcast)
Rows: 448
Columns: 118
$ pitch_type                               <chr> "FF", "FF", "SL", "SI", "FF",…
$ game_date                                <date> 2024-04-28, 2024-04-28, 2024…
$ release_speed                            <dbl> 93.9, 94.8, 88.5, 98.1, 96.7,…
$ release_pos_x                            <dbl> 1.68, 1.49, 1.93, 1.90, 1.48,…
$ release_pos_z                            <dbl> 6.03, 6.07, 5.93, 5.90, 6.10,…
$ player_name                              <chr> "Skubal, Tarik", "Skubal, Tar…
$ batter                                   <dbl> 641658, 641658, 641658, 68011…
$ pitcher                                  <dbl> 669373, 669373, 669373, 66937…
$ events                                   <chr> "field_out", "", "", "single"…
$ description                              <chr> "hit_into_play", "ball", "bal…
$ spin_dir                                 <lgl> NA, NA, NA, NA, NA, NA, NA, N…
$ spin_rate_deprecated                     <lgl> NA, NA, NA, NA, NA, NA, NA, N…
$ break_angle_deprecated                   <lgl> NA, NA, NA, NA, NA, NA, NA, N…
$ break_length_deprecated                  <lgl> NA, NA, NA, NA, NA, NA, NA, N…
$ zone                                     <dbl> 13, 11, 3, 5, 7, 7, 11, 4, 13…
$ des                                      <chr> "Garrett Hampson flies out to…
$ game_type                                <chr> "R", "R", "R", "R", "R", "R",…
$ stand                                    <chr> "R", "R", "R", "R", "R", "R",…
$ p_throws                                 <chr> "L", "L", "L", "L", "L", "L",…
$ home_team                                <chr> "DET", "DET", "DET", "DET", "…
$ away_team                                <chr> "KC", "KC", "KC", "KC", "KC",…
$ type                                     <chr> "X", "B", "B", "X", "S", "S",…
$ hit_location                             <int> 9, NA, NA, 9, NA, NA, NA, NA,…
$ bb_type                                  <chr> "fly_ball", "", "", "line_dri…
$ balls                                    <int> 2, 1, 0, 0, 0, 0, 3, 3, 2, 1,…
$ strikes                                  <int> 0, 0, 0, 2, 1, 0, 1, 0, 0, 0,…
$ game_year                                <int> 2024, 2024, 2024, 2024, 2024,…
$ pfx_x                                    <dbl> 0.77, 0.35, -0.22, 1.34, 0.80…
$ pfx_z                                    <dbl> 1.35, 1.44, 0.46, 0.79, 1.31,…
$ plate_x                                  <dbl> -0.010756424, -2.570698085, 0…
$ plate_z                                  <dbl> 1.46393901, 2.54414756, 3.159…
$ on_3b                                    <dbl> 592669, 592669, 592669, NA, N…
$ on_2b                                    <dbl> NA, NA, NA, NA, NA, NA, NA, N…
$ on_1b                                    <dbl> 680118, 680118, 680118, 59266…
$ outs_when_up                             <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ inning                                   <dbl> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,…
$ inning_topbot                            <chr> "Top", "Top", "Top", "Top", "…
$ hc_x                                     <dbl> 183.42, NA, NA, 205.28, NA, N…
$ hc_y                                     <dbl> 125.13, NA, NA, 113.12, NA, N…
$ tfs_deprecated                           <lgl> NA, NA, NA, NA, NA, NA, NA, N…
$ tfs_zulu_deprecated                      <lgl> NA, NA, NA, NA, NA, NA, NA, N…
$ umpire                                   <lgl> NA, NA, NA, NA, NA, NA, NA, N…
$ sv_id                                    <lgl> NA, NA, NA, NA, NA, NA, NA, N…
$ vx0                                      <dbl> -6.082126, -11.373332, -2.667…
$ vy0                                      <dbl> -136.4702, -137.4912, -128.85…
$ vz0                                      <dbl> -9.095291, -6.727263, -1.6810…
$ ax                                       <dbl> 10.945953, 6.978991, -1.88493…
$ ay                                       <dbl> 29.13512, 29.85800, 26.29682,…
$ az                                       <dbl> -13.52822, -12.57505, -26.881…
$ sz_top                                   <dbl> 3.450000, 3.437961, 3.463895,…
$ sz_bot                                   <dbl> 1.590000, 1.590393, 1.496189,…
$ hit_distance_sc                          <dbl> 234, NA, NA, 250, NA, NA, NA,…
$ launch_speed                             <dbl> 82.5, NA, NA, 101.9, NA, NA, …
$ launch_angle                             <dbl> 51, NA, NA, 13, NA, NA, NA, N…
$ effective_speed                          <dbl> 94.1, 94.6, 88.4, 98.0, 96.8,…
$ release_spin_rate                        <dbl> 2208, 2227, 2079, 2173, 2305,…
$ release_extension                        <dbl> 6.4, 6.3, 6.2, 6.2, 6.3, 6.3,…
$ game_pk                                  <dbl> 746483, 746483, 746483, 74648…
$ fielder_2                                <dbl> 668670, 668670, 668670, 66867…
$ fielder_3                                <dbl> 592192, 592192, 592192, 59219…
$ fielder_4                                <dbl> 690993, 690993, 690993, 69099…
$ fielder_5                                <dbl> 663837, 663837, 663837, 66383…
$ fielder_6                                <dbl> 656716, 656716, 656716, 65671…
$ fielder_7                                <dbl> 682985, 682985, 682985, 68298…
$ fielder_8                                <dbl> 678009, 678009, 678009, 67800…
$ fielder_9                                <dbl> 672761, 672761, 672761, 67276…
$ release_pos_y                            <dbl> 54.11, 54.19, 54.33, 54.25, 5…
$ estimated_ba_using_speedangle            <dbl> 0.024, NA, NA, 0.892, NA, NA,…
$ estimated_woba_using_speedangle          <dbl> 0.027000, NA, NA, 0.896000, N…
$ woba_value                               <dbl> 0.0, NA, NA, 0.9, NA, NA, 0.7…
$ woba_denom                               <int> 1, NA, NA, 1, NA, NA, 1, NA, …
$ babip_value                              <int> 0, NA, NA, 1, NA, NA, 0, NA, …
$ iso_value                                <int> 0, NA, NA, 0, NA, NA, 0, NA, …
$ launch_speed_angle                       <int> 3, NA, NA, 4, NA, NA, NA, NA,…
$ at_bat_number                            <dbl> 53, 53, 53, 52, 52, 52, 51, 5…
$ pitch_number                             <dbl> 3, 2, 1, 3, 2, 1, 5, 4, 3, 2,…
$ pitch_name                               <chr> "4-Seam Fastball", "4-Seam Fa…
$ home_score                               <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ away_score                               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ bat_score                                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ fld_score                                <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ post_away_score                          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ post_home_score                          <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ post_bat_score                           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ post_fld_score                           <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ if_fielding_alignment                    <chr> "Infield shade", "Standard", …
$ of_fielding_alignment                    <chr> "Standard", "Standard", "Stan…
$ spin_axis                                <dbl> 141, 147, 154, 139, 144, 137,…
$ delta_home_win_exp                       <dbl> 0.059, -0.007, -0.004, -0.036…
$ delta_run_exp                            <dbl> -0.389, 0.060, 0.030, 0.458, …
$ bat_speed                                <dbl> 71.0, NA, NA, 71.8, NA, NA, N…
$ swing_length                             <dbl> 7.6, NA, NA, 6.6, NA, NA, NA,…
$ estimated_slg_using_speedangle           <dbl> 0.036, NA, NA, 1.168, NA, NA,…
$ delta_pitcher_run_exp                    <dbl> 0.389, -0.060, -0.030, -0.458…
$ hyper_speed                              <dbl> 88.0, NA, NA, 101.9, NA, NA, …
$ home_score_diff                          <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
$ bat_score_diff                           <int> -3, -3, -3, -3, -3, -3, -3, -…
$ home_win_exp                             <dbl> 0.882, 0.889, 0.893, 0.929, 0…
$ bat_win_exp                              <dbl> 0.118, 0.111, 0.107, 0.071, 0…
$ age_pit_legacy                           <int> 27, 27, 27, 27, 27, 27, 27, 2…
$ age_bat_legacy                           <int> 29, 29, 29, 31, 31, 31, 32, 3…
$ age_pit                                  <int> 28, 28, 28, 28, 28, 28, 28, 2…
$ age_bat                                  <int> 30, 30, 30, 31, 31, 31, 32, 3…
$ n_thruorder_pitcher                      <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
$ n_priorpa_thisgame_player_at_bat         <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ pitcher_days_since_prev_game             <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
$ batter_days_since_prev_game              <int> 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,…
$ pitcher_days_until_next_game             <int> 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,…
$ batter_days_until_next_game              <int> 3, 3, 3, 5, 5, 5, 1, 1, 1, 1,…
$ api_break_z_with_gravity                 <dbl> 1.24, 1.11, 2.44, 1.58, 1.12,…
$ api_break_x_arm                          <dbl> 0.77, 0.35, -0.22, 1.34, 0.80…
$ api_break_x_batter_in                    <dbl> -0.77, -0.35, 0.22, -1.34, -0…
$ arm_angle                                <dbl> 52.7, 49.2, 44.9, 42.2, 50.1,…
$ attack_angle                             <dbl> 14.955356, NA, NA, -2.122450,…
$ attack_direction                         <dbl> 2.5391932, NA, NA, 21.7022771…
$ swing_path_tilt                          <dbl> 38.78021, NA, NA, 30.90978, N…
$ intercept_ball_minus_batter_pos_x_inches <dbl> 37.60109, NA, NA, 39.73812, N…
$ intercept_ball_minus_batter_pos_y_inches <dbl> 31.62287, NA, NA, 13.76641, N…

Understanding the Metrics

Before we calculate, a quick explainer for non-baseball fans:

  • ERA (Earned Run Average): average earned runs a pitcher allows per 9 innings; lower is better
  • Exit Velocity: how hard the ball comes off the bat, in mph; higher generally means better contact
  • Launch Angle: the vertical angle at which the ball leaves the bat; roughly 10 to 30 degrees is ideal for hitters
  • Whiff Rate: percentage of swings that miss completely; higher means the pitcher is tougher to hit
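
To make the ERA definition concrete, here is a small worked sketch. The function name era and the sample pitching line are made up for illustration. Innings are passed in as outs recorded so that thirds of an inning stay exact.

```r
# ERA = 9 * earned runs / innings pitched.
# Hypothetical helper: innings are supplied as outs recorded (3 outs = 1 inning).
era <- function(earned_runs, outs_recorded) {
  innings <- outs_recorded / 3
  round(9 * earned_runs / innings, 2)
}

# Made-up line: 2 earned runs over 6 innings (18 outs)
start_era <- era(earned_runs = 2, outs_recorded = 18)
print(start_era)
```

Two earned runs in six innings scale up to three per nine innings, so this line corresponds to a 3.00 ERA.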

Metric Calculation - Pitch Analysis

Code
cubs_statcast |>
  filter(!is.na(pitch_type)) |>
  group_by(pitch_type) |>
  summarise(
    avg_velocity  = round(mean(release_speed, na.rm = TRUE), 1),
    avg_spin_rate = round(mean(release_spin_rate, na.rm = TRUE), 0),
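    # Note: this is the share of ALL pitches that were swinging strikes
    # (swinging strikes per pitch), not misses per swing, and it does not
    # count "swinging_strike_blocked"
    whiff_rate    = round(mean(description == "swinging_strike",
                               na.rm = TRUE) * 100, 1),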
    n_pitches     = n()
  ) |>
  arrange(desc(n_pitches))
# A tibble: 6 × 5
  pitch_type avg_velocity avg_spin_rate whiff_rate n_pitches
  <chr>             <dbl>         <dbl>      <dbl>     <int>
1 FF                 96.1          2206        6.7       134
2 CH                 85.1          1673       25.4       118
3 SI                 96.2          2097       10.9       101
4 SL                 87.6          2097       13.8        65
5 CU                 78.4          2342        3.4        29
6 FS                 87.7           913        0           1
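
The whiff_rate column above divides by every pitch in the group, while the definition given earlier is misses per swing. A minimal sketch of the per-swing version follows. The helper name whiff_per_swing is hypothetical, and the lists of description values that count as swings and whiffs are assumptions about Statcast's coding (bunt attempts are left out here), so check them against your own data.

```r
# Assumed Statcast `description` values that represent a swing
swing_events <- c("swinging_strike", "swinging_strike_blocked",
                  "foul", "foul_tip", "hit_into_play")
# Assumed subset of swings that missed the ball entirely
whiff_events <- c("swinging_strike", "swinging_strike_blocked")

# Hypothetical helper: percentage of swings that were whiffs
whiff_per_swing <- function(description) {
  swings <- description %in% swing_events
  whiffs <- description %in% whiff_events
  if (!any(swings)) return(NA_real_)   # no swings, so the rate is undefined
  round(100 * sum(whiffs) / sum(swings), 1)
}

# Made-up pitch sequence: 4 swings, 2 of them misses
toy <- c("swinging_strike", "foul", "ball", "hit_into_play",
         "called_strike", "swinging_strike_blocked")
toy_rate <- whiff_per_swing(toy)
print(toy_rate)
```

Inside the summarise() call it could replace the existing expression as whiff_rate = whiff_per_swing(description). The resulting numbers will differ from the table above because the denominator changes from pitches to swings.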

Metric Calculation - Cubs Season Summary

Code
cubs_results |>
  mutate(
    R  = as.numeric(R),
    RA = as.numeric(RA),
    run_diff = R - RA,
    result = ifelse(run_diff > 0, "Win", "Loss")
  ) |>
  filter(!is.na(run_diff)) |>
  summarise(
    total_wins       = sum(result == "Win"),
    total_losses     = sum(result == "Loss"),
    avg_runs_scored  = round(mean(R, na.rm = TRUE), 2),
    avg_runs_allowed = round(mean(RA, na.rm = TRUE), 2),
    avg_run_diff     = round(mean(run_diff, na.rm = TRUE), 2)
  )
# A tibble: 1 × 5
  total_wins total_losses avg_runs_scored avg_runs_allowed avg_run_diff
       <int>        <int>           <dbl>            <dbl>        <dbl>
1         83           79            4.54             4.13         0.41

Data Visualization - Skubal Pitch Movement

Code
# The data pulled above is for Tarik Skubal (playerid 669373), so the plot is
# labeled accordingly. pfx_x and pfx_z are in feet; multiply by 12 for inches.
cubs_statcast |>
  filter(!is.na(pitch_type)) |>
  ggplot(aes(x = pfx_x * 12, y = pfx_z * 12, color = pitch_type)) +
  geom_point(alpha = 0.6, size = 2) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "white") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "white") +
  labs(
    title    = "Tarik Skubal Pitch Movement April 2024",
    subtitle = "Horizontal vs Vertical Break in inches",
    x        = "Horizontal Break inches",
    y        = "Vertical Break inches",
    color    = "Pitch Type",
    caption  = "Data: MLB Statcast via baseballr"
  ) +
  theme_dark() +
  scale_color_brewer(palette = "Set1")

Data Visualization - Cubs 2024 Run Differential

Code
cubs_results |>
  mutate(
    R        = as.numeric(R),
    RA       = as.numeric(RA),
    game_num = row_number(),
    run_diff = R - RA,
    result   = ifelse(run_diff > 0, "Win", "Loss")
  ) |>
  filter(!is.na(run_diff)) |>
  ggplot(aes(x = game_num, y = run_diff, fill = result)) +
  geom_col(alpha = 0.8) +
  scale_fill_manual(values = c("Win" = "#0E3386", "Loss" = "#CC3433")) +
  labs(
    title   = "Chicago Cubs 2024 Run Differential by Game",
    x       = "Game Number",
    y       = "Run Differential",
    fill    = "Result",
    caption = "Data: Baseball Reference via baseballr"
  ) +
  theme_minimal()

Why baseballr Matters

  • Makes pro-level baseball analytics accessible to anyone in R
  • Powers fantasy sports decision-making with real metrics
  • Used by sports journalists and MLB front offices
  • A gateway into the growing field of sports data science

Thank You

Go Cubs Go!

Resources:

  • install.packages("baseballr")
  • baseballr docs at billpetti.github.io/baseballr
  • MLB Statcast at baseballsavant.mlb.com