# importing necessary R packages 
library(tidyverse)
library(dplyr)
library(baseballr)
library(sportyR)
library(splancs)
library(spatstat)
library(RColorBrewer)
library(kableExtra)
library(rmarkdown)

1 Introduction

Baseball data have long been analyzed to understand, evaluate, and predict players’ performance, and many types of data are collected during games for this purpose. Batting average and on-base percentage are examples of statistics used to evaluate offensive performance. However, these statistics are not always sufficient to understand the performance of hitters, so analysts seek additional information such as spatial statistics. A heat map on the partitioned strike zone can reveal the strengths and weaknesses of a batter based on pitch location. This helps pitchers understand where to pitch to a given batter, which can increase their chances of getting the batter out. A heat map of batted balls on the baseball field can reveal patterns in the way a batter hits the ball, and it also helps teams decide where to place fielders. Spatial analysis has been used in recent baseball studies. For example, Cross and Sylvan (2015) used a covariance function to model spatial batting ability and, using techniques from geostatistics, developed a method for producing more accurate heat maps than the traditional ones. Spatial analysis has thus enriched the evaluation of baseball players. In this paper, I examine the offensive performance of baseball players with spatial data and identify spatial patterns in the way they hit during games.

2 Description of the Data and Preliminary Data Analysis

I used data for the ten players with the highest batting averages in Major League Baseball during the 2021 season: Trea Turner, Yuli Gurriel, Juan Soto, Michael Brantley, Vladimir Guerrero Jr., Starling Marte, Bryce Harper, Tim Anderson, Nick Castellanos and Adam Frazier. The function scrape_statcast_savant_batter() from the baseballr package extracts the data for each of these players from baseballsavant.mlb.com, given the corresponding batter ID and the start and end dates of the 2021 season.

After some data cleaning, I extracted the variables hc_x and hc_y, which are the x- and y-coordinates of the batted ball for each at-bat (observation), from the dataset of each player. I then created new variables hc_x_adjusted and hc_y_adjusted, which are modified values of hc_x and hc_y, using the formulas discussed by Bill Petti (2017):

\[hc\_x\_adjusted= hc\_x - 125.42, \tag{1}\]

\[hc\_y\_adjusted = 198.27 - hc\_y. \tag{2}\]

This adjustment shifts the x-coordinates and flips the y-coordinates so that the origin (0,0) is home plate and the field extends upward from it.

# retrieving the data for the ten players
TreaTurner <- scrape_statcast_savant_batter(start_date = "2021-04-01", end_date = '2021-10-03', batterid = "607208")
YuliGurriel <- scrape_statcast_savant_batter(start_date = "2021-04-01", end_date = '2021-10-03', batterid = "493329")
JuanSoto <- scrape_statcast_savant_batter(start_date = "2021-04-01", end_date = '2021-10-03', batterid = "665742")
MichaelBrantley <- scrape_statcast_savant_batter(start_date = "2021-04-01", end_date = '2021-10-03', batterid = "488726")
VladimirGuerrero <- scrape_statcast_savant_batter(start_date = "2021-04-01", end_date = '2021-10-03', batterid = "665489")
StarlingMarte <- scrape_statcast_savant_batter(start_date = "2021-04-01", end_date = '2021-10-03', batterid = "516782")
BryceHarper <- scrape_statcast_savant_batter(start_date = "2021-04-01", end_date = '2021-10-03', batterid = "547180")
TimAnderson <- scrape_statcast_savant_batter(start_date = "2021-04-01", end_date = '2021-10-03', batterid = "641313")
NickCastellanos <- scrape_statcast_savant_batter(start_date = "2021-04-01", end_date = '2021-10-03', batterid = "592206")
AdamFrazier <- scrape_statcast_savant_batter(start_date = "2021-04-01", end_date = '2021-10-03', batterid = "624428")
# sorting each dataset by game date (Statcast dates use the YYYY-MM-DD format)
TreaTurner <- TreaTurner[order(as.Date(TreaTurner$game_date, format="%Y-%m-%d")),]
YuliGurriel <- YuliGurriel[order(as.Date(YuliGurriel$game_date, format="%Y-%m-%d")),]
JuanSoto <- JuanSoto[order(as.Date(JuanSoto$game_date, format="%Y-%m-%d")),]
MichaelBrantley <- MichaelBrantley[order(as.Date(MichaelBrantley$game_date, format="%Y-%m-%d")),]
VladimirGuerrero <- VladimirGuerrero[order(as.Date(VladimirGuerrero$game_date, format="%Y-%m-%d")),]
StarlingMarte <- StarlingMarte[order(as.Date(StarlingMarte$game_date, format="%Y-%m-%d")),]
BryceHarper <- BryceHarper[order(as.Date(BryceHarper$game_date, format="%Y-%m-%d")),]
TimAnderson <- TimAnderson[order(as.Date(TimAnderson$game_date, format="%Y-%m-%d")),]
NickCastellanos <- NickCastellanos[order(as.Date(NickCastellanos$game_date, format="%Y-%m-%d")),]
AdamFrazier <- AdamFrazier[order(as.Date(AdamFrazier$game_date, format="%Y-%m-%d")),]

# creating a dataset for each player
TreaTurner <- TreaTurner %>% select(hc_x, hc_y, events, p_throws, launch_angle, launch_speed,launch_speed_angle, hit_distance_sc)
YuliGurriel <- YuliGurriel %>% select(hc_x, hc_y, events, p_throws, launch_angle, launch_speed,launch_speed_angle, hit_distance_sc)
JuanSoto <- JuanSoto %>% select(hc_x, hc_y, events, p_throws, launch_angle, launch_speed,launch_speed_angle, hit_distance_sc)
MichaelBrantley <- MichaelBrantley %>% select(hc_x, hc_y, events, p_throws, launch_angle, launch_speed,launch_speed_angle, hit_distance_sc)
VladimirGuerrero <- VladimirGuerrero %>% select(hc_x, hc_y, events, p_throws, launch_angle, launch_speed,launch_speed_angle, hit_distance_sc)
StarlingMarte <- StarlingMarte %>% select(hc_x, hc_y, events, p_throws, launch_angle, launch_speed,launch_speed_angle, hit_distance_sc)
BryceHarper <- BryceHarper %>% select(hc_x, hc_y, events, p_throws, launch_angle, launch_speed,launch_speed_angle, hit_distance_sc)
TimAnderson <- TimAnderson %>% select(hc_x, hc_y, events, p_throws, launch_angle, launch_speed,launch_speed_angle, hit_distance_sc)
NickCastellanos <- NickCastellanos %>% select(hc_x, hc_y, events, p_throws, launch_angle, launch_speed,launch_speed_angle, hit_distance_sc)
AdamFrazier <- AdamFrazier %>% select(hc_x, hc_y, events, p_throws, launch_angle, launch_speed,launch_speed_angle, hit_distance_sc)
# creating new variables hc_x_adjusted and hc_y_adjusted
# these adjusted coordinates place the origin (0, 0) at home plate
TreaTurner$hc_x_adjusted <- TreaTurner$hc_x - 125.42
TreaTurner$hc_y_adjusted <- 198.27 - TreaTurner$hc_y
YuliGurriel$hc_x_adjusted <- YuliGurriel$hc_x - 125.42
YuliGurriel$hc_y_adjusted <- 198.27 - YuliGurriel$hc_y
JuanSoto$hc_x_adjusted <- JuanSoto$hc_x - 125.42
JuanSoto$hc_y_adjusted <- 198.27 - JuanSoto$hc_y
MichaelBrantley$hc_x_adjusted <- MichaelBrantley$hc_x - 125.42
MichaelBrantley$hc_y_adjusted <- 198.27 - MichaelBrantley$hc_y
VladimirGuerrero$hc_x_adjusted <- VladimirGuerrero$hc_x - 125.42
VladimirGuerrero$hc_y_adjusted <- 198.27 - VladimirGuerrero$hc_y
StarlingMarte$hc_x_adjusted <- StarlingMarte$hc_x - 125.42
StarlingMarte$hc_y_adjusted <- 198.27 - StarlingMarte$hc_y
BryceHarper$hc_x_adjusted <- BryceHarper$hc_x - 125.42
BryceHarper$hc_y_adjusted <- 198.27 - BryceHarper$hc_y
TimAnderson$hc_x_adjusted <- TimAnderson$hc_x - 125.42
TimAnderson$hc_y_adjusted <- 198.27 - TimAnderson$hc_y
NickCastellanos$hc_x_adjusted <- NickCastellanos$hc_x - 125.42
NickCastellanos$hc_y_adjusted <- 198.27 - NickCastellanos$hc_y
AdamFrazier$hc_x_adjusted <- AdamFrazier$hc_x - 125.42
AdamFrazier$hc_y_adjusted <- 198.27 - AdamFrazier$hc_y
# removing rows with missing values (NAs)
TreaTurner <- na.omit(TreaTurner)
YuliGurriel <- na.omit(YuliGurriel)
JuanSoto <- na.omit(JuanSoto)
MichaelBrantley <- na.omit(MichaelBrantley)
VladimirGuerrero <- na.omit(VladimirGuerrero)
StarlingMarte <- na.omit(StarlingMarte)
BryceHarper <- na.omit(BryceHarper)
TimAnderson <- na.omit(TimAnderson)
NickCastellanos <- na.omit(NickCastellanos)
AdamFrazier <- na.omit(AdamFrazier)
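
The per-player steps above (selecting columns, shifting the coordinates with equations (1) and (2), and dropping incomplete rows) repeat the same operations for each of the ten players. A minimal sketch of a single helper that performs them is given below. The function name `prepare_batted_balls` and the toy data frame are illustrative assumptions, not part of the original analysis, and only three of the selected columns are kept for brevity.

```r
# Hypothetical helper: applies equations (1) and (2) and drops incomplete rows.
prepare_batted_balls <- function(df) {
  keep <- c("hc_x", "hc_y", "events")
  out <- df[, keep, drop = FALSE]
  out$hc_x_adjusted <- out$hc_x - 125.42   # equation (1)
  out$hc_y_adjusted <- 198.27 - out$hc_y   # equation (2)
  na.omit(out)                             # remove rows with any missing value
}

# Toy input standing in for a Statcast download (values are made up).
toy <- data.frame(
  hc_x   = c(125.42, 100.00, NA),
  hc_y   = c(198.27, 150.00, 120.00),
  events = c("single", "field_out", "strikeout")
)

prepared <- prepare_batted_balls(toy)
prepared
```

In this toy input, the first row sits exactly at the reference point (125.42, 198.27) and is therefore mapped to the origin, while the third row is dropped because hc_x is missing.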

Figure 1 shows the batted balls for Trea Turner, the batting average leader for the 2021 season with a .328 batting average, plotted on the xy-plane. The V-shaped lines represent the foul lines on the baseball field. The color of each dot indicates the outcome of the at-bat. There is a clear cluster of points on the left side of the baseball field. In addition, most of the home runs he hit landed on the left side of the outfield.

# simple visualization example (with the grid)
TreaTurner_plot_grid <- TreaTurner %>%
  ggplot(aes(x = hc_x_adjusted, y = hc_y_adjusted)) +
  geom_segment(x = 0, xend = -98, y = 0, yend = 98, size = 0.7, color = "grey66", lineend = "round") +
  geom_segment(x = 0, xend = 98, y = 0, yend = 98, size = 0.7, color = "grey66", lineend = "round") +
  geom_curve(x = -48, xend = 49, y = 46, yend = 45,
             curvature = -0.64, linetype = "dotted", color = "grey66") +
  coord_fixed() +
  geom_point(aes(color = events), alpha = 0.5, size = 2) +
  scale_x_continuous(limits = c(-124, 121)) +
  scale_y_continuous(limits = c(-19, 195)) +
  labs(title = "Batted Balls for Trea Turner in 2021 MLB Season",
       color = "events",
       x = "x-coordinate",
       y = "y-coordinate")+
  theme(plot.title = element_text(hjust = 0.5)) 
TreaTurner_plot_grid

Figure 1: Batted balls for Trea Turner in 2021 season on xy-plane

2.1 Creating Point Pattern Dataset

In order to analyze how the players hit during the 2021 season with spatial statistics, a point pattern dataset was created for each player using the function ppp() from the spatstat package and the x- and y-coordinates of the batted balls. In addition to the coordinates, the ranges of the x- and y-coordinates must be specified so that all the points are included in the dataset. The range of the x-coordinates is [-124, 121], determined from the leftmost and rightmost points in the datasets of the ten players. The range of the y-coordinates is [-19, 195], determined from the lowest and highest points in the datasets of the ten players. Therefore, the observation window of each point pattern dataset is 245 units by 214 units. These point pattern datasets are used for the quadrat test for complete spatial randomness, for visualizations such as heat maps and contour maps, and for the analysis of second-order properties in the later sections.

# creating point pattern dataset for each player 
xrange <- range(-124, 121)
yrange <- range(-19, 195)
# Trea Turner
TT_ppp <- ppp(TreaTurner$hc_x_adjusted, TreaTurner$hc_y_adjusted, xrange, yrange)
# Yuli Gurriel
YG_ppp <- ppp(YuliGurriel$hc_x_adjusted, YuliGurriel$hc_y_adjusted, xrange, yrange)
# Juan Soto
JS_ppp <- ppp(JuanSoto$hc_x_adjusted, JuanSoto$hc_y_adjusted, xrange, yrange)
# Michael Brantley
MB_ppp <- ppp(MichaelBrantley$hc_x_adjusted, MichaelBrantley$hc_y_adjusted, xrange, yrange)
# Vladimir Guerrero
VG_ppp <- ppp(VladimirGuerrero$hc_x_adjusted, VladimirGuerrero$hc_y_adjusted, xrange, yrange)
# Starling Marte
SM_ppp <- ppp(StarlingMarte$hc_x_adjusted, StarlingMarte$hc_y_adjusted, xrange, yrange)
# Bryce Harper
BH_ppp <- ppp(BryceHarper$hc_x_adjusted, BryceHarper$hc_y_adjusted, xrange, yrange)
# Tim Anderson
TA_ppp <- ppp(TimAnderson$hc_x_adjusted, TimAnderson$hc_y_adjusted, xrange, yrange)
# Nick Castellanos
NC_ppp <- ppp(NickCastellanos$hc_x_adjusted, NickCastellanos$hc_y_adjusted, xrange, yrange)
# Adam Frazier
AF_ppp <- ppp(AdamFrazier$hc_x_adjusted, AdamFrazier$hc_y_adjusted, xrange, yrange)

3 Methods

In this section, I discuss how the point pattern datasets of the batters are analyzed.

3.1 Test for Complete Spatial Randomness

The first analysis of the point pattern datasets of the batted balls is the test for complete spatial randomness. To conduct this test, the study area where the points are plotted is divided into quadrats (rectangular cells) of equal size. Then, the number of points in each quadrat is counted to compute the intensity. The test statistic is then calculated from these counts using the following formula:

\[\frac{(m-1)s^2}{\bar{x}}, \tag{3} \]

where \(m\) is the number of quadrats, \(s^2\) is the observed variance of the quadrat counts and \(\bar{x}\) is their observed mean. This test statistic is then compared with the \(\chi^2\)-distribution with \(m-1\) degrees of freedom.
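
As an illustration of equation (3), the following sketch computes the test statistic by hand in base R for a simulated, deliberately clustered pattern on the unit square. The simulated data, the helper name `quadrat_counts`, and the 3 by 3 grid are assumptions made for this example only; they are not the batted-ball data.

```r
set.seed(1)

# Simulated, deliberately clustered points in the unit square (made-up data).
x <- c(runif(50), rnorm(150, mean = 0.3, sd = 0.05))
y <- c(runif(50), rnorm(150, mean = 0.7, sd = 0.05))
inside <- x >= 0 & x <= 1 & y >= 0 & y <= 1
x <- x[inside]
y <- y[inside]

# Hypothetical helper: counts points in a k-by-k grid of equal-sized quadrats.
quadrat_counts <- function(x, y, k) {
  breaks <- seq(0, 1, length.out = k + 1)
  ix <- cut(x, breaks, include.lowest = TRUE)
  iy <- cut(y, breaks, include.lowest = TRUE)
  as.vector(table(ix, iy))   # empty quadrats are kept as zero counts
}

counts <- quadrat_counts(x, y, k = 3)
m <- length(counts)                                  # number of quadrats
statistic <- (m - 1) * var(counts) / mean(counts)    # equation (3)
critical  <- qchisq(0.95, df = m - 1)                # critical value at alpha = 0.05
p_value   <- pchisq(statistic, df = m - 1, lower.tail = FALSE)

c(statistic = statistic, critical = critical, p_value = p_value)
```

For quadrats of equal area, this statistic coincides with the Pearson chi-squared statistic computed from the observed counts and their common expected value, because \((m-1)s^2 = \sum_i (x_i - \bar{x})^2\).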

3.2 Heat Maps and Contour Maps

Heat maps and contour maps are visualizations for spatial data that help identify where the points are concentrated. To create heat maps and contour maps in R, the bandwidth for the kernel estimate of the intensity must be determined for the point pattern dataset of each player. The function bw.diggle() in the spatstat package calculates the bandwidth that minimizes the mean squared error.

The algorithm in this function utilizes the method of Berman and Diggle (1989) to calculate the quantity

\[M(\sigma) = \frac{MSE(\sigma)}{\lambda^2} - g(0), \tag{4}\]

as a function of the bandwidth \(\sigma\), where \(MSE(\sigma)\) is the mean squared error at bandwidth \(\sigma\), \(\lambda\) is the mean intensity, and \(g\) is the pair correlation function.

Figure 2 shows the mean squared error for each value of the bandwidth for the point pattern data for Trea Turner. The blue dotted line indicates the optimal bandwidth for the kernel estimate of the intensity in the point pattern dataset of the batted balls for Trea Turner. Based on Figure 2, the optimal bandwidth is between 4 and 6 (see Table 2 in the Results section for the actual value).

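# mserw_TT is assumed to come from bw.diggle(); its original definition is not shown
mserw_TT <- bw.diggle(TT_ppp)
par(mar = c(5, 5, 1, .2), adj = 0.5)
plot(mserw_TT, main = NULL, xlab = "Bandwidth", ylab = "Mean Squared Error")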

Figure 2: Mean squared error at different bandwidths for point pattern dataset for Trea Turner
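
The bandwidth selection described in this section can be sketched end to end as follows. The example uses a simulated pattern in place of a player's batted balls, and the object names (`win`, `toy_ppp`, `toy_bw`, `toy_density`) are illustrative assumptions; only the window ranges are taken from Section 2.1.

```r
library(spatstat)

set.seed(1)

# Observation window with the ranges used in Section 2.1.
win <- owin(xrange = c(-124, 121), yrange = c(-19, 195))

# Simulated clustered points (made-up data), restricted to the window.
x <- rnorm(300, mean = -30, sd = 25)
y <- rnorm(300, mean = 80, sd = 30)
ok <- x > -124 & x < 121 & y > -19 & y < 195
toy_ppp <- ppp(x[ok], y[ok], window = win)

# Bandwidth chosen by the Berman and Diggle (1989) criterion.
toy_bw <- bw.diggle(toy_ppp)
plot(toy_bw)   # criterion as a function of bandwidth, with the optimum marked

# Kernel estimate of the intensity at the selected bandwidth.
toy_density <- density(toy_ppp, sigma = as.numeric(toy_bw))
plot(toy_density, main = "Toy pattern")
```

Passing the selected bandwidth directly to density() avoids copying numeric values by hand, which may be convenient when the same steps are repeated for several players.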

4 Results

4.1 Test for Complete Spatial Randomness

Figure 3 shows the quadrat test for the point pattern dataset for Trea Turner on grids of three different sizes: 3 by 3 (left), 4 by 4 (center), and 5 by 5 (right). In each plot, the points are concentrated in a few quadrats, which suggests that this dataset does not exhibit complete spatial randomness.

Figure 3: Quadrat test at different grid sizes for point pattern dataset for Trea Turner

Table 1 summarizes the quadrat test for the point pattern dataset for Trea Turner on the three grid sizes. Because each test statistic in Table 1 far exceeds the corresponding critical value at \(\alpha = 0.05\), the null hypothesis that the point pattern data for Trea Turner exhibit complete spatial randomness is rejected at the 5% significance level. This result implies that there is a cluster of points in the dataset (as noted for Figure 3) and that there is a spatial pattern in the way he hits. I obtained the same results from the tests for the point pattern datasets of the other nine batters.

kbl(table1, booktabs = T) %>% kable_styling(latex_options =c("HOLD_position", "striped")) %>% footnote(general = "Table 1: Summary of Quadrat Test for Point Pattern Dataset for Trea Turner", general_title = "")

Grid Size	3 x 3	4 x 4	5 x 5
Degrees of Freedom	8	15	24
Test Statistic	481.99	560.62	842
Critical Value	15.507	24.996	36.415
Table 1: Summary of Quadrat Test for Point Pattern Dataset for Trea Turner

# creating densities with the optimal bandwidths
TT_density <- density(TT_ppp, sigma =  4.868395)
YG_density <- density(YG_ppp, sigma =  5.496575)
JS_density <- density(JS_ppp, sigma =  6.438845)
MB_density <- density(MB_ppp, sigma =  4.71135)
VG_density <- density(VG_ppp, sigma =  6.648239)
SM_density <- density(SM_ppp, sigma =  5.077789)
BH_density <- density(BH_ppp, sigma =  7.799902)
TA_density <- density(TA_ppp, sigma =  3.926125)
NC_density <- density(NC_ppp, sigma =  5.444227)
AF_density <- density(AF_ppp, sigma =  4.71135)

4.2 Heat Maps

Table 2 summarizes the optimal bandwidth for the kernel estimation for the intensity in the point pattern dataset of the batted balls of each player.

kbl(table, booktabs = T) %>% kable_styling(latex_options = c("HOLD_position", "striped")) %>% footnote(general = "Table 2: Bandwidths for Kernel Estimation for Point Pattern Data for 10 Hitters", general_title = "")

Player Name	Bandwidth
Trea Turner	4.86839530332681
Yuli Gurriel	5.49657534246575
Juan Soto	6.43884540117417
Michael Brantley	4.71135029354207
Vladimir Guerrero Jr.	6.64823874755382
Starling Marte	5.07778864970646
Bryce Harper	7.79990215264188
Tim Anderson	3.9261252446184
Nick Castellanos	5.44422700587084
Adam Frazier	4.71135029354207
Table 2: Bandwidths for Kernel Estimation for Point Pattern Data for 10 Hitters

Figure 4 shows the heat maps for all ten players. The V-shaped lines represent the foul lines on the baseball field. The areas with a high density of points are represented by darker colors. Based on the heat maps, the right-handed hitters (Trea Turner, Yuli Gurriel, Vladimir Guerrero Jr., Starling Marte, Tim Anderson and Nick Castellanos) tend to hit to the left side of the baseball field, as there are clusters of points on the left side of the maps. On the other hand, the left-handed hitters (Juan Soto, Michael Brantley, Bryce Harper, and Adam Frazier) tend to hit to the right side of the baseball field, as there are clusters of points on the right side of the maps.

# heat maps
par(mfrow = c(2, 5), mar = c(3, 3, 1, .2), adj = 0.5)
plot(TT_density, main = "Trea Turner", col = LegendColors)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
plot(YG_density, main = "Yuli Gurriel", col = LegendColors)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
plot(JS_density, main = "Juan Soto", col = LegendColors)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
plot(MB_density, main = "Michael Brantley", col = LegendColors)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
plot(VG_density, main = "Vladimir Guerrero Jr.", col = LegendColors)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
plot(SM_density, main = "Starling Marte", col = LegendColors)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
plot(BH_density, main = "Bryce Harper", col = LegendColors)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
plot(TA_density, main = "Tim Anderson", col = LegendColors)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
plot(NC_density, main = "Nick Castellanos", col = LegendColors)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
plot(AF_density, main = "Adam Frazier", col = LegendColors)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98)

Figure 4: Heat maps of the hitters. Clusters of points are found on the left side of the baseball field for the right-handed hitters (Trea Turner, Yuli Gurriel, Vladimir Guerrero Jr., Starling Marte, Tim Anderson and Nick Castellanos). Clusters of points are found on the right side of the baseball field for the left-handed hitters (Juan Soto, Michael Brantley, Bryce Harper, and Adam Frazier).

4.3 Contour Maps

Figure 5 shows the contour maps for all ten players. The V-shaped lines represent the foul lines on the baseball field. The areas with a high density of points are represented by darker colors. The contour lines separate the areas with different densities of points. As in the heat maps, clusters of points are found on the left side of the baseball field for the right-handed hitters and on the right side of the baseball field for the left-handed hitters. Therefore, both the heat maps and the contour maps identify the same pattern in the way these players hit.

# contour maps
par(mfrow = c(2,5), mar = c(3, 3, 1, .2), adj = 0.5)
# Trea Turner
plot(density(TT_ppp), main = "Trea Turner", col = LegendColors2)
contour(density(TT_ppp), add = T)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
# Yuli Gurriel
plot(density(YG_ppp), main = "Yuli Gurriel", col = LegendColors2)
contour(density(YG_ppp), add = T)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
# Juan Soto
plot(density(JS_ppp), main = "Juan Soto", col = LegendColors2)
contour(density(JS_ppp), add = T)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
# Michael Brantley
plot(density(MB_ppp), main = "Michael Brantley", col = LegendColors2)
contour(density(MB_ppp), add = T)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
# Vladimir Guerrero
plot(density(VG_ppp), main = "Vladimir Guerrero", col = LegendColors2)
contour(density(VG_ppp), add = T)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
# Starling Marte
plot(density(SM_ppp), main = "Starling Marte", col = LegendColors2)
contour(density(SM_ppp), add = T)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
# Bryce Harper
plot(density(BH_ppp), main = "Bryce Harper", col = LegendColors2)
contour(density(BH_ppp), add = T)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
# Tim Anderson
plot(density(TA_ppp),main = "Tim Anderson", col = LegendColors2)
contour(density(TA_ppp), add = T)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
# Nick Castellanos
plot(density(NC_ppp),main = "Nick Castellanos", col = LegendColors2)
contour(density(NC_ppp), add = T)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98) 
# Adam Frazier
plot(density(AF_ppp),main = "Adam Frazier", col = LegendColors2)
contour(density(AF_ppp), add = T)
segments(0, 0, -98, 98) 
segments(0, 0, 98, 98)

Figure 5: Contour maps for the ten hitters

5 Analysis of Second-Order Properties

5.1 G-function

To analyze the second-order properties of the point pattern datasets of the batted balls, I constructed G-, F-, K- and L-functions using the envelope() function from the spatstat package. The G-function describes the distribution of the distances from an arbitrary event to its nearest event and it is defined as:

\[ G(r) = \frac{\#[r_{min}(s_{i})<r]}{n}, \tag{5}\]

where the numerator is the number of events \(s_i\) whose nearest-neighbor distance satisfies \(r_{min}(s_i) < r\) and the denominator \(n\) is the number of events in the study area.
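
Equation (5) can be evaluated directly in base R. The sketch below does so for simulated events on the unit square and compares the result with the theoretical G-function under complete spatial randomness, \(G(r) = 1 - e^{-\lambda \pi r^2}\). The simulated data and the name `G_hat` are assumptions for illustration, and no edge correction is applied, unlike the estimators in spatstat.

```r
set.seed(1)

# Made-up event locations in the unit square.
events <- cbind(x = runif(100), y = runif(100))

# Nearest-neighbour distance r_min(s_i) for every event.
d <- as.matrix(dist(events))
diag(d) <- Inf                      # ignore the distance from an event to itself
r_min <- apply(d, 1, min)

# Equation (5): proportion of events whose nearest neighbour is closer than r.
G_hat <- function(r) sapply(r, function(ri) mean(r_min < ri))

r <- seq(0, 0.15, by = 0.01)
lambda <- nrow(events) / 1          # intensity: n divided by the area of the unit square
G_csr <- 1 - exp(-lambda * pi * r^2)

cbind(r, G_hat = G_hat(r), G_csr)
```

Values of `G_hat` well above `G_csr` at short distances would point to clustering, since events would then have closer neighbours than expected under randomness.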

# plot of G-function
par(mfrow=c(2,5), mar = c(2, 2, 1, .2), adj = 0.5)
plot(TT_Gfunction, main = "Trea Turner")
plot(YG_Gfunction, main = "Yuli Gurriel")
plot(JS_Gfunction, main = "Juan Soto")
plot(MB_Gfunction, main = "Michael Brantley")
plot(VG_Gfunction, main = "Vladimir Guerrero")
plot(SM_Gfunction, main = "Starling Marte")
plot(BH_Gfunction, main = "Bryce Harper")
plot(TA_Gfunction, main = "Tim Anderson")
plot(NC_Gfunction, main = "Nick Castellanos")
plot(AF_Gfunction, main = "Adam Frazier")

Figure 6: Envelopes and Empirical G-function for Point Pattern Dataset of 10 hitters

Figure 6 above shows the empirical function \(\hat{G}(r)\) plotted against the theoretical \(G(r)\) under complete spatial randomness, together with the 96% pointwise envelopes. Because \(\hat{G}(r)\) lies above the 96% pointwise envelopes, events are closer to their nearest neighbors than expected under randomness, so there is a clustered pattern in the datasets of the hitters.

5.2 F-function

The F-function describes the distribution of the distances from an arbitrary point of the plane to its nearest event and it is defined as:

\[ F(r) = \frac{\#[r_{min}(p_{i}, s)<r]}{m}, \tag{6}\]

where the numerator is the number of sample points \(p_i\) whose distance to the nearest event satisfies \(r_{min}(p_i, s) < r\) and the denominator \(m\) is the number of sample points.
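
A base R sketch of equation (6) is given below. It places sample points on a regular grid over the unit square and measures the distance from each sample point to its nearest simulated event. The simulated data, the grid spacing, and the names `nearest_event` and `F_hat` are illustrative assumptions, and no edge correction is applied.

```r
set.seed(1)

# Made-up event locations in the unit square.
events <- cbind(x = runif(100), y = runif(100))

# m sample points p_i on a regular 10-by-10 grid over the study area.
grid <- as.matrix(expand.grid(x = seq(0.05, 0.95, by = 0.1),
                              y = seq(0.05, 0.95, by = 0.1)))

# Distance from a sample point to its nearest event.
nearest_event <- function(p) {
  min(sqrt((events[, "x"] - p[1])^2 + (events[, "y"] - p[2])^2))
}
r_min <- apply(grid, 1, nearest_event)

# Equation (6): proportion of sample points with an event closer than r.
F_hat <- function(r) sapply(r, function(ri) mean(r_min < ri))

r <- seq(0, 0.15, by = 0.01)
cbind(r, F_hat = F_hat(r))
```

In a clustered pattern, large parts of the study area are far from any event, so `F_hat` rises more slowly than it would under complete spatial randomness.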

# plot of F-functions
par(mfrow=c(2,5), mar = c(2, 2, 1, .2), adj = 0.5)
plot(TT_Ffunction, main = "Trea Turner")
plot(YG_Ffunction, main = "Yuli Gurriel")
plot(JS_Ffunction, main = "Juan Soto")
plot(MB_Ffunction, main = "Michael Brantley")
plot(VG_Ffunction, main = "Vladimir Guerrero")
plot(SM_Ffunction, main = "Starling Marte")
plot(BH_Ffunction, main = "Bryce Harper")
plot(TA_Ffunction, main = "Tim Anderson")
plot(NC_Ffunction, main = "Nick Castellanos")
plot(AF_Ffunction, main = "Adam Frazier")

Figure 7: Envelopes and Empirical F-function for Point Pattern Dataset of 10 hitters

Figure 7 above shows the empirical function \(\hat{F}(r)\) plotted against the theoretical \(F(r)\) under complete spatial randomness, together with the 96% pointwise envelopes. Because \(\hat{F}(r)\) lies below the 96% pointwise envelopes, arbitrary points of the field are farther from their nearest event than expected under randomness, so there is a clustered pattern in the datasets of the hitters.

5.3 K-function

The K-function describes the expected number of events found within a given distance of any particular event, scaled by the intensity, and it is defined as:

\[ K(h) = \frac{E[\#(\text{events within distance } h \text{ of a randomly chosen event})]}{\lambda}, \tag{7}\]

where \(E[\cdot]\) is the expectation and \(\lambda\) is the intensity of events.

Figure 8: Envelopes and Empirical K-function for Point Pattern Dataset of 10 hitters

Figure 8 above shows the empirical function \(\hat{K}(h)\) plotted against the theoretical \(K(h)\) under complete spatial randomness, together with the 96% pointwise envelopes. Because \(\hat{K}(h)\) lies outside the 96% pointwise envelopes, with more events within distance \(h\) of a typical event than expected under randomness, there is a clustered pattern in the datasets of the hitters.

5.4 L-function

The L-function is a standardized version of the K-function. It is defined as:

\[ L(h) = \sqrt{\frac{K(h)}{\pi}} - h,\tag{8}\]

where \(K(h)\) is the K-function and \(h\) is the distance. Under complete spatial randomness, \(K(h) = \pi h^2\), so \(L(h) = 0\) for all \(h\).
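
Equations (7) and (8) can be illustrated with a naive estimator in base R. The sketch below replaces the expectation in equation (7) with the average number of other events within distance \(h\) of each event. The simulated data and the names `K_hat` and `L_hat` are assumptions for this example, and no edge correction is applied, so the estimate is biased slightly downward near the boundary.

```r
set.seed(1)

# Made-up, completely random event locations in the unit square.
events <- cbind(x = runif(200), y = runif(200))
n <- nrow(events)
area <- 1
lambda <- n / area                  # estimated intensity

d <- as.matrix(dist(events))
diag(d) <- Inf                      # an event is not its own neighbour

# Equation (7): average number of neighbours within h, divided by lambda.
K_hat <- function(h) sapply(h, function(hi) sum(d <= hi) / (n * lambda))

# Equation (8): standardized version, equal to 0 under complete spatial randomness.
L_hat <- function(h) sqrt(K_hat(h) / pi) - h

h <- seq(0, 0.1, by = 0.01)
cbind(h, K_hat = K_hat(h), K_csr = pi * h^2, L_hat = L_hat(h))
```

For this random pattern `L_hat` stays close to zero, whereas a clustered pattern would give clearly positive values at short distances.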

Figure 9: Envelopes and Empirical L-function for Point Pattern Dataset of 10 hitters

Figure 9 shows the empirical function \(\hat{L}(h)\) plotted against the theoretical \(L(h)\) under complete spatial randomness, together with the 96% pointwise envelopes. Because \(\hat{L}(h)\) lies outside the 96% pointwise envelopes, on the side indicating more neighboring events than expected under randomness, there is a clustered pattern in the datasets of the hitters.

The analyses of the second-order properties with the G-, F-, K- and L-functions confirm the results from the test for complete spatial randomness of the point pattern dataset of the ten hitters in the Results section.
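
The code that builds the envelope objects plotted in this section is not shown, so the following is only a sketch of how such envelopes might be produced with spatstat. It uses a simulated pattern, and the settings nsim = 99 and nrank = 2 are an assumption, chosen because they give a pointwise significance level of 2 * nrank / (1 + nsim) = 0.04, which matches the 96% envelopes described above.

```r
library(spatstat)

set.seed(1)

# Simulated clustered pattern (made-up data) in the window from Section 2.1.
win <- owin(xrange = c(-124, 121), yrange = c(-19, 195))
x <- rnorm(300, mean = -30, sd = 25)
y <- rnorm(300, mean = 80, sd = 30)
ok <- x > -124 & x < 121 & y > -19 & y < 195
toy_ppp <- ppp(x[ok], y[ok], window = win)

# Assumed simulation settings (not confirmed by the original analysis).
nsim <- 99
nrank <- 2
alpha <- 2 * nrank / (1 + nsim)     # 0.04, i.e. 96% pointwise envelopes

# Envelopes from simulations of complete spatial randomness.
toy_G <- envelope(toy_ppp, fun = Gest, nsim = nsim, nrank = nrank, verbose = FALSE)
toy_L <- envelope(toy_ppp, fun = Lest, nsim = nsim, nrank = nrank, verbose = FALSE)

plot(toy_G, main = "Toy pattern: G-function")
# Lest() returns sqrt(K(r) / pi); subtracting r gives the form in equation (8).
plot(toy_L, . - r ~ r, main = "Toy pattern: L-function")
```

The same calls with `fun = Fest` and `fun = Kest` would give the corresponding F- and K-function envelopes.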

6 Discussion

Spatial statistics help us understand hitters’ performance in depth. The rejection of the null hypothesis in the test for complete spatial randomness and the analyses of the G-, F-, K- and L-functions indicate that there is a clustered pattern in the batted balls. In addition, the heat maps and contour maps of the batted balls reveal the spatial patterns of the top ten hitters in batting average, as they show clusters of points. The left-handed hitters tend to hit to the right side of the baseball field and the right-handed hitters tend to hit to the left side of the baseball field. In future work, I will consider analyzing the spatial data of the hitters together with an additional variable, such as the launch angle or the exit velocity of the batted ball, and determine whether these variables have spatial dependence in order to examine hitters’ abilities.

7 References

Berman, M., & Diggle, P. (1989). Estimating weighted integrals of the second-order intensity of a spatial point process. Journal of the Royal Statistical Society, Series B, 51, 81–92.

Cross, J., & Sylvan, D. (2015). Modeling spatial batting ability using a known covariance matrix. Journal of Quantitative Analysis in Sports, 11, 155–167.

Diggle, P. J. (1985). A kernel method for smoothing point process data. Applied Statistics (Journal of the Royal Statistical Society, Series C), 34, 138–147.

Diggle, P. J. (2003). Statistical analysis of spatial point patterns (2nd ed.). Arnold.

Petti, B. (2018, January 23). Research notebook: New format for Statcast Data Export at baseball savant. The Hardball Times. Retrieved December 16, 2022, from https://tht.fangraphs.com/research-notebook-new-format-for-statcast-data-export-at-baseball-savant/

Spatial Analysis of Baseball’s Best Hitters

Hisanobu Kaji

2023-07-06