Introduction

In this module we have focused on the fact that R’s usefulness in data science comes largely from its package ecosystem. ggshakeR is a good example: it builds on ggplot2 and adds football-specific chart functions that would otherwise take dozens of lines to write from scratch.


Part 1 - Public Data Source: StatsBomb Open Data

What is it?

StatsBomb is one of the leading football data companies, known for detailed event-level data collection. Their Open Data initiative makes a substantial subset of this data freely available on GitHub.

Repository: https://github.com/statsbomb/open-data

What data is available?

Coverage has grown over the years and currently includes competitions like UEFA Euro 2020, FIFA Men’s and Women’s World Cups (including 2022 and 2023), multiple La Liga seasons of Messi at Barcelona, several FA Women’s Super League seasons, Champions League finals, and the Indian Super League. The exact list changes as StatsBomb releases new data, so it is worth checking FreeCompetitions() in R rather than relying on any static list.
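A minimal sketch of that lookup, assuming the data frame returned by FreeCompetitions() has competition_id, season_id, competition_name, and season_name columns. The helper find_competitions() is hypothetical (it is not part of StatsBombR), and competitions_demo is a two-row stand-in so the example runs offline; its IDs are the ones used later in this document.

```r
# Stand-in for the data frame returned by StatsBombR::FreeCompetitions();
# only two illustrative rows, using the IDs that appear later in this document
competitions_demo <- data.frame(
  competition_id   = c(72, 11),
  season_id        = c(107, 42),
  competition_name = c("Women's World Cup", "La Liga"),
  season_name      = c("2023", "2019/2020"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive search on the competition name
find_competitions <- function(competitions, pattern) {
  hits <- grepl(pattern, competitions$competition_name, ignore.case = TRUE)
  competitions[hits, c("competition_id", "season_id",
                       "competition_name", "season_name")]
}

wwc_rows <- find_competitions(competitions_demo, "world cup")
print(wwc_rows)

# With the real data this would be:
# find_competitions(StatsBombR::FreeCompetitions(), "world cup")
```

Searching by name and then reading off the IDs avoids hard-coding a competition that may have been added or renumbered since the list was last checked.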

Each competition contains match-level and event-level data. Events include passes, shots, dribbles, pressures, ball receipts, and more, each with spatial X/Y coordinates, timestamps, outcome information, and in some cases StatsBomb 360 freeze-frame data showing every player’s position at the moment of the event.

Why is this valuable?

For someone starting out, the appeal is simple. You get real event-level data with coordinates, in the same format professional clubs and scouts work with, served straight from GitHub without registration or an API key. It is the same dataset we used throughout Module 9 for pass maps, shot maps, game flow charts, and similarity algorithms, so it fits naturally with what the module already teaches.


Part 2 - Specialized R Library: ggshakeR

What is ggshakeR?

ggshakeR is an open-source R package developed by Abhishek Amol Mishra and contributors, built specifically for football data visualization. It sits on top of ggplot2 and provides high-level functions that produce common football analytics charts with minimal setup.

Why ggshakeR?

The module argues that visualization is a critical step in any analytical process, since a well-designed chart communicates findings much more clearly than a table of numbers. ggshakeR is built around that idea. It lets an analyst go from a raw dataframe to a readable football chart without writing pitch-drawing code from scratch.

In practice the main advantages over raw ggplot2 are that pitch rendering is built into the chart functions, functions that take a data_type argument (such as plot_passnet()) handle StatsBomb coordinates internally, and common chart types like shot maps, pass networks, and heatmaps are ready to use. Some functions, such as plot_shot(), still expect pre-scaled input (see Example 1). Overall there is much less boilerplate, which speeds up exploratory work.

Main functions

Function            Description
plot_shot()         Shot maps with xG encoding: point, hexbin, or density
plot_pass()         Pass maps with directional arrows, highlighting progressive passes, crosses, switches
plot_heatmap()      Spatial heatmaps of touch or event locations
plot_passnet()      Pass network diagrams showing team structure per match
plot_sonar()        Pass direction sonar charts
plot_convexhull()   Convex hull territory plots per player or team
plot_voronoi()      Voronoi diagrams showing territorial dominance
plot_pizza()        Percentile pizza plots for player profiles (requires FBref data)

Part 3 - Practical Examples

Three examples are presented, all using StatsBomb Open Data:

  • Shot map with plot_shot(): Spain Women’s at the FIFA Women’s World Cup 2023
  • Pass network with plot_passnet(): both teams in the Barcelona vs Real Madrid Clasico, La Liga 2019/20
  • Heatmap with plot_heatmap(): Messi’s touch distribution in the same Clasico

The Module 9 exercises used the same datasets but built all charts manually with raw ggplot2 and SBpitch. These examples show how ggshakeR shortens that process.

Install and load packages

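# One-time installation (ggshakeR and StatsBombR are installed from GitHub)
install.packages("devtools")
devtools::install_github("abhiamishra/ggshakeR")
devtools::install_github("statsbomb/StatsBombR")
install.packages("tidyverse")
install.packages("ggsoccer")
install.packages("hexbin")  # required by geom_hex() in Example 3

# Load packages
library(ggshakeR)
library(StatsBombR)
library(tidyverse)
library(ggsoccer)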

Load all data

Both datasets are loaded upfront so the visualization chunks stay clean.

# --- Women's World Cup 2023 ---
competitions <- FreeCompetitions()
## [1] "Whilst we are keen to share data and facilitate research, we also urge you to be responsible with the data. Please credit StatsBomb as your data source when using the data and visit https://statsbomb.com/media-pack/ to obtain our logos for public use."
wwc <- competitions %>%
  filter(competition_id == 72, season_id == 107)

matches_wwc  <- FreeMatches(wwc)
events_wwc   <- free_allevents(MatchesDF = matches_wwc, Parallel = TRUE)
events_wwc   <- allclean(events_wwc)

cat("WWC 2023 events loaded:", nrow(events_wwc), "\n")
## WWC 2023 events loaded: 226146

# --- Barcelona vs Real Madrid, La Liga 2019/20 ---
# competition_id 11 = La Liga, season_id 42 = 2019/2020
laliga_1920 <- competitions %>%
  filter(competition_id == 11, season_id == 42)

matches_laliga <- FreeMatches(laliga_1920)
clasico <- matches_laliga %>%
  filter(
    home_team.home_team_name == "Barcelona",
    away_team.away_team_name == "Real Madrid"
  )

events_clasico <- free_allevents(MatchesDF = clasico, Parallel = FALSE)
events_clasico <- allclean(events_clasico)

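# Rename the location columns to the x, y, finalX, finalY names
# used by the plot_passnet() calls in Example 2
events_clasico_net <- events_clasico %>%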
  rename(
    x      = location.x,
    y      = location.y,
    finalX = pass.end_location.x,
    finalY = pass.end_location.y
  )

cat("Clasico events loaded:", nrow(events_clasico), "\n")
## Clasico events loaded: 4158

Example 1 - Shot Map: Spain Women’s (WWC 2023)

plot_shot() was originally designed for Understat data, where coordinates are in the 0-1 range. StatsBomb uses a 120x80 pitch, so coordinates need to be normalised before being passed to the function.
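As a quick sanity check of that rescaling, the sketch below divides by the pitch dimensions quoted above. The helper name normalise_statsbomb() is hypothetical (it is not part of ggshakeR or StatsBombR), and the two input points are illustrative: the penalty spot, 12 yards from the goal line at x = 108, and the centre spot.

```r
# Hypothetical helper: rescale StatsBomb coordinates (120 x 80) to the 0-1 range
normalise_statsbomb <- function(x, y, pitch_length = 120, pitch_width = 80) {
  data.frame(X = x / pitch_length, Y = y / pitch_width)
}

# Penalty spot (108, 40) and centre spot (60, 40) in StatsBomb coordinates
scaled <- normalise_statsbomb(x = c(108, 60), y = c(40, 40))
print(scaled)
```

Both axes are scaled independently, so a point on the centre line of the pitch (y = 40) always lands on Y = 0.5 regardless of how far up the pitch it is.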

# Prepare Spain shots, normalising StatsBomb coordinates to 0-1
spain_shots <- events_wwc %>%
  filter(
    team.name == "Spain Women's",
    type.name == "Shot"
  ) %>%
  mutate(
    X      = location.x / 120,
    Y      = location.y / 80,
    xG     = shot.statsbomb_xg,
    result = shot.outcome.name,
    player = player.name
  ) %>%
  select(X, Y, xG, result, player)

# Point shot map: each dot is one shot, size = xG
plot_shot(
  data             = spain_shots,
  type             = "point",
  highlight_goals  = TRUE,
  average_location = FALSE
) +
  labs(
    title    = "Spain Women's - Shot Map",
    subtitle = "FIFA Women's World Cup 2023 | All matches | Point size = xG",
    caption  = "Data: StatsBomb Open Data | Visualization: ggshakeR"
  )

# Density shot map: shows zones of highest shooting frequency
plot_shot(
  data = spain_shots,
  type = "density"
) +
  labs(
    title    = "Spain Women's - Shot Density",
    subtitle = "Shooting frequency by pitch zone | FIFA Women's World Cup 2023",
    caption  = "Data: StatsBomb Open Data | Visualization: ggshakeR"
  )

Interpretation: The point map makes it easy to spot Spain’s high-xG chances, mostly clustered around the six-yard box and the penalty spot, with a visible cluster of goals there. There is also a noticeable tail of low-xG shots from outside the box, which is consistent with a team dominating possession and sometimes settling for a speculative effort when the central route is blocked. The density map reinforces this: the hotspot sits right on the penalty spot rather than on the flanks, which fits how Spain attacked at that tournament, working the ball through the middle.


Example 2 - Pass Networks: Barcelona vs Real Madrid (Clasico 2019/20)

Pass networks show the average position of each player along with the passing connections between them. They are useful for reading team shape and spotting the main build-up hubs. plot_passnet() builds them directly from StatsBomb event data.

plot_passnet(
  data        = events_clasico_net,
  data_type   = "statsbomb",
  team_name   = "Barcelona",
  scale_color = "#a50044"
) +
  labs(
    title    = "Barcelona - Pass Network",
    subtitle = "Barcelona vs Real Madrid | La Liga 2019/20",
    caption  = "Data: StatsBomb Open Data | Visualization: ggshakeR"
  )

plot_passnet(
  data        = events_clasico_net,
  data_type   = "statsbomb",
  team_name   = "Real Madrid",
  scale_color = "#ffd700"
) +
  labs(
    title    = "Real Madrid - Pass Network",
    subtitle = "Barcelona vs Real Madrid | La Liga 2019/20",
    caption  = "Data: StatsBomb Open Data | Visualization: ggshakeR"
  )
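
To make concrete what a pass network summarises, the two ingredients described above (average player positions and the number of passes between each pair) can be computed by hand. This is a rough sketch on a made-up six-pass data frame, not a description of how plot_passnet() works internally; the column names passer and recipient are assumptions chosen for the example.

```r
# Toy completed passes (made-up values) in StatsBomb-style coordinates
passes <- data.frame(
  passer    = c("A", "A", "B", "B", "C", "A"),
  recipient = c("B", "B", "C", "A", "A", "C"),
  x         = c(30, 34, 60, 58, 80, 32),
  y         = c(40, 44, 20, 24, 60, 36),
  stringsAsFactors = FALSE
)

# Nodes: average location from which each player made their passes
nodes <- aggregate(cbind(x, y) ~ passer, data = passes, FUN = mean)

# Edges: number of passes for each passer -> recipient pair
edges <- aggregate(n ~ passer + recipient,
                   data = transform(passes, n = 1),
                   FUN  = sum)

print(nodes)
print(edges)
```

In a plotted network the nodes become the player markers and the edge counts usually drive line width, which is why heavily used connections stand out.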


Example 3 - Heatmap: Messi in the Clasico

A heatmap shows where a player spent his time on the ball across the match. The natural fit here would be plot_heatmap(), but I ran into a compatibility issue between ggshakeR and the current ggplot2 that made it unusable in this setup (details in Part 5). The workaround is to build the heatmap manually with ggplot2 + ggsoccer, which is a reasonable fallback and also lines up nicely with the comparison in Part 4.

messi_touches <- events_clasico %>%
  filter(
    player.name == "Lionel Andrés Messi Cuccittini",
    !is.na(location.x),
    !is.na(location.y)
  )

ggplot(messi_touches, aes(x = location.x, y = location.y)) +
  annotate_pitch(
    dimensions = pitch_statsbomb,
    colour     = "grey60",
    fill       = "white"
  ) +
  geom_hex(bins = 12, alpha = 0.85) +
  scale_fill_gradient(low = "#fff5f0", high = "#a50044", name = "Touches") +
  coord_flip(xlim = c(0, 120), ylim = c(0, 80)) +
  theme_pitch() +
  labs(
    title    = "Lionel Messi - Touch Heatmap",
    subtitle = "Barcelona vs Real Madrid | La Liga 2019/20",
    caption  = "Data: StatsBomb Open Data | Visualization: ggplot2 + ggsoccer"
  )

Interpretation: Most of Messi’s touches sit in the right half-space just outside the box, not centrally. That fits how Barcelona used him this season: drifting wide to receive the ball rather than playing as a pure nine, then carrying or combining into the final third. The cluster of touches near the halfway line also reflects how deep he dropped to find the ball when Real Madrid’s press made the central route difficult.
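
A reading like the one above ultimately rests on counting touches per pitch zone. The sketch below does a coarse rectangular version of that count in base R, as a rough stand-in for the hexagonal binning done by geom_hex(). The touch coordinates are made up for illustration and are not Messi's actual data, and the zone labels are assumptions chosen for the example.

```r
# Made-up touch locations in StatsBomb coordinates (x: 0-120, y: 0-80)
touches <- data.frame(
  x = c(85, 90, 92, 60, 62, 100, 88),
  y = c(60, 62, 58, 40, 45, 65, 61)
)

# Split the pitch into thirds along its length and three lanes across its width
x_zone <- cut(touches$x,
              breaks = c(0, 40, 80, 120),
              labels = c("defensive", "middle", "attacking"),
              include.lowest = TRUE)
y_zone <- cut(touches$y,
              breaks = c(0, 80 / 3, 160 / 3, 80),
              labels = c("low_y", "mid_y", "high_y"),
              include.lowest = TRUE)

# Cross-tabulate: one cell per zone, value = number of touches
zone_counts <- table(x_zone, y_zone)
print(zone_counts)
```

A table like this is a useful cross-check when a heatmap renders blank or looks suspicious, because it confirms whether the underlying data actually contains touches in the expected zones.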


Part 4 - Comparison: ggshakeR vs Raw ggplot2

To make the difference concrete, here is the same shot map built two ways: once with ggshakeR, and once from scratch with ggplot2 and ggsoccer.

Approach A - ggshakeR

# Pitch drawing, colours, and layout are handled internally
# (spain_shots was already rescaled to 0-1 in Example 1)
plot_shot(
  data            = spain_shots,
  type            = "point",
  highlight_goals = TRUE
) +
  labs(title = "Spain Women's Shot Map - ggshakeR approach")

Approach B - Raw ggplot2 + ggsoccer

# Using StatsBomb native coordinates directly from events_wwc
spain_shots_raw <- events_wwc %>%
  filter(
    team.name == "Spain Women's",
    type.name == "Shot"
  ) %>%
  mutate(
    x       = location.x,
    y       = location.y,
    xg_val  = shot.statsbomb_xg,
    is_goal = shot.outcome.name == "Goal"
  )

ggplot(spain_shots_raw) +
  annotate_pitch(
    dimensions = pitch_statsbomb,
    colour     = "grey60",
    fill       = "white"
  ) +
  geom_point(
    aes(
      x     = x,
      y     = y,
      size  = xg_val,
      color = is_goal,
      alpha = xg_val
    )
  ) +
  scale_color_manual(
    values = c("FALSE" = "steelblue", "TRUE" = "red"),
    labels = c("FALSE" = "No goal", "TRUE" = "Goal"),
    name   = "Outcome"
  ) +
  scale_size(range  = c(1, 6), name = "xG") +
  scale_alpha(range = c(0.4, 0.9), guide = "none") +
  coord_flip(xlim = c(0, 120), ylim = c(0, 80)) +
  theme_pitch() +
  labs(title = "Spain Women's Shot Map - raw ggplot2 approach")

What this comparison shows

The ggshakeR version produces very similar output with far less code. Once the coordinates have been rescaled to 0-1, pitch dimensions and colour defaults are taken care of internally, whereas in the raw version every one of those decisions has to be made by hand. The raw ggplot2 approach wins on flexibility: custom colour schemes, faceting by season, and different aspect ratios are all possible. ggshakeR wins on speed of iteration when the chart you want is one of the ones it already supports.

This is essentially the trade-off the module describes between specialized packages and base ggplot2. For exploratory work or a quick report, ggshakeR is faster. For anything custom, building from scratch makes more sense.


Part 5 - Limitations

ggshakeR maintenance status

The last tagged release of ggshakeR is 0.1.2 from March 2022, with a dev version (0.2.0.9002) on GitHub that has not seen meaningful updates since 2023. In practice this means the package has not kept up with the current ggplot2 / tidyverse line, and that led to a concrete blocker in this document.

Since ggplot2 4.0 (released 2025), ggplot objects use the S7 object system rather than the older S3 one. ggshakeR’s internal plotting functions were written against the S3 behaviour. In a current setup (ggplot2 4.0.2 + ggshakeR 0.2.0.9002), functions like plot_heatmap() return a valid ggplot object, but adding layers on top or printing the plot in a knitted document fails silently and produces an empty render: no error, just a blank figure. The problem only became obvious when I compared nrow() of the filtered data (200+ rows) with class() of the returned plot (which contained S7_object).

This is why Example 3 falls back to raw ggplot2 + ggsoccer: the manual version works regardless of ggplot2’s object system. The same class of issue would likely hit most other ggshakeR pitch-plot functions until the package is updated for S7. plot_shot() and plot_passnet() happened to still render in my tests, but I would not rely on any ggshakeR function surviving a ggplot2 update without checking the output carefully.

If you need ggshakeR to behave like the documentation describes, the practical options are pinning ggplot2 to the 3.5.x line with devtools::install_version("ggplot2", "3.5.1"), or treating ggshakeR as a reference for how these charts can be structured and then building them manually with ggplot2 + ggsoccer.
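
The diagnosis described above (comparing the row count of the input with the class of the returned plot) can be wrapped into a small check that runs before trusting a rendered figure. This is a sketch under stated assumptions: ggplot2_uses_s7() and plot_diagnostics() are hypothetical helper names, and the stand-in objects only mimic the class vector so the example runs without ggplot2 installed.

```r
# TRUE when the given ggplot2 version is on the 4.0+ line (S7-based plot objects).
# With no argument it looks up the installed version, which requires ggplot2.
ggplot2_uses_s7 <- function(version = utils::packageVersion("ggplot2")) {
  numeric_version(as.character(version)) >= numeric_version("4.0.0")
}

# Summarise the two things compared in the text: input rows and plot class
plot_diagnostics <- function(plot_object, data) {
  list(
    data_rows  = nrow(data),
    plot_class = class(plot_object),
    is_s7      = inherits(plot_object, "S7_object")
  )
}

# Stand-in objects (not a real plot) so the sketch runs without ggplot2
fake_plot <- structure(list(), class = c("S7_object", "gg"))
fake_data <- data.frame(x = 1:3)

report <- plot_diagnostics(fake_plot, fake_data)
print(report)
print(ggplot2_uses_s7("4.0.2"))
print(ggplot2_uses_s7("3.5.1"))
```

If data_rows is clearly non-zero, is_s7 is TRUE, and the figure is blank, the problem is more likely the package and ggplot2 combination than the filtering step.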

FBref and Understat scraping restrictions

ggshakeR also supports FBref (via worldfootballR) and Understat (via understatr). Both are worth knowing about, but both have tightened access recently. FBref returns HTTP 403 under higher request rates, and Understat currently blocks automated scraping entirely. This came up while preparing this document, and other students in the cohort ran into the same issue. For anything that needs to knit reliably, StatsBomb is the safer choice because the data comes straight from GitHub.

StatsBomb open data coverage

StatsBomb open data is limited to specific competitions and seasons. What is included is excellent, but recent seasons of most leagues are not. For anything current, a paid subscription or FBref would be needed.


Conclusions

StatsBomb Open Data and ggshakeR together cover most of what someone learning football analytics in R actually needs: reliable event-level data with coordinates, plus a fast path from a dataframe to a readable chart. Neither is perfect. StatsBomb’s free coverage does not extend to current seasons, and ggshakeR has fallen behind the current ggplot2 in ways that silently break plots rather than throwing errors, which is the worst failure mode to debug.

Building these three examples made the trade-off between specialised packages and raw ggplot2 more concrete than I expected. ggshakeR is faster to iterate on when it works (Examples 1 and 2), but Example 3 had to fall back to raw ggplot2 + ggsoccer because of a ggplot2 4.0 S7 incompatibility, and that fallback was not complicated. The raw version is maybe 15-20 lines for a heatmap, and those lines are under your control.

For anyone picking up this stack now, I would still start with StatsBomb for the data, but I would be more cautious about ggshakeR than I was going in. It is a good teaching tool and a good reference for how these charts are structured, but for anything that needs to knit reliably over time, raw ggplot2 + ggsoccer is the safer foundation. FBref and Understat remain tempting on paper, but the scraping restrictions I ran into while preparing this made the choice between them and StatsBomb very easy.