Data viz using GGPLOT and Cricket Data

Data Visualisation

This R Markdown document explores how the ggplot2 package can be used to visualise data. The tidyverse includes ggplot2 along with other useful data-wrangling packages for R. The dplyr package will also be used to get the data into the shape we want.

For more info on the ggplot2 package click here

For more info on the dplyr package click here

For the data source the cricketdata package will be used. This package contains various data for all types of cricket competitions globally.

For further info on the cricketdata package click here

For the purpose of this document we will be using the men’s BBL data, since the upcoming season is right around the corner and people are trying to find hidden gems for their Supercoach teams.

Getting started

Let’s install and load the required packages. I’ve included the install.packages commands, commented out, so all you need to do is delete the # and run them:

#install.packages("tidyverse")
#install.packages("cricketdata")
#install.packages("ggrepel")

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(cricketdata)
library(ggrepel)

Getting started

Now that we have loaded our packages, let’s get some data and filter it to the most recent season:

bbl_mens_match <- fetch_cricsheet(
type = "bbb",
gender = "male",
competition = "bbl"
)

# Let's take a quick look at the data format
str(bbl_mens_match)

## tibble [133,547 × 33] (S3: tbl_df/tbl/data.frame)
##  $ match_id              : int [1:133547] 1023649 1023649 1023649 1023649 1023649 1023649 1023649 1023649 1023649 1023649 ...
##  $ season                : chr [1:133547] "2016/17" "2016/17" "2016/17" "2016/17" ...
##  $ start_date            : Date[1:133547], format: "2017-01-28" "2017-01-28" ...
##  $ venue                 : chr [1:133547] "Western Australia Cricket Association Ground" "Western Australia Cricket Association Ground" "Western Australia Cricket Association Ground" "Western Australia Cricket Association Ground" ...
##  $ innings               : int [1:133547] 1 1 1 1 1 1 1 1 1 1 ...
##  $ over                  : num [1:133547] 1 1 1 1 1 1 2 2 2 2 ...
##  $ ball                  : int [1:133547] 1 2 3 4 5 6 1 2 3 4 ...
##  $ batting_team          : chr [1:133547] "Sydney Sixers" "Sydney Sixers" "Sydney Sixers" "Sydney Sixers" ...
##  $ bowling_team          : chr [1:133547] "Perth Scorchers" "Perth Scorchers" "Perth Scorchers" "Perth Scorchers" ...
##  $ striker               : chr [1:133547] "DP Hughes" "DP Hughes" "DP Hughes" "DP Hughes" ...
##  $ non_striker           : chr [1:133547] "MJ Lumb" "MJ Lumb" "MJ Lumb" "MJ Lumb" ...
##  $ bowler                : chr [1:133547] "MG Johnson" "MG Johnson" "MG Johnson" "MG Johnson" ...
##  $ runs_off_bat          : int [1:133547] 0 0 0 0 1 0 0 0 6 1 ...
##  $ extras                : int [1:133547] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ball_in_over          : int [1:133547] 1 2 3 4 5 6 1 2 3 4 ...
##  $ extra_ball            : logi [1:133547] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ balls_remaining       : num [1:133547] 119 118 117 116 115 114 113 112 111 110 ...
##  $ runs_scored_yet       : int [1:133547] 0 0 0 0 1 1 1 1 7 8 ...
##  $ wicket                : logi [1:133547] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ wickets_lost_yet      : int [1:133547] 0 0 0 0 0 0 0 0 0 0 ...
##  $ innings1_total        : int [1:133547] 141 141 141 141 141 141 141 141 141 141 ...
##  $ innings2_total        : int [1:133547] 144 144 144 144 144 144 144 144 144 144 ...
##  $ target                : num [1:133547] 142 142 142 142 142 142 142 142 142 142 ...
##  $ wides                 : int [1:133547] NA NA NA NA NA NA NA NA NA NA ...
##  $ noballs               : int [1:133547] NA NA NA NA NA NA NA NA NA NA ...
##  $ byes                  : int [1:133547] NA NA NA NA NA NA NA NA NA NA ...
##  $ legbyes               : int [1:133547] NA NA NA NA NA NA NA NA NA NA ...
##  $ penalty               : logi [1:133547] NA NA NA NA NA NA ...
##  $ wicket_type           : chr [1:133547] "" "" "" "" ...
##  $ player_dismissed      : chr [1:133547] "" "" "" "" ...
##  $ other_wicket_type     : chr [1:133547] "" "" "" "" ...
##  $ other_player_dismissed: chr [1:133547] "" "" "" "" ...
##  $ .groups               : chr [1:133547] "drop" "drop" "drop" "drop" ...

#Filtering to last season
bbl_mens_data <- bbl_mens_match %>% 
  filter(season == "2023/24")

#Looking at the data format
str(bbl_mens_data)

## tibble [9,277 × 33] (S3: tbl_df/tbl/data.frame)
##  $ match_id              : int [1:9277] 1386094 1386094 1386094 1386094 1386094 1386094 1386094 1386094 1386094 1386094 ...
##  $ season                : chr [1:9277] "2023/24" "2023/24" "2023/24" "2023/24" ...
##  $ start_date            : Date[1:9277], format: "2023-12-07" "2023-12-07" ...
##  $ venue                 : chr [1:9277] "Brisbane Cricket Ground, Woolloongabba, Brisbane" "Brisbane Cricket Ground, Woolloongabba, Brisbane" "Brisbane Cricket Ground, Woolloongabba, Brisbane" "Brisbane Cricket Ground, Woolloongabba, Brisbane" ...
##  $ innings               : int [1:9277] 1 1 1 1 1 1 1 1 1 1 ...
##  $ over                  : num [1:9277] 1 1 1 1 1 1 2 2 2 2 ...
##  $ ball                  : int [1:9277] 1 2 3 4 5 6 1 2 3 4 ...
##  $ batting_team          : chr [1:9277] "Brisbane Heat" "Brisbane Heat" "Brisbane Heat" "Brisbane Heat" ...
##  $ bowling_team          : chr [1:9277] "Melbourne Stars" "Melbourne Stars" "Melbourne Stars" "Melbourne Stars" ...
##  $ striker               : chr [1:9277] "UT Khawaja" "UT Khawaja" "UT Khawaja" "UT Khawaja" ...
##  $ non_striker           : chr [1:9277] "C Munro" "C Munro" "C Munro" "C Munro" ...
##  $ bowler                : chr [1:9277] "OP Stone" "OP Stone" "OP Stone" "OP Stone" ...
##  $ runs_off_bat          : int [1:9277] 2 0 0 1 0 0 4 2 1 4 ...
##  $ extras                : int [1:9277] 0 0 0 0 0 0 0 0 0 0 ...
##  $ ball_in_over          : int [1:9277] 1 2 3 4 5 6 1 2 3 4 ...
##  $ extra_ball            : logi [1:9277] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ balls_remaining       : num [1:9277] 119 118 117 116 115 114 113 112 111 110 ...
##  $ runs_scored_yet       : int [1:9277] 2 2 2 3 3 3 7 9 10 14 ...
##  $ wicket                : logi [1:9277] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ wickets_lost_yet      : int [1:9277] 0 0 0 0 0 0 0 0 0 0 ...
##  $ innings1_total        : int [1:9277] 214 214 214 214 214 214 214 214 214 214 ...
##  $ innings2_total        : int [1:9277] 111 111 111 111 111 111 111 111 111 111 ...
##  $ target                : num [1:9277] 215 215 215 215 215 215 215 215 215 215 ...
##  $ wides                 : int [1:9277] NA NA NA NA NA NA NA NA NA NA ...
##  $ noballs               : int [1:9277] NA NA NA NA NA NA NA NA NA NA ...
##  $ byes                  : int [1:9277] NA NA NA NA NA NA NA NA NA NA ...
##  $ legbyes               : int [1:9277] NA NA NA NA NA NA NA NA NA NA ...
##  $ penalty               : logi [1:9277] NA NA NA NA NA NA ...
##  $ wicket_type           : chr [1:9277] "" "" "" "" ...
##  $ player_dismissed      : chr [1:9277] "" "" "" "" ...
##  $ other_wicket_type     : chr [1:9277] "" "" "" "" ...
##  $ other_player_dismissed: chr [1:9277] "" "" "" "" ...
##  $ .groups               : chr [1:9277] "drop" "drop" "drop" "drop" ...
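Before hard-coding a season label such as "2023/24", it can help to check which labels actually exist in the data. The sketch below uses a small made-up tibble (`toy_balls` is a stand-in, not the real cricketdata output) and assumes every label follows the "YYYY/YY" pattern, so that alphabetical order matches chronological order.

```r
library(dplyr)

# Toy stand-in for the ball-by-ball tibble; only the season column matters here
toy_balls <- tibble(
  match_id = c(1L, 1L, 2L, 2L, 3L),
  season   = c("2022/23", "2022/23", "2023/24", "2023/24", "2023/24")
)

# Count deliveries per season to see which labels are present
season_counts <- toy_balls %>%
  count(season, name = "deliveries")

# Assumption: labels sort chronologically, so max() picks the latest season
latest_season <- max(toy_balls$season)
latest_balls  <- toy_balls %>% filter(season == latest_season)
```

On the real data the same `count(season)` call would list every BBL season in the download, which makes it easy to spot a label that is formatted differently.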

Before we start data wrangling let’s think about what we want to portray.

Let’s do some basic visualizations of player stats per game and averages. To do this we will need to clean our data.

Data Cleaning and Wrangling

Now that we know what we want to do let’s get into data wrangling:

First, create a new column that gives us the team matchups. This will allow us to look at player data on a game level. We will also be able to generate average data later.

#Let's create a new column for team matchups
bbl_mens_data <- bbl_mens_data %>% 
  mutate(match_teams = paste(batting_team, "vs", bowling_team))
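One thing to keep in mind is that match_teams is built from the batting and bowling teams, so the same game gets two labels: "A vs B" in the first innings and "B vs A" in the second. If a single label per game is ever needed, one possible approach is to sort the two names before pasting them. The sketch below uses made-up rows, and the `fixture` column name is hypothetical.

```r
library(dplyr)

# Two innings of the same game: the teams swap roles
toy_innings <- tibble(
  innings      = c(1L, 2L),
  batting_team = c("Melbourne Stars", "Melbourne Renegades"),
  bowling_team = c("Melbourne Renegades", "Melbourne Stars")
)

toy_labelled <- toy_innings %>%
  mutate(
    # Innings-dependent label, as used in this document
    match_teams = paste(batting_team, "vs", bowling_team),
    # Hypothetical order-independent label: alphabetically first team goes first
    fixture = paste(pmin(batting_team, bowling_team), "vs",
                    pmax(batting_team, bowling_team))
  )
```

Here match_teams takes two different values while fixture takes one, which is why any filter on match_teams needs to list both orderings.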

Next, let’s create summary stats for players in each game:

# Batting player summary
batting_player_summary <- bbl_mens_data %>%
  group_by(match_id, match_teams, striker) %>%
  summarize(
    total_runs = sum(runs_off_bat, na.rm = TRUE),
    boundaries = sum(if_else(runs_off_bat >= 4 & runs_off_bat < 6, 1, 0), na.rm = TRUE),
    sixes = sum(if_else(runs_off_bat == 6, 1, 0), na.rm = TRUE),
    total_balls_faced = n(),
    method_of_dismissal = ifelse(all(wicket_type == ""), "not out", paste(unique(wicket_type), collapse = "")),  # Show "not out" if blank
    .groups = "drop"
  ) %>%
  rename(player = striker) %>%
  mutate(role = "batsman",
         strike_rate = (total_runs / total_balls_faced) * 100,  # Calculate strike rate
         team = gsub(" vs .*", "", match_teams)  # Extract the first team using gsub
  )

# Bowling player summary
bowling_player_summary <- bbl_mens_data %>%
  group_by(match_id, match_teams, bowler) %>%
  summarize(
    total_wickets = sum(wicket, na.rm = TRUE),
    total_balls_bowled = n(),  # Count total deliveries bowled
    total_runs_conceded = sum(runs_off_bat + extras, na.rm = TRUE),  # Calculate total runs conceded
    caught = sum(wicket_type == "caught", na.rm = TRUE),
    run_out = sum(wicket_type == "run out", na.rm = TRUE),
    bowled = sum(wicket_type == "bowled", na.rm = TRUE),
    stumped = sum(wicket_type == "stumped", na.rm = TRUE),
    lbw = sum(wicket_type == "lbw", na.rm = TRUE),
    caught_and_bowled = sum(wicket_type == "caught and bowled", na.rm = TRUE),
    hit_wicket = sum(wicket_type == "hit wicket", na.rm = TRUE),
    obstructing_the_field = sum(wicket_type == "obstructing the field", na.rm = TRUE),
    .groups = "drop"
  ) %>%
  rename(player = bowler) %>%
  mutate(role = "bowler",
         total_overs_bowled = total_balls_bowled / 6,  # Convert balls to overs
         economy_rate = total_runs_conceded / total_overs_bowled,  # Calculate economy rate
         team = gsub(".* vs ", "", match_teams)  # Extract the second team
  )

For batters, the batting_player_summary includes key metrics such as total runs scored, the number of boundaries and sixes hit, and the total balls faced, which together provide insights into scoring efficiency. Additionally, we’ve introduced a method_of_dismissal column that indicates whether a player was “not out” or specifies the type of wicket when applicable. The strike rate has also been calculated to assess each batter’s scoring speed, presenting a clear picture of their performance in the match context.

On the bowling side, the bowling_player_summary captures essential statistics such as the total wickets taken, total balls bowled, and total runs conceded, allowing for an evaluation of a bowler’s effectiveness. We have also included detailed wicket type counts to provide insights into how wickets were achieved. Furthermore, the summary features the total overs bowled and the economy rate, which helps gauge the bowler’s ability to restrict runs during their overs.

Together, these metrics provide an overview of player performances in the matches from the 2023/24 season of the BBL.
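To make the two rate formulas concrete, here is a small worked example on made-up deliveries (the player names and numbers are invented). Strike rate is runs per 100 balls faced, and economy rate is runs conceded per over of six balls, following the same definitions as the summaries above.

```r
library(dplyr)

# Hypothetical deliveries: one batter facing one bowler for 8 balls
toy_balls <- tibble(
  striker      = "Batter A",
  bowler       = "Bowler B",
  runs_off_bat = c(0L, 1L, 4L, 0L, 6L, 2L, 0L, 1L),
  extras       = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L)
)

toy_batting <- toy_balls %>%
  group_by(striker) %>%
  summarise(
    total_runs        = sum(runs_off_bat),                    # 14 runs
    total_balls_faced = n(),                                  # 8 balls
    strike_rate       = total_runs / total_balls_faced * 100, # 14 / 8 * 100
    .groups = "drop"
  )

toy_bowling <- toy_balls %>%
  group_by(bowler) %>%
  summarise(
    total_runs_conceded = sum(runs_off_bat + extras),         # 15 runs
    total_balls_bowled  = n(),                                # 8 balls
    economy_rate = total_runs_conceded / (total_balls_bowled / 6),
    .groups = "drop"
  )
```

The batter ends up with a strike rate of 175, and the bowler with an economy rate of 11.25 (15 runs from one and a third overs).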

Data Visualization

Geom_point

Let’s use geom_point to represent runs scored vs balls faced for the batters in the Melbourne Stars vs Renegades game.

# Filter for the specific match (Melbourne Stars vs Renegades)
stars_vs_renegades <- batting_player_summary %>%
  filter(match_teams %in% c("Melbourne Stars vs Melbourne Renegades", "Melbourne Renegades vs Melbourne Stars"))  # Keep both innings of the matchup
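When a filter has to match either of two labels, `%in%` is the usual choice. Comparing with `==` against a two-element vector does not raise an error, but R recycles the vector position by position, so rows can be dropped silently. The sketch below shows the difference on made-up rows (the player names are invented).

```r
library(dplyr)

toy_summary <- tibble(
  match_teams = c(
    "Melbourne Stars vs Melbourne Renegades",
    "Melbourne Stars vs Melbourne Renegades",
    "Melbourne Renegades vs Melbourne Stars",
    "Melbourne Renegades vs Melbourne Stars"
  ),
  player = c("Batter A", "Batter B", "Batter C", "Batter D")
)

wanted <- c("Melbourne Stars vs Melbourne Renegades",
            "Melbourne Renegades vs Melbourne Stars")

# `==` compares row 1 with wanted[1], row 2 with wanted[2], row 3 with wanted[1], ...
kept_by_equals <- toy_summary %>% filter(match_teams == wanted)

# `%in%` asks, for every row, whether its label appears anywhere in `wanted`
kept_by_in <- toy_summary %>% filter(match_teams %in% wanted)
```

In this toy case `==` keeps only two of the four rows, while `%in%` keeps all four.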

# Create the scatter plot
ggplot(stars_vs_renegades, 
       aes(x = total_balls_faced, y = total_runs, 
           color = team)) +
  geom_point(size = 1) +  # Points for each player
  geom_text(aes(label = player), vjust = -1, hjust = 0.5, size = 4) +  # Add player names as labels
  labs(
    title = "Runs Scored vs Balls Faced: Melbourne Stars vs Renegades",
    x = "Total Balls Faced",
    y = "Total Runs Scored"
  ) +
  scale_color_manual(values = c("red", "green")) +  # Specify colors for teams
  theme_light()

We can see that the plot doesn’t look very nice and is quite clustered.

Let’s use a facet wrap.

Facet Wrap

# Create the scatter plot with facets
ggplot(stars_vs_renegades, 
       aes(x = total_balls_faced, y = total_runs)) +
  geom_point(size = 3) +  # Increase point size for better visibility
  geom_text_repel(aes(label = player), vjust = -1, hjust = 0.5, size = 4) +  # Add player names as labels
  labs(
    title = "Runs Scored vs Balls Faced: Melbourne Stars vs Renegades",
    x = "Total Balls Faced",
    y = "Total Runs Scored"
  ) +
  theme_light() +  # Use a clean theme for clarity
  theme(plot.title = element_text(hjust = 0.5)) +  # Center the title
  facet_wrap(~ team)  # Facet by team

The geom_point does an okay job of portraying the data, but let’s use geom_jitter since it does a slightly better job with data points that are clustered. Let’s also explore theme_minimal from ggplot2, which adjusts how the graph background looks. There is a range of different themes that can be used, and they can also be set up manually.

Geom_jitter

ggplot(stars_vs_renegades, aes(x = total_balls_faced, y = total_runs, color = team)) +
  geom_jitter(size = 3, width = 0.1, height = 0) +
  geom_text_repel(aes(label = player), size = 4) +
  labs(
    title = "Runs Scored vs Balls Faced: Melbourne Stars vs Renegades",
    x = "Total Balls Faced",
    y = "Total Runs Scored"
  ) +
  scale_color_manual(values = c("Melbourne Stars" = "lightgreen", "Melbourne Renegades" = "red")) +  # Specify colors for teams
  theme_minimal()
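A side note on jitter: geom_jitter moves each point by a small random amount, and a separate label layer does not know about that movement. With a width of 0.1 the offset is barely visible, but with larger widths the labels can drift away from their points. One possible way around this, sketched here on made-up data with plain geom_text, is to build a single position_jitter object with a fixed seed and hand it to both layers.

```r
library(ggplot2)

# Made-up batters with overlapping values
toy_batters <- data.frame(
  player            = c("Batter A", "Batter B", "Batter C"),
  total_balls_faced = c(10, 10, 10),
  total_runs        = c(12, 12, 15)
)

# One position object with a fixed seed, reused by both layers
shared_jitter <- position_jitter(width = 0.3, height = 0, seed = 42)

p <- ggplot(toy_batters, aes(x = total_balls_faced, y = total_runs)) +
  geom_point(size = 3, position = shared_jitter) +
  geom_text(aes(label = player), position = shared_jitter, vjust = -1) +
  theme_minimal()

# Compare the x positions each layer ends up with
point_x <- layer_data(p, 1)$x
label_x <- layer_data(p, 2)$x
```

Because both layers use the same seed, the points and their labels receive the same horizontal shift, and the plot is also reproducible from run to run.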

Let’s facet wrap this now.

ggplot(stars_vs_renegades, aes(x = total_balls_faced, y = total_runs, color = team)) +
  geom_jitter(size = 3, width = 0.1, height = 0) +
  geom_text_repel(aes(label = player), size = 4) +
  labs(
    title = "Runs Scored vs Balls Faced: Melbourne Stars vs Renegades",
    x = "Total Balls Faced",
    y = "Total Runs Scored"
  ) +
  scale_color_manual(values = c("Melbourne Stars" = "lightgreen", "Melbourne Renegades" = "red")) +  # Specify colors for teams
  theme_minimal() +
  facet_wrap(~ team, scales = "free")  # Free scales for x and y axes

Now there are various different ways we can do a deep dive into the batting data and look at other plots. Let’s move into the bowling data.

Bowling Viz

Now let’s filter our data to one matchup again and look at only one team this time.

Let’s look at the Sydney Sixers’ bowling data from both of their games against the Sydney Thunder.

sixers_thunder <- bowling_player_summary %>%
  filter(match_teams == "Sydney Thunder vs Sydney Sixers")

str(sixers_thunder)

## tibble [13 × 18] (S3: tbl_df/tbl/data.frame)
##  $ match_id             : int [1:13] 1386112 1386112 1386112 1386112 1386112 1386112 1386112 1386127 1386127 1386127 ...
##  $ match_teams          : chr [1:13] "Sydney Thunder vs Sydney Sixers" "Sydney Thunder vs Sydney Sixers" "Sydney Thunder vs Sydney Sixers" "Sydney Thunder vs Sydney Sixers" ...
##  $ player               : chr [1:13] "BJ Dwarshuis" "J Edwards" "JA Davies" "JM Bird" ...
##  $ total_wickets        : int [1:13] 1 3 0 2 0 1 0 1 3 1 ...
##  $ total_balls_bowled   : int [1:13] 24 27 12 24 6 24 6 24 17 15 ...
##  $ total_runs_conceded  : int [1:13] 24 26 9 29 7 47 9 17 25 17 ...
##  $ caught               : int [1:13] 1 3 0 2 0 1 0 1 2 1 ...
##  $ run_out              : int [1:13] 0 0 0 0 0 0 0 0 1 0 ...
##  $ bowled               : int [1:13] 0 0 0 0 0 0 0 0 0 0 ...
##  $ stumped              : int [1:13] 0 0 0 0 0 0 0 0 0 0 ...
##  $ lbw                  : int [1:13] 0 0 0 0 0 0 0 0 0 0 ...
##  $ caught_and_bowled    : int [1:13] 0 0 0 0 0 0 0 0 0 0 ...
##  $ hit_wicket           : int [1:13] 0 0 0 0 0 0 0 0 0 0 ...
##  $ obstructing_the_field: int [1:13] 0 0 0 0 0 0 0 0 0 0 ...
##  $ role                 : chr [1:13] "bowler" "bowler" "bowler" "bowler" ...
##  $ total_overs_bowled   : num [1:13] 4 4.5 2 4 1 ...
##  $ economy_rate         : num [1:13] 6 5.78 4.5 7.25 7 ...
##  $ team                 : chr [1:13] "Sydney Sixers" "Sydney Sixers" "Sydney Sixers" "Sydney Sixers" ...

Let’s compare economy rates with wickets taken to get a feel for who bowled best.

# Create a bar plot of economy rates vs. total wickets taken
ggplot(sixers_thunder, aes(x = player, y = economy_rate)) +
  geom_bar(stat = "identity", fill = "lightpink") +  # Use a single fill color for the bars
  geom_text(aes(label = total_wickets), 
            vjust = -0.5, size = 4) +  # Add total wickets as labels on top of bars
  labs(
    title = "Economy Rate vs Total Wickets: Best Bowlers",
    x = "Bowlers",
    y = "Economy Rate"
  ) +
  theme_minimal() +
  coord_flip() +  # Flip coordinates for better readability
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Adjust text angle for better visibility

As we can see, some players have two values, which means they bowled in both games. Let’s use the facet wrap again, this time by match.

# Create a bar plot of economy rates vs. total wickets taken
ggplot(sixers_thunder, aes(x = player, y = economy_rate)) +
  geom_bar(stat = "identity", fill = "lightpink") +  # Use a single fill color for the bars
  geom_text(aes(label = total_wickets), 
            vjust = +0.5, hjust = +1, size = 4) +  # Place total wickets just inside the end of each bar
  labs(
    title = "Economy Rate vs Total Wickets from the Sixers",
    x = "Bowlers",
    y = "Economy Rate"
  ) +
  theme_minimal() +
  coord_flip() +  # Flip coordinates for better readability
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  facet_wrap(~ match_id)  # Create separate plots for each match

That looks a lot better.
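The reason the unfaceted chart was hard to read is that a bar layer with stat = "identity" stacks rows that share the same x value, so a bowler who appears twice gets one bar whose length is the sum of both economy rates. The sketch below shows this on made-up numbers using geom_col, which is the shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Made-up spells: Bowler A appears in two matches
toy_bowling <- data.frame(
  player       = c("Bowler A", "Bowler A", "Bowler B"),
  match_id     = c(1L, 2L, 1L),
  economy_rate = c(6, 7, 8)
)

p <- ggplot(toy_bowling, aes(x = player, y = economy_rate)) +
  geom_col()

# Upper edge of every drawn bar segment; stacked segments sit on top of each other
bar_tops    <- layer_data(p)$ymax
tallest_bar <- max(bar_tops)
```

Bowler A's stacked bar reaches 13 (6 + 7), which is not an economy rate anyone actually bowled. Faceting by match_id avoids this because each panel then holds one row per player.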

Let’s now quickly look at some average stats for the bowlers from each team. We’ll use a facet wrap to separate by team and look at the economy rates and wickets taken on average per game.

Visualizing the Average Stats

Let’s begin by generating the averages.

bowling_avg <- bowling_player_summary %>% 
  group_by(player, team) %>% 
  summarise(avg_wickets = round(mean(total_wickets), 2),
            avg_economy = round(mean(economy_rate), 2))

## `summarise()` has grouped output by 'player'. You can override using the
## `.groups` argument.
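A caveat on averaging: the mean of per-game economy rates gives a one-over spell the same weight as a four-over spell. An alternative, shown below on made-up numbers, is to divide total runs conceded by total overs bowled. The `overall_economy` name is hypothetical, and which version is more useful depends on the question being asked.

```r
library(dplyr)

# Made-up spells for one bowler: four overs for 24, then one over for 12
toy_spells <- tibble(
  player              = "Bowler A",
  team                = "Team X",
  total_balls_bowled  = c(24L, 6L),
  total_runs_conceded = c(24L, 12L)
) %>%
  mutate(economy_rate = total_runs_conceded / (total_balls_bowled / 6))

toy_avg <- toy_spells %>%
  group_by(player, team) %>%
  summarise(
    # Mean of the per-game rates: (6 + 12) / 2
    avg_economy = round(mean(economy_rate), 2),
    # Pooled rate: 36 runs from 5 overs
    overall_economy = round(sum(total_runs_conceded) / (sum(total_balls_bowled) / 6), 2),
    .groups = "drop"  # also silences the grouping message from summarise()
  )
```

The per-game average comes out at 9, while the pooled figure is 7.2, so the short expensive spell pulls the simple average up noticeably.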

Now that we have these metrics let’s visualize them.

ggplot(bowling_avg, aes(x = player, y = avg_economy)) +
  geom_bar(stat = "identity") +  # One bar per bowler showing average economy rate
  geom_text(aes(label = avg_wickets), 
            vjust = +0.5, hjust = +1, size = 4) +  # Add average wickets as labels at the end of bars
  labs(
    title = "Average Economy Rate vs Average Wickets for the Competition",
    x = "Bowlers",
    y = "Economy Rate"
  ) +
  theme_minimal() +
  coord_flip() +  # Flip coordinates for better readability
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  facet_wrap(~ team, scales = "free")  # Create separate plots for each team

This is actually pretty hard to read because too much data is being squeezed into a small space.

Let’s use geom_jitter with geom_text_repel instead.

ggplot(bowling_avg, aes(x = avg_economy, y = avg_wickets)) +
  geom_jitter(width = 0.1, height = 0, size = 1) +  # Jitter plot for better separation
  geom_text_repel(aes(label = player), vjust = -0.5, hjust = 1, size = 2) +  # Add player names as labels
  labs(
    title = "Average Economy Rate vs Average Wickets for the Competition",
    x = "Economy Rate",
    y = "Wickets"
  ) +
  theme_minimal() +  # Use a minimal theme
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  facet_wrap(~ team, ncol = 4, nrow = 3, scales = "free")  # One panel per team on a 4-column grid

This looks a bit more readable, and now further exploration at the team level could focus on the types of wickets players take.

We could delve even deeper by examining whether a particular bowler consistently takes the wicket of a specific batter throughout the competition. Additionally, we can investigate how economy rates change depending on the opposition, which may suggest that a certain bowler could perform better in Supercoach against a specific opponent. We can also explore how different grounds impact scoring and the types of bowlers taking wickets (spin vs. pace). This document should give you a good starting point for the data manipulation and visualization that R is capable of.
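As a starting point for the opposition idea, here is a minimal sketch that groups deliveries by bowler and batting team. It runs on made-up data with the same column names as the ball-by-ball tibble; the teams, the bowler and the numbers are invented.

```r
library(dplyr)

# Made-up deliveries: one over against each of two opponents
toy_balls <- tibble(
  bowler       = "Bowler A",
  batting_team = c(rep("Team X", 6), rep("Team Y", 6)),
  runs_off_bat = c(0L, 1L, 0L, 4L, 0L, 1L, 6L, 4L, 1L, 0L, 2L, 1L),
  extras       = 0L
)

economy_by_opposition <- toy_balls %>%
  group_by(bowler, batting_team) %>%
  summarise(
    balls         = n(),
    runs_conceded = sum(runs_off_bat + extras),
    economy_rate  = runs_conceded / (balls / 6),
    .groups = "drop"
  ) %>%
  arrange(economy_rate)  # Cheapest opposition first
```

In this toy example the bowler goes at 6 an over against Team X and 14 against Team Y. On real data it would be worth requiring a minimum number of balls per opponent before reading much into the differences.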

Happy coding!

Data viz using GGPLOT and Cricket Data

Marcus Rotili

2024-11-01
