This R Markdown document explores how the ggplot2 package can be
used to visualise data. The tidyverse
contains ggplot2 as well as other useful data wrangling
packages for R. The dplyr package will also
be used to get the data into the shape we want.
For more info on the ggplot2 package click here
For more info on the dplyr package click here
For the data source the cricketdata package will be
used. This package contains various data for all types of cricket
competitions globally.
For further info on the cricketdata package click
here
For the purpose of this document we will be using the men’s BBL data, since the upcoming season is right around the corner and people are trying to find hidden gems for their Supercoach teams.
Let’s install and load the required packages. The
install.packages commands are included but commented out, so all you need to do is delete
the # and run them:
#install.packages("tidyverse")
#install.packages("cricketdata")
#install.packages("ggrepel")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(cricketdata)
library(ggrepel)
Now that we have loaded our packages, let’s get some data and filter it to the most recent season:
bbl_mens_match <- fetch_cricsheet(
type = "bbb",
gender = "male",
competition = "bbl"
)
# Let's take a quick look at the data format
str(bbl_mens_match)
## tibble [133,547 × 33] (S3: tbl_df/tbl/data.frame)
## $ match_id : int [1:133547] 1023649 1023649 1023649 1023649 1023649 1023649 1023649 1023649 1023649 1023649 ...
## $ season : chr [1:133547] "2016/17" "2016/17" "2016/17" "2016/17" ...
## $ start_date : Date[1:133547], format: "2017-01-28" "2017-01-28" ...
## $ venue : chr [1:133547] "Western Australia Cricket Association Ground" "Western Australia Cricket Association Ground" "Western Australia Cricket Association Ground" "Western Australia Cricket Association Ground" ...
## $ innings : int [1:133547] 1 1 1 1 1 1 1 1 1 1 ...
## $ over : num [1:133547] 1 1 1 1 1 1 2 2 2 2 ...
## $ ball : int [1:133547] 1 2 3 4 5 6 1 2 3 4 ...
## $ batting_team : chr [1:133547] "Sydney Sixers" "Sydney Sixers" "Sydney Sixers" "Sydney Sixers" ...
## $ bowling_team : chr [1:133547] "Perth Scorchers" "Perth Scorchers" "Perth Scorchers" "Perth Scorchers" ...
## $ striker : chr [1:133547] "DP Hughes" "DP Hughes" "DP Hughes" "DP Hughes" ...
## $ non_striker : chr [1:133547] "MJ Lumb" "MJ Lumb" "MJ Lumb" "MJ Lumb" ...
## $ bowler : chr [1:133547] "MG Johnson" "MG Johnson" "MG Johnson" "MG Johnson" ...
## $ runs_off_bat : int [1:133547] 0 0 0 0 1 0 0 0 6 1 ...
## $ extras : int [1:133547] 0 0 0 0 0 0 0 0 0 0 ...
## $ ball_in_over : int [1:133547] 1 2 3 4 5 6 1 2 3 4 ...
## $ extra_ball : logi [1:133547] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ balls_remaining : num [1:133547] 119 118 117 116 115 114 113 112 111 110 ...
## $ runs_scored_yet : int [1:133547] 0 0 0 0 1 1 1 1 7 8 ...
## $ wicket : logi [1:133547] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ wickets_lost_yet : int [1:133547] 0 0 0 0 0 0 0 0 0 0 ...
## $ innings1_total : int [1:133547] 141 141 141 141 141 141 141 141 141 141 ...
## $ innings2_total : int [1:133547] 144 144 144 144 144 144 144 144 144 144 ...
## $ target : num [1:133547] 142 142 142 142 142 142 142 142 142 142 ...
## $ wides : int [1:133547] NA NA NA NA NA NA NA NA NA NA ...
## $ noballs : int [1:133547] NA NA NA NA NA NA NA NA NA NA ...
## $ byes : int [1:133547] NA NA NA NA NA NA NA NA NA NA ...
## $ legbyes : int [1:133547] NA NA NA NA NA NA NA NA NA NA ...
## $ penalty : logi [1:133547] NA NA NA NA NA NA ...
## $ wicket_type : chr [1:133547] "" "" "" "" ...
## $ player_dismissed : chr [1:133547] "" "" "" "" ...
## $ other_wicket_type : chr [1:133547] "" "" "" "" ...
## $ other_player_dismissed: chr [1:133547] "" "" "" "" ...
## $ .groups : chr [1:133547] "drop" "drop" "drop" "drop" ...
#Filtering to last season
bbl_mens_data <- bbl_mens_match %>%
filter(season == "2023/24")
#Looking at the data format
str(bbl_mens_data)
## tibble [9,277 × 33] (S3: tbl_df/tbl/data.frame)
## $ match_id : int [1:9277] 1386094 1386094 1386094 1386094 1386094 1386094 1386094 1386094 1386094 1386094 ...
## $ season : chr [1:9277] "2023/24" "2023/24" "2023/24" "2023/24" ...
## $ start_date : Date[1:9277], format: "2023-12-07" "2023-12-07" ...
## $ venue : chr [1:9277] "Brisbane Cricket Ground, Woolloongabba, Brisbane" "Brisbane Cricket Ground, Woolloongabba, Brisbane" "Brisbane Cricket Ground, Woolloongabba, Brisbane" "Brisbane Cricket Ground, Woolloongabba, Brisbane" ...
## $ innings : int [1:9277] 1 1 1 1 1 1 1 1 1 1 ...
## $ over : num [1:9277] 1 1 1 1 1 1 2 2 2 2 ...
## $ ball : int [1:9277] 1 2 3 4 5 6 1 2 3 4 ...
## $ batting_team : chr [1:9277] "Brisbane Heat" "Brisbane Heat" "Brisbane Heat" "Brisbane Heat" ...
## $ bowling_team : chr [1:9277] "Melbourne Stars" "Melbourne Stars" "Melbourne Stars" "Melbourne Stars" ...
## $ striker : chr [1:9277] "UT Khawaja" "UT Khawaja" "UT Khawaja" "UT Khawaja" ...
## $ non_striker : chr [1:9277] "C Munro" "C Munro" "C Munro" "C Munro" ...
## $ bowler : chr [1:9277] "OP Stone" "OP Stone" "OP Stone" "OP Stone" ...
## $ runs_off_bat : int [1:9277] 2 0 0 1 0 0 4 2 1 4 ...
## $ extras : int [1:9277] 0 0 0 0 0 0 0 0 0 0 ...
## $ ball_in_over : int [1:9277] 1 2 3 4 5 6 1 2 3 4 ...
## $ extra_ball : logi [1:9277] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ balls_remaining : num [1:9277] 119 118 117 116 115 114 113 112 111 110 ...
## $ runs_scored_yet : int [1:9277] 2 2 2 3 3 3 7 9 10 14 ...
## $ wicket : logi [1:9277] FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ wickets_lost_yet : int [1:9277] 0 0 0 0 0 0 0 0 0 0 ...
## $ innings1_total : int [1:9277] 214 214 214 214 214 214 214 214 214 214 ...
## $ innings2_total : int [1:9277] 111 111 111 111 111 111 111 111 111 111 ...
## $ target : num [1:9277] 215 215 215 215 215 215 215 215 215 215 ...
## $ wides : int [1:9277] NA NA NA NA NA NA NA NA NA NA ...
## $ noballs : int [1:9277] NA NA NA NA NA NA NA NA NA NA ...
## $ byes : int [1:9277] NA NA NA NA NA NA NA NA NA NA ...
## $ legbyes : int [1:9277] NA NA NA NA NA NA NA NA NA NA ...
## $ penalty : logi [1:9277] NA NA NA NA NA NA ...
## $ wicket_type : chr [1:9277] "" "" "" "" ...
## $ player_dismissed : chr [1:9277] "" "" "" "" ...
## $ other_wicket_type : chr [1:9277] "" "" "" "" ...
## $ other_player_dismissed: chr [1:9277] "" "" "" "" ...
## $ .groups : chr [1:9277] "drop" "drop" "drop" "drop" ...
Before we start data wrangling, let’s think about what we want to show: some basic visualizations of player stats per game, plus averages across the season.
To do this we will need to reshape our data.
Now that we know what we want to do, let’s get into the data
wrangling:
First, create a new column that gives us the team matchup. This will
allow us to look at player data at the game level. We will also be able to
generate averages later.
#Let's create a new column for team matchups
bbl_mens_data <- bbl_mens_data %>%
mutate(match_teams = paste(batting_team, "vs", bowling_team))
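Because match_teams is built from the batting and bowling sides, a single match shows up under two labels, one per innings. The sketch below uses made-up rows to show this, and how `%in%` picks up both labels when filtering. The tibble `toy` and the team names are illustrative only, not taken from the BBL data.

```r
library(dplyr)

# Made-up rows: one match, two innings (not real BBL data)
toy <- tibble(
  match_id     = c(1L, 1L, 1L, 1L),
  innings      = c(1L, 1L, 2L, 2L),
  batting_team = c("Team A", "Team A", "Team B", "Team B"),
  bowling_team = c("Team B", "Team B", "Team A", "Team A")
) %>%
  mutate(match_teams = paste(batting_team, "vs", bowling_team))

# One match_id, but two distinct labels
labels <- unique(toy$match_teams)
print(labels)

# %in% keeps rows matching either label; == against a length-2 vector
# compares element by element (with recycling) and can silently drop rows
both_innings <- toy %>%
  filter(match_teams %in% c("Team A vs Team B", "Team B vs Team A"))
print(nrow(both_innings))
```

This is worth remembering whenever a fixture is filtered by its label later on: both orderings of the two team names are needed to get the full match.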
Next, let’s create summary stats for players in each game:
# Batting player summary
batting_player_summary <- bbl_mens_data %>%
group_by(match_id, match_teams, striker) %>%
summarize(
total_runs = sum(runs_off_bat, na.rm = TRUE),
boundaries = sum(if_else(runs_off_bat >= 4 & runs_off_bat < 6, 1, 0), na.rm = TRUE),
sixes = sum(if_else(runs_off_bat == 6, 1, 0), na.rm = TRUE),
total_balls_faced = n(),
method_of_dismissal = ifelse(all(wicket_type == ""), "not out", paste(unique(wicket_type[wicket_type != ""]), collapse = ", ")), # "not out" if no wicket fell, otherwise the wicket type(s)
.groups = "drop"
) %>%
rename(player = striker) %>%
mutate(role = "batsman",
strike_rate = (total_runs / total_balls_faced) * 100, # Calculate strike rate
team = gsub(" vs .*", "", match_teams) # Extract the first team using gsub
)
# Bowling player summary
bowling_player_summary <- bbl_mens_data %>%
group_by(match_id, match_teams, bowler) %>%
summarize(
total_wickets = sum(wicket, na.rm = TRUE),
total_balls_bowled = n(), # Count total deliveries bowled
total_runs_conceded = sum(runs_off_bat + extras, na.rm = TRUE), # Calculate total runs conceded
caught = sum(wicket_type == "caught", na.rm = TRUE),
run_out = sum(wicket_type == "run out", na.rm = TRUE),
bowled = sum(wicket_type == "bowled", na.rm = TRUE),
stumped = sum(wicket_type == "stumped", na.rm = TRUE),
lbw = sum(wicket_type == "lbw", na.rm = TRUE),
caught_and_bowled = sum(wicket_type == "caught and bowled", na.rm = TRUE),
hit_wicket = sum(wicket_type == "hit wicket", na.rm = TRUE),
obstructing_the_field = sum(wicket_type == "obstructing the field", na.rm = TRUE),
.groups = "drop"
) %>%
rename(player = bowler) %>%
mutate(role = "bowler",
total_overs_bowled = total_balls_bowled / 6, # Convert balls to overs
economy_rate = total_runs_conceded / total_overs_bowled, # Calculate economy rate
team = gsub(".* vs ", "", match_teams) # Extract the second team
)
For batters, the batting_player_summary
includes key metrics such as total runs scored, the number of fours
(the boundaries column) and sixes hit, and the total balls faced, which together provide
insight into scoring efficiency. We have also added a
method_of_dismissal column that shows “not out”
or the type of wicket when applicable. The strike rate (runs scored per
100 balls faced) has also been calculated to show how quickly each batter
scored in a given match.
On the bowling side, the
bowling_player_summary captures essential statistics such
as the total wickets taken, total balls bowled, and total runs conceded,
allowing for an evaluation of a bowler’s effectiveness. We have also
included detailed wicket type counts to show how
wickets were taken. The summary also includes the total overs
bowled and the economy rate (runs conceded per over), which helps gauge the bowler’s ability to
restrict scoring.
Together, these metrics provide an overview of player performances in the matches from the 2023/24 season of the BBL.
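To make the two rate calculations concrete, here is a small worked example on made-up deliveries. The tibble `balls` and the player names are hypothetical; the formulas mirror the ones used in the summaries above.

```r
library(dplyr)

# Made-up ball-by-ball rows for one batter facing one bowler (not real data)
balls <- tibble(
  striker      = "Batter X",
  bowler       = "Bowler Y",
  runs_off_bat = c(0L, 4L, 1L, 6L, 0L, 1L),
  extras       = c(0L, 0L, 0L, 0L, 1L, 0L)
)

# Strike rate: runs per 100 balls faced
batting <- balls %>%
  group_by(striker) %>%
  summarize(
    total_runs        = sum(runs_off_bat),
    total_balls_faced = n(),
    .groups = "drop"
  ) %>%
  mutate(strike_rate = total_runs / total_balls_faced * 100)

# Economy rate: runs conceded per over (6 balls)
bowling <- balls %>%
  group_by(bowler) %>%
  summarize(
    total_runs_conceded = sum(runs_off_bat + extras),
    total_balls_bowled  = n(),
    .groups = "drop"
  ) %>%
  mutate(economy_rate = total_runs_conceded / (total_balls_bowled / 6))

print(batting)  # 12 runs off 6 balls: strike rate 200
print(bowling)  # 13 runs in 1 over: economy rate 13
```

As in the summaries above, n() counts every row, so any wides or no-balls in the data are included in the ball counts.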
Let’s use geom_point to plot runs scored vs balls
faced for the batters in the Melbourne Stars vs Renegades game:
# Filter for the specific match (Melbourne Stars vs Renegades)
stars_vs_renegades <- batting_player_summary %>%
filter(match_teams %in% c("Melbourne Stars vs Melbourne Renegades", "Melbourne Renegades vs Melbourne Stars")) # %in% keeps both innings of the fixture
# Create the scatter plot
ggplot(stars_vs_renegades,
aes(x = total_balls_faced, y = total_runs,
color = team)) +
geom_point(size = 1) + # Points for each player
geom_text(aes(label = player), vjust = -1, hjust = 0.5, size = 4) + # Add player names as labels
labs(
title = "Runs Scored vs Balls Faced: Melbourne Stars vs Renegades",
x = "Total Balls Faced",
y = "Total Runs Scored"
) +
scale_color_manual(values = c("red", "green")) + # Specify colors for teams
theme_light()
We can see that the plot doesn’t look very nice and is quite
cluttered.
Let’s use a facet wrap:
# Create the scatter plot with facets
ggplot(stars_vs_renegades,
aes(x = total_balls_faced, y = total_runs)) +
geom_point(size = 3) + # Increase point size for better visibility
geom_text_repel(aes(label = player), vjust = -1, hjust = 0.5, size = 4) + # Add player names as labels
labs(
title = "Runs Scored vs Balls Faced: Melbourne Stars vs Renegades",
x = "Total Balls Faced",
y = "Total Runs Scored"
) +
theme_light() + # Use a clean theme for clarity
theme(plot.title = element_text(hjust = 0.5)) + # Center the title
facet_wrap(~ team) # Facet by team
geom_point does an okay job of portraying the data,
but let’s try geom_jitter, which handles overlapping points slightly
better by adding a small random offset to each one. Let’s also try
theme_minimal from ggplot2, which changes how
the plot background looks. A range of built-in themes is available,
and themes can also be customised manually.
ggplot(stars_vs_renegades, aes(x = total_balls_faced, y = total_runs, color = team)) +
geom_jitter(size = 3, width = 0.1, height = 0) +
geom_text_repel(aes(label = player), size = 4) +
labs(
title = "Runs Scored vs Balls Faced: Melbourne Stars vs Renegades",
x = "Total Balls Faced",
y = "Total Runs Scored"
) +
scale_color_manual(values = c("Melbourne Stars" = "lightgreen", "Melbourne Renegades" = "red")) + # Specify colors for teams
theme_minimal()
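One thing to keep in mind when combining geom_jitter with labels is that the points are moved randomly, while the labels are positioned from the original coordinates. A possible way to keep the two aligned, sketched here on made-up numbers, is to give both layers the same position_jitter() object with a fixed seed. The data frame `toy` and its values are illustrative only.

```r
library(ggplot2)
library(ggrepel)

# Made-up batting figures with overlapping points (not real BBL data)
toy <- data.frame(
  player            = c("A", "B", "C", "D"),
  total_balls_faced = c(10, 10, 12, 12),
  total_runs        = c(15, 15, 20, 20)
)

# A shared jitter with a fixed seed gives both layers the same offsets
pos <- position_jitter(width = 0.1, height = 0, seed = 1)

p <- ggplot(toy, aes(x = total_balls_faced, y = total_runs, label = player)) +
  geom_point(position = pos, size = 3) +
  geom_text_repel(position = pos, size = 4) +
  theme_minimal()

print(p)
```

A fixed seed also makes the plot reproducible, so the points land in the same place each time the document is knitted.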
Let’s facet wrap this now:
ggplot(stars_vs_renegades, aes(x = total_balls_faced, y = total_runs, color = team)) +
geom_jitter(size = 3, width = 0.1, height = 0) +
geom_text_repel(aes(label = player), size = 4) +
labs(
title = "Runs Scored vs Balls Faced: Melbourne Stars vs Renegades",
x = "Total Balls Faced",
y = "Total Runs Scored"
) +
scale_color_manual(values = c("Melbourne Stars" = "lightgreen", "Melbourne Renegades" = "red")) + # Specify colors for teams
theme_minimal() +
facet_wrap(~ team, scales = "free") # Free scales for x and y axes
There are many other ways to dig into the batting data, but let’s move on to the bowling data.
This time let’s filter to one matchup and look at only one team.
Let’s look at the Sydney Sixers’ bowling data from both of their games against the Sydney Thunder.
sixers_thunder <- bowling_player_summary %>%
filter(match_teams == "Sydney Thunder vs Sydney Sixers")
str(sixers_thunder)
## tibble [13 × 18] (S3: tbl_df/tbl/data.frame)
## $ match_id : int [1:13] 1386112 1386112 1386112 1386112 1386112 1386112 1386112 1386127 1386127 1386127 ...
## $ match_teams : chr [1:13] "Sydney Thunder vs Sydney Sixers" "Sydney Thunder vs Sydney Sixers" "Sydney Thunder vs Sydney Sixers" "Sydney Thunder vs Sydney Sixers" ...
## $ player : chr [1:13] "BJ Dwarshuis" "J Edwards" "JA Davies" "JM Bird" ...
## $ total_wickets : int [1:13] 1 3 0 2 0 1 0 1 3 1 ...
## $ total_balls_bowled : int [1:13] 24 27 12 24 6 24 6 24 17 15 ...
## $ total_runs_conceded : int [1:13] 24 26 9 29 7 47 9 17 25 17 ...
## $ caught : int [1:13] 1 3 0 2 0 1 0 1 2 1 ...
## $ run_out : int [1:13] 0 0 0 0 0 0 0 0 1 0 ...
## $ bowled : int [1:13] 0 0 0 0 0 0 0 0 0 0 ...
## $ stumped : int [1:13] 0 0 0 0 0 0 0 0 0 0 ...
## $ lbw : int [1:13] 0 0 0 0 0 0 0 0 0 0 ...
## $ caught_and_bowled : int [1:13] 0 0 0 0 0 0 0 0 0 0 ...
## $ hit_wicket : int [1:13] 0 0 0 0 0 0 0 0 0 0 ...
## $ obstructing_the_field: int [1:13] 0 0 0 0 0 0 0 0 0 0 ...
## $ role : chr [1:13] "bowler" "bowler" "bowler" "bowler" ...
## $ total_overs_bowled : num [1:13] 4 4.5 2 4 1 ...
## $ economy_rate : num [1:13] 6 5.78 4.5 7.25 7 ...
## $ team : chr [1:13] "Sydney Sixers" "Sydney Sixers" "Sydney Sixers" "Sydney Sixers" ...
Let’s compare economy rates with wickets taken to see who bowled best:
# Create a bar plot of economy rates vs. total wickets taken
ggplot(sixers_thunder, aes(x = player, y = economy_rate)) +
geom_bar(stat = "identity", fill = "lightpink") + # Use a single fill color for the bars
geom_text(aes(label = total_wickets),
vjust = -0.5, size = 4) + # Add total wickets as labels on top of bars
labs(
title = "Economy Rate vs Total Wickets: Best Bowlers",
x = "Bowlers",
y = "Economy Rate"
) +
theme_minimal() +
coord_flip() + # Flip coordinates for better readability
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Adjust text angle for better visibility
As we can see, some players have two values, which means they bowled in both games. Let’s use facet wrap again, this time by match.
# Create a bar plot of economy rates vs. total wickets taken
ggplot(sixers_thunder, aes(x = player, y = economy_rate)) +
geom_bar(stat = "identity", fill = "lightpink") + # Use a single fill color for the bars
geom_text(aes(label = total_wickets),
vjust = +0.5, hjust = +1, size = 4) + # Add total wickets as labels on top of bars
labs(
title = "Economy Rate vs Total Wickets from the Sixers",
x = "Bowlers",
y = "Economy Rate"
) +
theme_minimal() +
coord_flip() + # Flip coordinates for better readability
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
facet_wrap(~ match_id) # Create separate plots for each match
That looks a lot better.
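The reason the unfaceted version is misleading is that bars sharing the same x value are stacked, so a bowler who appears in both matches gets one bar whose length is the sum of the two economy rates. A small sketch with made-up numbers shows the effect. The data frame `toy` is hypothetical, and geom_col() is used here as the equivalent of geom_bar(stat = "identity").

```r
library(ggplot2)

# Made-up economy rates: Bowler A appears in two matches (not real data)
toy <- data.frame(
  match_id     = c(1, 2, 1),
  player       = c("Bowler A", "Bowler A", "Bowler B"),
  economy_rate = c(6, 8, 7)
)

# Without facets the two Bowler A rows stack into one bar
stacked <- ggplot(toy, aes(x = player, y = economy_rate)) +
  geom_col()

# Faceting by match gives each match its own panel, so nothing stacks
faceted <- stacked + facet_wrap(~ match_id)

# Top of the tallest bar in each version
max(layer_data(stacked)$ymax)  # 14, i.e. 6 + 8 stacked
max(layer_data(faceted)$ymax)  # 8
```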
Let’s now quickly look at some average stats for the bowlers from each team. We’ll use a facet wrap to separate by team and look at the economy rates and wickets taken on average per game.
Let’s begin by generating the averages.
bowling_avg <- bowling_player_summary %>%
group_by(player, team) %>%
summarise(avg_wickets = round(mean(total_wickets), 2),
avg_economy = round(mean(economy_rate), 2)) # Round both averages to 2 decimal places
## `summarise()` has grouped output by 'player'. You can override using the
## `.groups` argument.
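Note that averaging per-match economy rates weights every match equally, however many overs were bowled in each. An alternative, shown here only as a sketch on made-up numbers, is to divide total runs conceded by total overs across all matches. The tibble `toy` and the column name `overall_economy` are illustrative and not part of the analysis above.

```r
library(dplyr)

# Made-up per-match figures for one bowler (not real BBL data)
toy <- tibble(
  player              = "Bowler A",
  total_runs_conceded = c(40, 3),
  total_balls_bowled  = c(24, 6)
) %>%
  mutate(economy_rate = total_runs_conceded / (total_balls_bowled / 6))

comparison <- toy %>%
  group_by(player) %>%
  summarise(
    # Mean of the per-match rates: (10 + 3) / 2
    avg_economy     = round(mean(economy_rate), 2),
    # Runs per over across all matches: 43 runs / 5 overs
    overall_economy = round(sum(total_runs_conceded) / (sum(total_balls_bowled) / 6), 2),
    .groups = "drop"
  )

print(comparison)  # avg_economy 6.5, overall_economy 8.6
```

Which version is more useful depends on the question: the per-match mean describes a typical outing, while the overall figure reflects every over bowled.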
Now that we have these metrics let’s visualize them.
ggplot(bowling_avg, aes(x = player, y = avg_economy)) +
geom_bar(stat = "identity") + # Bar length shows the average economy rate
geom_text(aes(label = avg_wickets),
vjust = +0.5, hjust = +1, size = 4) + # Add average wickets as labels at the end of the bars
labs(
title = "Average Economy Rate and Average Wickets for the Competition",
x = "Bowlers",
y = "Economy Rate"
) +
theme_minimal() +
coord_flip() + # Flip coordinates for better readability
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
facet_wrap(~ team, scales = "free") # Create separate plots for each team
This has pretty poor readability because of the amount of data
squeezed into a small space.
Let’s use geom_jitter with
geom_text_repel instead:
ggplot(bowling_avg, aes(x = avg_economy, y = avg_wickets)) +
geom_jitter(width = 0.1, height = 0, size = 1) + # Jitter points for better separation
geom_text_repel(aes(label = player), vjust = -0.5, hjust = 1, size = 2) + # Add player names as labels
labs(
title = "Average Economy Rate vs Average Wickets for the Competition",
x = "Economy Rate",
y = "Wickets"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
facet_wrap(~ team, ncol = 4, nrow = 3, scales = "free") # One panel per team, each with its own axis scales
This looks a bit more readable. Further exploration at the team level could focus on the types of wickets players take.
We could delve even deeper by examining whether a particular bowler consistently takes the wicket of a specific batter throughout the competition. We could also investigate how economy rates change depending on the opposition, which may suggest that a certain bowler performs better in Supercoach against a specific opponent, or explore how different grounds affect scoring and the types of bowlers taking wickets (spin vs. pace). This document should give you a starting point for the kind of data manipulation and visualization R is capable of.
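As a starting point for the bowler-versus-batter idea, one possible approach is to count dismissals for each pairing of bowler and dismissed player. The sketch below uses made-up rows with the same column names as the ball-by-ball data (bowler, wicket, player_dismissed). The tibble `toy` and the output column `dismissals` are hypothetical.

```r
library(dplyr)

# Made-up deliveries (not real BBL data)
toy <- tibble(
  bowler           = c("Bowler A", "Bowler A", "Bowler A", "Bowler B", "Bowler B"),
  wicket           = c(TRUE, TRUE, FALSE, TRUE, FALSE),
  player_dismissed = c("Batter X", "Batter X", "", "Batter Y", "")
)

# Keep only deliveries where a wicket fell, then count each pairing
matchups <- toy %>%
  filter(wicket) %>%
  count(bowler, player_dismissed, name = "dismissals", sort = TRUE)

print(matchups)
```

Run outs would be counted here too, so on the real data it may be worth filtering on wicket_type first.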
Happy coding!