This Data Dive explores the IPL Player Performance Dataset by examining group‑level patterns, calculating probabilities, and identifying rare combinations.
Grouped summaries of key categorical variables.
Probability‑based interpretation of common vs. rare groups
Identification and tagging of lowest‑probability (anomalous) groups
Exploration of combinations of two categorical variables
Each section includes insights, their significance, and further questions.
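library(tidyverse)  # attaches readr, dplyr, tidyr, stringr, ggplot2 (and lubridate from tidyverse 2.0.0)
ipl_raw <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")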
## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): player, team, match_outcome, opposition_team, venue
## dbl (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note: Data Preparation
>The dataset contains only 5 matches from 2025, an incomplete season that would distort every calculation. To avoid this, all rows from 2025 were filtered out, and the rest of the analysis uses only complete seasons.
IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = lubridate::year(date)  # calendar year of the match, used as the season
  ) |>
  filter(season < 2025)  # drop the incomplete 2025 season
Group the data by venue and compute row counts, total sixes, probabilities, and identify the lowest-probability venue.
df_1 <- IPL |>
  group_by(venue) |>
  summarise(
    row_count = n(),                         # rows (player-innings) per venue
    total_sixes = sum(sixes, na.rm = TRUE)
  ) |>
  mutate(
    short_venue = substr(venue, 1, 20),      # truncated name for compact printing
    prob = row_count / sum(row_count),       # share of all rows
    tag = if_else(prob == min(prob), "lowest_prob", "common")
  ) |>
  arrange(prob)
df_1 |>
select(short_venue, row_count, total_sixes, prob, tag)
## # A tibble: 58 × 5
## short_venue row_count total_sixes prob tag
## <chr> <int> <dbl> <dbl> <chr>
## 1 OUTsurance Oval 43 8 0.00180 lowest_prob
## 2 Dr. Y.S. Rajasekhara 48 45 0.00201 common
## 3 Buffalo Park 63 27 0.00263 common
## 4 De Beers Diamond Ova 65 34 0.00272 common
## 5 Vidarbha Cricket Ass 66 26 0.00276 common
## 6 Barsapara Cricket St 72 32 0.00301 common
## 7 Green Park 86 36 0.00359 common
## 8 Himachal Pradesh Cri 99 75 0.00414 common
## 9 Nehru Stadium 109 39 0.00456 common
## 10 Punjab Cricket Assoc 117 76 0.00489 common
## # ℹ 48 more rows
With \(\text{prob} = 0.00180\), the lowest‑probability venue, “OUTsurance Oval”, accounts for only 0.180% of all player‑innings, indicating that it is rarely used as an IPL venue.
Venue usage is highly uneven: some stadiums account for hundreds of rows, while the eight least-used venues contribute fewer than 100 rows each.
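As a worked check of the probability column, each value is the venue's share of all rows. The total of 23,925 rows used here is not printed directly in the report; it is the sum of the row_count column in the team table further down, which lists all 15 teams.

\[
\text{prob}_{\text{venue}} = \frac{\text{row\_count}_{\text{venue}}}{\sum_{v} \text{row\_count}_{v}},
\qquad
\text{prob}_{\text{OUTsurance Oval}} = \frac{43}{23925} \approx 0.00180
\]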
Hypothesis:
- **Venues with lower row_count values host fewer IPL matches per season.**
- This can be tested by comparing match counts per venue across seasons.
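One possible way to run that comparison is sketched below. It uses a small invented data frame (`toy`) in place of the real data, and assumes only the `match_id`, `venue`, and `season` columns; the summary column names are illustrative.

```r
library(tidyverse)

# Toy stand-in for the IPL data frame (values invented for illustration).
toy <- tibble(
  match_id = c(1, 1, 2, 2, 3, 4, 4, 5),
  venue    = c("A", "A", "A", "A", "B", "A", "A", "B"),
  season   = c(2008, 2008, 2008, 2008, 2008, 2009, 2009, 2009)
)

venue_matches <- toy |>
  distinct(match_id, venue, season) |>         # one row per match
  count(venue, season, name = "matches") |>    # matches per venue per season
  group_by(venue) |>
  summarise(
    seasons_used = n(),                        # seasons in which the venue appears
    matches_per_season = mean(matches)
  )

venue_matches
```

Applied to the real data, joining this summary to the row counts would show whether low-row-count venues also have low matches-per-season values or were simply used in fewer seasons.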
df_1_top5 <- df_1 |>
slice_max(total_sixes, n = 5)
df_1_top5 |>
  ggplot(aes(x = reorder(venue, total_sixes), y = total_sixes)) +
  geom_col(aes(fill = venue), show.legend = FALSE) +
  coord_flip() +
  geom_text(
    aes(label = total_sixes),
    hjust = -0.2,  # bars run horizontally after coord_flip(), so nudge labels to the right
    size = 4
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +  # room for the labels
  labs(
    title = "Top 5 IPL Venues by Total Sixes",
    x = "Venue",
    y = "Total Sixes"
  ) +
  theme_minimal()
The bar chart highlights the five venues with the highest total sixes, showing which stadiums have produced the most big hitting.
The graph suggests that certain venues are more six‑friendly, possibly due to pitch conditions, boundary size, or other batting‑friendly ground characteristics. Because these are raw totals, venues that host more matches also accumulate more sixes, so heavy usage is a competing explanation.
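A rate per row is one way to separate six-friendliness from sheer usage. The sketch below copies three rows from the venue table above (names truncated as printed); `sixes_per_row` is an illustrative column name, not part of the original analysis.

```r
library(tidyverse)

# Three rows copied from the printed venue table.
venue_rates <- tibble(
  short_venue = c("OUTsurance Oval", "Dr. Y.S. Rajasekhara", "Himachal Pradesh Cri"),
  row_count   = c(43, 48, 99),
  total_sixes = c(8, 45, 75)
) |>
  mutate(sixes_per_row = total_sixes / row_count) |>  # sixes per player-innings row
  arrange(desc(sixes_per_row))

venue_rates
```

The first two venues have similar row counts (43 and 48) but very different totals (8 and 45 sixes), which a ranking by raw totals alone would not reveal.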
Group the data by team and compute row counts, total sixes, probabilities, and identify the lowest-probability team.
NOTE: Standardized Team names
>Before grouping by team, team names are standardized to avoid duplication caused by franchise renames and spelling inconsistencies. For example, ‘Delhi Capitals’ and ‘Delhi Daredevils’ are unified as ‘Delhi Daredevils’, and ‘Rising Pune Supergiants’ is corrected to ‘Rising Pune Supergiant’. This ensures accurate row counts and probability calculations.
IPL <- IPL |>
  mutate(team = case_when(
    # The first matching pattern wins; unmatched names are kept unchanged
    str_detect(team, "Delhi") ~ "Delhi Daredevils",
    str_detect(team, "Punjab") ~ "Kings XI Punjab",
    str_detect(team, "Rising Pune") ~ "Rising Pune Supergiant",
    str_detect(team, "Royal Challengers") ~ "Royal Challengers Bangalore",
    TRUE ~ team
  ))
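A quick check of the mapping can be sketched on a handful of names. The input list below is illustrative: ‘Punjab Kings’ in particular is assumed for the example and may not be spelled this way in the dataset. Only three of the four patterns are repeated here.

```r
library(tidyverse)

# Illustrative raw names; not taken from the dataset.
raw_names <- tibble(team = c(
  "Delhi Capitals", "Delhi Daredevils", "Punjab Kings",
  "Kings XI Punjab", "Rising Pune Supergiants", "Mumbai Indians"
))

standardized <- raw_names |>
  mutate(team_std = case_when(
    str_detect(team, "Delhi") ~ "Delhi Daredevils",
    str_detect(team, "Punjab") ~ "Kings XI Punjab",
    str_detect(team, "Rising Pune") ~ "Rising Pune Supergiant",
    TRUE ~ team
  ))

n_distinct(standardized$team)      # distinct names before standardizing
n_distinct(standardized$team_std)  # distinct names after standardizing
```

Comparing the two distinct counts on the real `team` column, before and after the `mutate()`, is a simple way to confirm that the patterns merged the intended variants and nothing else.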
df_2 <- IPL |>
  group_by(team) |>
  summarise(
    row_count = n(),                         # rows (player-innings) per team
    total_sixes = sum(sixes, na.rm = TRUE)
  ) |>
  mutate(
    short_team = substr(team, 1, 20),        # truncated name for compact printing
    prob = row_count / sum(row_count),       # share of all rows
    tag = if_else(prob == min(prob), "lowest_prob", "common")
  ) |>
  arrange(prob)
df_2|>
select(short_team, row_count, total_sixes, prob, tag)
## # A tibble: 15 × 5
## short_team row_count total_sixes prob tag
## <chr> <int> <dbl> <dbl> <chr>
## 1 Kochi Tuskers Kerala 151 53 0.00631 lowest_prob
## 2 Rising Pune Supergia 323 157 0.0135 common
## 3 Gujarat Lions 326 155 0.0136 common
## 4 Pune Warriors 493 196 0.0206 common
## 5 Lucknow Super Giants 508 332 0.0212 common
## 6 Gujarat Titans 512 271 0.0214 common
## 7 Deccan Chargers 806 400 0.0337 common
## 8 Sunrisers Hyderabad 1991 1042 0.0832 common
## 9 Rajasthan Royals 2417 1237 0.101 common
## 10 Chennai Super Kings 2571 1509 0.107 common
## 11 Kings XI Punjab 2703 1515 0.113 common
## 12 Kolkata Knight Rider 2726 1495 0.114 common
## 13 Royal Challengers Ba 2764 1653 0.116 common
## 14 Delhi Daredevils 2770 1351 0.116 common
## 15 Mumbai Indians 2864 1685 0.120 common
With \(\text{prob} = 0.00631\), the lowest‑probability team, “Kochi Tuskers Kerala”, accounts for only 0.631% of all player‑innings, indicating that it took part in far fewer IPL seasons than other franchises.
Teams such as Rising Pune Supergiant, Kochi Tuskers Kerala, and Gujarat Lions have the lowest row counts, which indicates that they were part of only a few IPL seasons.
Hypothesis: Teams with lower row_count values participated in fewer IPL seasons.
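This hypothesis could be checked by counting distinct seasons per team next to the row count. The sketch below uses invented data; `seasons_played` is an illustrative name.

```r
library(tidyverse)

# Toy stand-in for the IPL data frame (values invented for illustration).
toy <- tibble(
  team   = c("Team A", "Team A", "Team A", "Team B", "Team B"),
  season = c(2008, 2008, 2009, 2011, 2011)
)

team_seasons <- toy |>
  group_by(team) |>
  summarise(
    row_count = n(),
    seasons_played = n_distinct(season)  # seasons in which the team appears
  ) |>
  arrange(seasons_played)

team_seasons
```

If the hypothesis holds on the real data, `row_count` and `seasons_played` should rise together across teams.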
df_2 |>
  ggplot(aes(x = reorder(team, total_sixes), y = total_sixes)) +
  geom_col(aes(fill = team), show.legend = FALSE) +
  coord_flip() +
  geom_text(
    aes(label = total_sixes),
    hjust = -0.2,  # bars run horizontally after coord_flip(), so nudge labels to the right
    size = 2.5
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +  # room for the labels
  labs(
    title = "IPL Teams by Total Sixes",
    x = "Team",
    y = "Total Sixes"
  ) +
  theme_minimal()
The bar chart shows the full distribution of six‑hitting across all teams, showing which franchises have accumulated the most sixes.
The wide variation in bar lengths largely reflects the uneven number of seasons played by different teams, especially short-lived teams that existed for only a few seasons.
Group the data by season and compute row counts, total sixes, probabilities, and identify the lowest-probability season.
NOTE: The `season` column was already created during data preparation, so grouping 3 uses it directly to summarize sixes by year.
df_3 <- IPL |>
  group_by(season) |>
  summarise(
    row_count = n(),
    total_sixes = sum(sixes, na.rm = TRUE)
  ) |>
  mutate(
    prob = row_count / sum(row_count),
    tag = if_else(prob == min(prob), "lowest_prob", "common")
  ) |>
  arrange(prob)
df_3
## # A tibble: 17 × 5
## season row_count total_sixes prob tag
## <dbl> <int> <dbl> <dbl> <chr>
## 1 2009 1218 508 0.0509 lowest_prob
## 2 2008 1243 623 0.0520 common
## 3 2015 1269 692 0.0530 common
## 4 2017 1285 706 0.0537 common
## 5 2014 1293 715 0.0540 common
## 6 2010 1294 587 0.0541 common
## 7 2016 1296 639 0.0542 common
## 8 2019 1301 786 0.0544 common
## 9 2018 1303 872 0.0545 common
## 10 2020 1305 736 0.0545 common
## 11 2021 1312 687 0.0548 common
## 12 2011 1549 639 0.0647 common
## 13 2012 1589 733 0.0664 common
## 14 2022 1617 1062 0.0676 common
## 15 2013 1634 681 0.0683 common
## 16 2024 1677 1261 0.0701 common
## 17 2023 1740 1124 0.0727 common
With \(\text{prob} = 0.0509\), the lowest‑probability season, 2009, accounts for only 5.09% of all player‑innings, suggesting that the 2009 season has slightly fewer recorded matches or player appearances than any other season.
Row counts fall into two bands: roughly 1,200 to 1,300 rows for most seasons, and roughly 1,550 to 1,750 rows for 2011–2013 and 2022–2024. The lower band includes the early IPL years (2008–2010), when the league structure and number of teams were still evolving.
Hypothesis: The number of teams has increased in the past few seasons of the league, resulting in more matches per season.
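A per-season count of distinct teams and distinct matches would test this directly. The following is a minimal sketch on invented data, assuming the `match_id`, `team`, and `season` columns; the summary column names are illustrative.

```r
library(tidyverse)

# Toy stand-in for the IPL data frame (values invented for illustration).
toy <- tibble(
  match_id = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  team     = c("A", "B", "A", "B", "A", "B", "A", "C", "B", "C"),
  season   = c(2008, 2008, 2008, 2008, 2022, 2022, 2022, 2022, 2022, 2022)
)

season_summary <- toy |>
  group_by(season) |>
  summarise(
    teams   = n_distinct(team),      # franchises active in the season
    matches = n_distinct(match_id)   # matches recorded in the season
  )

season_summary
```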
df_3_last5 <- df_3 |>
filter(season >= max(season) - 4)
df_3_last5 |>
ggplot() +
geom_col(aes(x = season, y = total_sixes), fill = "skyblue") +
geom_text(
aes(x = season, y = total_sixes, label = total_sixes),
vjust = -0.5,
size = 4
) +
labs(
title = "Total Sixes in the Last 5 IPL Seasons",
x = "Season",
y = "Total Sixes"
) +
theme_minimal()
The bar chart displays the total number of sixes across the most recent five IPL seasons.
Focusing only on the last five seasons keeps the visualization simple and meaningful.
The plot highlights recent batting trends: total sixes dipped from 736 in 2020 to 687 in 2021, then rose each season to a peak of 1,261 in 2024.
Creating all possible Team × Season combinations
This section generates all possible team–season combinations and compares them with the actual match‑level data. Exploring the missing combinations and examining the most and least common ones gives insight into franchise participation patterns across IPL history.
all_combos <- expand_grid(
team = unique(IPL$team),
season = unique(IPL$season)
)
head(all_combos)
## # A tibble: 6 × 2
## team season
## <chr> <dbl>
## 1 Royal Challengers Bangalore 2013
## 2 Royal Challengers Bangalore 2008
## 3 Royal Challengers Bangalore 2011
## 4 Royal Challengers Bangalore 2022
## 5 Royal Challengers Bangalore 2024
## 6 Royal Challengers Bangalore 2016
Counting actual Team × Season combinations in the data
match_counts <- IPL |>
distinct(match_id, team, season) |>
count(team, season, name = "match_count")
head(match_counts)
## # A tibble: 6 × 3
## team season match_count
## <chr> <dbl> <int>
## 1 Chennai Super Kings 2008 16
## 2 Chennai Super Kings 2009 14
## 3 Chennai Super Kings 2010 16
## 4 Chennai Super Kings 2011 16
## 5 Chennai Super Kings 2012 18
## 6 Chennai Super Kings 2013 18
Identify missing combinations
missing_team_season <- all_combos |>
anti_join(match_counts, by = c("team", "season"))
missing_team_season |>
arrange(season)
## # A tibble: 109 × 2
## team season
## <chr> <dbl>
## 1 Lucknow Super Giants 2008
## 2 Gujarat Titans 2008
## 3 Sunrisers Hyderabad 2008
## 4 Pune Warriors 2008
## 5 Gujarat Lions 2008
## 6 Rising Pune Supergiant 2008
## 7 Kochi Tuskers Kerala 2008
## 8 Lucknow Super Giants 2009
## 9 Gujarat Titans 2009
## 10 Sunrisers Hyderabad 2009
## # ℹ 99 more rows
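The counts above can be cross-checked with simple arithmetic: 15 teams × 17 seasons gives 255 possible combinations, and 109 of them are missing, so 255 − 109 = 146 combinations are actually observed. The sketch below reproduces the same `expand_grid()` plus `anti_join()` pattern on invented data to show why that identity holds.

```r
library(tidyverse)

# Toy observed team-season pairs (values invented for illustration).
observed <- tibble(
  team   = c("A", "A", "B"),
  season = c(2008, 2009, 2009)
)

# Every team crossed with every season that appears in the data.
all_pairs <- expand_grid(
  team   = unique(observed$team),
  season = unique(observed$season)
)

# Pairs that are possible but never observed.
missing_pairs <- anti_join(all_pairs, observed, by = c("team", "season"))

nrow(all_pairs) == nrow(observed) + nrow(missing_pairs)
```

The identity relies on `observed` holding one row per team–season pair, which is what `count(team, season)` produces in the analysis.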
Most and least common combinations
match_sorted <- match_counts |>
arrange(desc(match_count))
#most common
head(match_sorted, 10)
## # A tibble: 10 × 3
## team season match_count
## <chr> <dbl> <int>
## 1 Mumbai Indians 2013 19
## 2 Chennai Super Kings 2012 18
## 3 Chennai Super Kings 2013 18
## 4 Delhi Daredevils 2012 18
## 5 Rajasthan Royals 2013 18
## 6 Chennai Super Kings 2015 17
## 7 Chennai Super Kings 2019 17
## 8 Delhi Daredevils 2020 17
## 9 Gujarat Titans 2023 17
## 10 Kings XI Punjab 2014 17
#least common
slice_min(match_counts, match_count, n = 10)
## # A tibble: 73 × 3
## team season match_count
## <chr> <dbl> <int>
## 1 Gujarat Titans 2024 12
## 2 Kolkata Knight Riders 2008 13
## 3 Kolkata Knight Riders 2009 13
## 4 Kolkata Knight Riders 2015 13
## 5 Mumbai Indians 2009 13
## 6 Rajasthan Royals 2009 13
## 7 Rajasthan Royals 2011 13
## 8 Royal Challengers Bangalore 2017 13
## 9 Chennai Super Kings 2009 14
## 10 Chennai Super Kings 2020 14
## # ℹ 63 more rows
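The result has 73 rows rather than 10 because `slice_min()` keeps ties by default (`with_ties = TRUE`): the tenth-smallest value is 14 matches, so every combination with 14 matches is returned. A small sketch on invented data shows the difference.

```r
library(tidyverse)

# Toy counts with a tie at the cutoff (values invented for illustration).
toy <- tibble(
  id          = c("a", "b", "c", "d", "e"),
  match_count = c(12, 13, 13, 13, 14)
)

with_ties <- slice_min(toy, match_count, n = 2)                     # keeps all rows tied at 13
no_ties   <- slice_min(toy, match_count, n = 2, with_ties = FALSE)  # exactly 2 rows

nrow(with_ties)
nrow(no_ties)
```

Whether to keep ties is a judgment call: dropping them gives a fixed-length table, but the rows that survive at the cutoff are then chosen by row order rather than by value.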
Several team–season combinations do not appear in the data because certain franchises were introduced later (e.g., Gujarat Titans, Lucknow Super Giants) or dissolved earlier (e.g., Deccan Chargers, Kochi Tuskers Kerala).
The most common combinations belong mostly to long‑standing teams such as Mumbai Indians and Chennai Super Kings, concentrated in the 2012 and 2013 seasons (18 to 19 matches).
The least common combinations (12 to 13 matches) fall mostly in early seasons (2008, 2009, 2011), but also in 2015, 2017, and 2024; possible causes include teams missing the playoffs, abandoned matches, or format changes.
Missing combinations reveal when teams did not exist or were inactive. Together, these patterns provide a clear view of team participation in the IPL across years.
Further question: Do newly introduced teams show better performance patterns compared to long‑standing franchises?
match_counts |>
ggplot() +
geom_tile(aes(x = factor(season), y = team, fill = match_count)) +
scale_fill_viridis_c() +
labs(
title = "Team Participation Across IPL Seasons (Match-Level)",
x = "Season",
y = "Team",
fill = "Matches Played"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1)
)
The heatmap displays each IPL season along the x‑axis and each team along the y‑axis, with color intensity representing how many matches a team played in that season.
Blank tiles mark missing team–season combinations, which correspond to franchises that had not yet been introduced or had already been discontinued.
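If explicit zeros are preferred over blank tiles, one option is to fill in the missing pairs before plotting. This is an alternative sketch on invented data, not the approach used above; it relies on `tidyr::complete()`.

```r
library(tidyverse)

# Toy match counts with one missing team-season pair (values invented).
toy_counts <- tibble(
  team        = c("A", "A", "B"),
  season      = c(2008, 2009, 2009),
  match_count = c(16L, 14L, 16L)
)

# Add the missing pair (B, 2008) with a count of zero.
filled <- toy_counts |>
  complete(team, season, fill = list(match_count = 0L))

filled
```

With zeros filled in, every tile is drawn, so absence shows up as the lowest color on the scale instead of a gap.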