Introduction

This Data Dive explores the IPL Player Performance Dataset by examining group‑level patterns, calculating probabilities, and identifying rare combinations.

-   Grouped summaries of key categorical variables
-   Probability‑based interpretation of common vs. rare groups
-   Identification and tagging of lowest‑probability (anomalous) groups
-   Exploration of combinations of two categorical variables

Each section includes insights, their significance, and further questions.

library(tidyverse)  # loads readr, dplyr, ggplot2, stringr; lubridate too in tidyverse >= 2.0.0
ipl_raw <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")

## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): player, team, match_outcome, opposition_team, venue
## dbl  (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note: Data Preparation

>The dataset includes only 5 matches from 2025, an incomplete season that would distort the calculations. To avoid this, all 2025 rows are filtered out, and the rest of the analysis uses a clean dataset containing complete seasons only.

IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = year(date)
  ) |>
  filter(season < 2025)
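
One way to confirm that a season is incomplete before dropping it is to count distinct matches per season. The sketch below uses a small made-up table (`toy_raw` and its values are hypothetical, not taken from the dataset) and assumes, as in the report, that each row is one player‑innings with a `match_id` and a `date`.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for ipl_raw: one row per player-innings (made-up values)
toy_raw <- tibble(
  match_id = c(1, 1, 2, 2, 3, 3, 4),
  date     = as.Date(c("2023-04-01", "2023-04-01", "2023-04-02", "2023-04-02",
                       "2024-03-22", "2024-03-22", "2025-03-22"))
)

# Collapse to one row per match, then count matches in each season
matches_per_season <- toy_raw |>
  mutate(season = year(date)) |>
  distinct(match_id, season) |>
  count(season, name = "n_matches")

matches_per_season
```

A season whose `n_matches` is far below the others is a candidate for exclusion, which is the reasoning applied to 2025 here.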

Grouped summaries of key categorical variables

Group 1: Venue-Level Analysis

Group the data by venue, compute row counts, total sixes, and probabilities, and identify the lowest-probability venue.

df_1 <- IPL |>
  group_by(venue) |>
  summarise(
    row_count   = n(),
    total_sixes = sum(sixes, na.rm = TRUE)
  ) |>
  mutate(
    short_venue = substr(venue, 1, 20),       # truncate long names for display
    prob        = row_count / sum(row_count), # share of all player-innings rows
    tag         = if_else(prob == min(prob), "lowest_prob", "common")
  ) |>
  arrange(prob)

df_1 |>
  select(short_venue, row_count, total_sixes, prob, tag)

## # A tibble: 58 × 5
##    short_venue          row_count total_sixes    prob tag        
##    <chr>                    <int>       <dbl>   <dbl> <chr>      
##  1 OUTsurance Oval             43           8 0.00180 lowest_prob
##  2 Dr. Y.S. Rajasekhara        48          45 0.00201 common     
##  3 Buffalo Park                63          27 0.00263 common     
##  4 De Beers Diamond Ova        65          34 0.00272 common     
##  5 Vidarbha Cricket Ass        66          26 0.00276 common     
##  6 Barsapara Cricket St        72          32 0.00301 common     
##  7 Green Park                  86          36 0.00359 common     
##  8 Himachal Pradesh Cri        99          75 0.00414 common     
##  9 Nehru Stadium              109          39 0.00456 common     
## 10 Punjab Cricket Assoc       117          76 0.00489 common     
## # ℹ 48 more rows

With \(\text{prob}=0.00180\), the lowest‑probability venue, “OUTsurance Oval”, accounts for only 0.180% of all player‑innings, indicating that it is rarely used as an IPL venue.
Venue usage is highly uneven: some stadiums host hundreds of innings, while the least-used venues account for only a few dozen rows.
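
As a worked check of the probability column: each probability is a group's share of all rows. The total of 23,925 rows is not printed in the venue output; it is obtained here by summing the row_count column of the team table later in this report, on the assumption that both tables are built from the same filtered data.

\[
\text{prob}_g = \frac{n_g}{\sum_h n_h},
\qquad
\text{prob}_{\text{OUTsurance Oval}} = \frac{43}{23925} \approx 0.00180
\]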
Hypothesis:

-   **Venues with lower row_count values host fewer IPL matches per season.**

-   This can be tested by comparing match counts per venue across seasons.
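
A minimal sketch of how this hypothesis could be checked, assuming `match_id` identifies a match uniquely. The table `toy` and its values are made up for illustration; on the real data, `IPL` would take its place.

```r
library(dplyr)

# Hypothetical player-innings rows; column names mirror the report's
toy <- tibble(
  match_id = c(1, 1, 2, 2, 3, 3, 4, 4),
  venue    = c("A", "A", "A", "A", "B", "B", "A", "A"),
  season   = c(2023, 2023, 2023, 2023, 2023, 2023, 2024, 2024)
)

venue_matches <- toy |>
  distinct(match_id, venue, season) |>          # one row per match
  count(venue, season, name = "n_matches") |>   # matches per venue-season
  group_by(venue) |>
  summarise(
    seasons_used       = n(),                   # seasons in which the venue appears
    matches_per_season = mean(n_matches)
  )

venue_matches
```

Comparing `matches_per_season` against `row_count` from `df_1` would show whether low row counts come from fewer matches per season, fewer seasons of use, or both.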

df_1_top5 <- df_1 |>
  slice_max(total_sixes, n = 5)

df_1_top5 |>
  ggplot(aes(x = reorder(venue, total_sixes), y = total_sixes)) +
  geom_col(aes(fill = venue), show.legend = FALSE) +
  # Bars are horizontal after coord_flip(), so nudge labels with hjust
  geom_text(aes(label = total_sixes), hjust = -0.2, size = 4) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +  # room for labels
  coord_flip() +
  labs(
    title = "Top 5 IPL Venues by Total Sixes",
    x = "Venue",
    y = "Total Sixes"
  ) +
  theme_minimal()

The bar chart highlights the five venues with the highest total sixes, showing which stadiums have seen the most big hitting.
The graph suggests that certain venues are more six‑friendly, possibly due to pitch conditions, boundary size, or other batting‑friendly ground characteristics, although raw totals also reflect how many innings each venue has hosted.
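
Because total sixes grow with the number of innings a venue hosts, a per‑row rate may be a fairer indicator of six‑friendliness. The sketch below is one possible normalization, not part of the original analysis; it reuses the three rows printed at the top of the venue table above (with the truncated names shown there), and the column name `sixes_per_row` is an illustrative choice.

```r
library(dplyr)

# First three rows of the venue summary as printed in this report
venues <- tibble(
  venue       = c("OUTsurance Oval", "Dr. Y.S. Rajasekhara", "Buffalo Park"),
  row_count   = c(43, 48, 63),
  total_sixes = c(8, 45, 27)
)

# Sixes per player-innings row, highest rate first
venue_rates <- venues |>
  mutate(sixes_per_row = total_sixes / row_count) |>
  arrange(desc(sixes_per_row))

venue_rates
```

Even among these rarely used venues the rate varies widely, so a ranking by rate can differ from a ranking by raw totals.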

Group 2: Team-Level Analysis

Group the data by team and compute row counts, total sixes, probabilities, and identify the lowest-probability team.

NOTE: Standardized Team names

>Before grouping by team, team names are standardized to avoid duplication caused by franchise renames and spelling inconsistencies. For example, ‘Delhi Capitals’ and ‘Delhi Daredevils’ are unified as ‘Delhi Daredevils’, and ‘Rising Pune Supergiants’ is corrected to ‘Rising Pune Supergiant’; names containing ‘Punjab’ or ‘Royal Challengers’ are unified in the same way. This ensures accurate row counts and probability calculations.

IPL <- IPL |>
  mutate(team = case_when(
    str_detect(team, "Delhi") ~ "Delhi Daredevils",
    str_detect(team, "Punjab") ~ "Kings XI Punjab",
    str_detect(team, "Rising Pune") ~ "Rising Pune Supergiant",
    str_detect(team, "Royal Challengers") ~ "Royal Challengers Bangalore",
    TRUE ~ team
  ))
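
The effect of these rules can be checked on a handful of names. In the sketch below the input names are illustrative only (it is an assumption that variants such as "Punjab Kings" or "Royal Challengers Bengaluru" occur in the dataset), and `team_std` is a hypothetical column name used so the original and standardized values can be compared side by side.

```r
library(dplyr)
library(stringr)

# Illustrative team-name variants; not confirmed to be the dataset's exact values
teams <- tibble(team = c(
  "Delhi Capitals", "Delhi Daredevils", "Punjab Kings",
  "Rising Pune Supergiants", "Pune Warriors",
  "Royal Challengers Bengaluru", "Mumbai Indians"
))

standardized <- teams |>
  mutate(team_std = case_when(
    str_detect(team, "Delhi")             ~ "Delhi Daredevils",
    str_detect(team, "Punjab")            ~ "Kings XI Punjab",
    str_detect(team, "Rising Pune")       ~ "Rising Pune Supergiant",
    str_detect(team, "Royal Challengers") ~ "Royal Challengers Bangalore",
    TRUE ~ team                           # leave all other names unchanged
  ))

count(standardized, team_std)
```

Matching on "Rising Pune" rather than "Pune" keeps Pune Warriors as a separate franchise, which is why it still appears as its own team in the summary below.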

df_2 <- IPL |>
  group_by(team) |>
  summarise(
    row_count   = n(),
    total_sixes = sum(sixes, na.rm = TRUE)
  ) |>
  mutate(
    short_team = substr(team, 1, 20),
    prob       = row_count / sum(row_count),
    tag        = if_else(prob == min(prob), "lowest_prob", "common")
  ) |>
  arrange(prob)

df_2 |>
  select(short_team, row_count, total_sixes, prob, tag)

## # A tibble: 15 × 5
##    short_team           row_count total_sixes    prob tag        
##    <chr>                    <int>       <dbl>   <dbl> <chr>      
##  1 Kochi Tuskers Kerala       151          53 0.00631 lowest_prob
##  2 Rising Pune Supergia       323         157 0.0135  common     
##  3 Gujarat Lions              326         155 0.0136  common     
##  4 Pune Warriors              493         196 0.0206  common     
##  5 Lucknow Super Giants       508         332 0.0212  common     
##  6 Gujarat Titans             512         271 0.0214  common     
##  7 Deccan Chargers            806         400 0.0337  common     
##  8 Sunrisers Hyderabad       1991        1042 0.0832  common     
##  9 Rajasthan Royals          2417        1237 0.101   common     
## 10 Chennai Super Kings       2571        1509 0.107   common     
## 11 Kings XI Punjab           2703        1515 0.113   common     
## 12 Kolkata Knight Rider      2726        1495 0.114   common     
## 13 Royal Challengers Ba      2764        1653 0.116   common     
## 14 Delhi Daredevils          2770        1351 0.116   common     
## 15 Mumbai Indians            2864        1685 0.120   common

With \(\text{prob}=0.00631\), the lowest‑probability team, “Kochi Tuskers Kerala”, accounts for only 0.631% of all player‑innings, suggesting that it took part in far fewer IPL seasons than other franchises.
Teams such as Rising Pune Supergiant, Kochi Tuskers Kerala, and Gujarat Lions appear only briefly, which suggests they were part of only a few IPL seasons.
Hypothesis: Teams with lower row_count values participated in fewer IPL seasons.
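
A minimal sketch of how this hypothesis could be examined: count the distinct seasons in which each team appears and set that beside its row count. The table `toy` is made up for illustration; on the real data, `IPL` would be used instead.

```r
library(dplyr)

# Hypothetical player-innings rows for two teams
toy <- tibble(
  team   = c("X", "X", "X", "X", "Y", "Y"),
  season = c(2011, 2011, 2012, 2013, 2011, 2011)
)

team_summary <- toy |>
  group_by(team) |>
  summarise(
    row_count = n(),
    n_seasons = n_distinct(season)   # seasons in which the team appears
  ) |>
  mutate(rows_per_season = row_count / n_seasons)

team_summary
```

If the hypothesis holds, `rows_per_season` should be roughly similar across teams while `n_seasons` explains most of the difference in `row_count`.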

df_2 |>
  ggplot(aes(x = reorder(team, total_sixes), y = total_sixes)) +
  geom_col(aes(fill = team), show.legend = FALSE) +
  # Bars are horizontal after coord_flip(), so nudge labels with hjust
  geom_text(aes(label = total_sixes), hjust = -0.2, size = 2.5) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) +  # room for labels
  coord_flip() +
  labs(
    title = "IPL Teams by Total Sixes",
    x = "Team",
    y = "Total Sixes"
  ) +
  theme_minimal()

The bar chart shows the full distribution of six‑hitting across all teams, showing which franchises have accumulated the most sixes.
The wide variation in bar lengths largely reflects the uneven number of seasons played by different teams, most visibly for temporary teams that existed for only one or two seasons.

Group 3: Year (Season) Analysis

Group the data by season, compute row counts, total sixes, and probabilities, and identify the lowest-probability season.

NOTE: The season column was already created during data preparation, so Group 3 uses it directly to summarize sixes by year.

df_3 <- IPL |>
  group_by(season) |>
  summarise(
    row_count   = n(),
    total_sixes = sum(sixes, na.rm = TRUE)
  ) |>
  mutate(
    prob = row_count / sum(row_count),
    tag  = if_else(prob == min(prob), "lowest_prob", "common")
  ) |>
  arrange(prob)

df_3

## # A tibble: 17 × 5
##    season row_count total_sixes   prob tag        
##     <dbl>     <int>       <dbl>  <dbl> <chr>      
##  1   2009      1218         508 0.0509 lowest_prob
##  2   2008      1243         623 0.0520 common     
##  3   2015      1269         692 0.0530 common     
##  4   2017      1285         706 0.0537 common     
##  5   2014      1293         715 0.0540 common     
##  6   2010      1294         587 0.0541 common     
##  7   2016      1296         639 0.0542 common     
##  8   2019      1301         786 0.0544 common     
##  9   2018      1303         872 0.0545 common     
## 10   2020      1305         736 0.0545 common     
## 11   2021      1312         687 0.0548 common     
## 12   2011      1549         639 0.0647 common     
## 13   2012      1589         733 0.0664 common     
## 14   2022      1617        1062 0.0676 common     
## 15   2013      1634         681 0.0683 common     
## 16   2024      1677        1261 0.0701 common     
## 17   2023      1740        1124 0.0727 common

With \(\text{prob}=0.0509\), the lowest‑probability season, 2009, accounts for 5.09% of all player‑innings, somewhat below the roughly 5.9% (1/17) each season would hold under an even split, suggesting that 2009 has the fewest recorded innings of any season.
Seasons with lower row_count values typically correspond to shortened tournaments or early IPL years, when the league structure and number of teams were still evolving.

Hypothesis: The number of teams has increased in recent seasons, resulting in more matches.
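
This hypothesis could be checked by counting distinct teams and distinct matches in each season. The sketch below uses made-up rows (`toy` is hypothetical) and assumes `match_id` is unique per match.

```r
library(dplyr)

# Hypothetical rows: two teams in 2021, three teams in 2022
toy <- tibble(
  match_id = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  team     = c("A", "B", "A", "B", "A", "B", "B", "C", "A", "C"),
  season   = c(2021, 2021, 2021, 2021, 2022, 2022, 2022, 2022, 2022, 2022)
)

season_summary <- toy |>
  group_by(season) |>
  summarise(
    n_teams   = n_distinct(team),
    n_matches = n_distinct(match_id)
  )

season_summary
```

Seasons where `n_teams` rises should also show a rise in `n_matches` and therefore in row_count, if the hypothesis is correct.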

df_3_last5 <- df_3 |>
  filter(season >= max(season) - 4)

df_3_last5 |>
  ggplot() +
  geom_col(aes(x = season, y = total_sixes), fill = "skyblue") +
  geom_text(
    aes(x = season, y = total_sixes, label = total_sixes),
    vjust = -0.5,
    size = 4
  ) +
  labs(
    title = "Total Sixes in the Last 5 IPL Seasons",
    x = "Season",
    y = "Total Sixes"
  ) +
  theme_minimal()

The bar chart displays the total number of sixes in each of the five most recent IPL seasons.
Focusing only on the last five seasons keeps the visualization simple and meaningful.

The plot highlights recent batting trends: total sixes rose from 736 in 2020 to 1,261 in 2024, an overall upward trend despite a dip to 687 in 2021.

Combinations of two categorical variables

Creating all possible Team × Season combinations

This section generates all possible team–season combinations and compares them with the actual match‑level data. Identifying the missing combinations and examining the most and least common ones gives insight into franchise participation patterns across IPL history.

all_combos <- expand_grid(
  team   = unique(IPL$team),
  season = unique(IPL$season)
)
head(all_combos)

## # A tibble: 6 × 2
##   team                        season
##   <chr>                        <dbl>
## 1 Royal Challengers Bangalore   2013
## 2 Royal Challengers Bangalore   2008
## 3 Royal Challengers Bangalore   2011
## 4 Royal Challengers Bangalore   2022
## 5 Royal Challengers Bangalore   2024
## 6 Royal Challengers Bangalore   2016

Counting actual Team × Season combinations in the data

match_counts <- IPL |>
  distinct(match_id, team, season) |>
  count(team, season, name = "match_count")
head(match_counts)

## # A tibble: 6 × 3
##   team                season match_count
##   <chr>                <dbl>       <int>
## 1 Chennai Super Kings   2008          16
## 2 Chennai Super Kings   2009          14
## 3 Chennai Super Kings   2010          16
## 4 Chennai Super Kings   2011          16
## 5 Chennai Super Kings   2012          18
## 6 Chennai Super Kings   2013          18

Identify missing combinations

missing_team_season <- all_combos |>
  anti_join(match_counts, by = c("team", "season"))
missing_team_season |> 
  arrange(season)

## # A tibble: 109 × 2
##    team                   season
##    <chr>                   <dbl>
##  1 Lucknow Super Giants     2008
##  2 Gujarat Titans           2008
##  3 Sunrisers Hyderabad      2008
##  4 Pune Warriors            2008
##  5 Gujarat Lions            2008
##  6 Rising Pune Supergiant   2008
##  7 Kochi Tuskers Kerala     2008
##  8 Lucknow Super Giants     2009
##  9 Gujarat Titans           2009
## 10 Sunrisers Hyderabad      2009
## # ℹ 99 more rows
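
The same grid-and-anti-join pattern can be seen on a tiny made-up example (the tables `observed`, `all_pairs`, and `missing_pairs` are hypothetical names for illustration). In the real data, 15 teams and 17 seasons give 15 × 17 = 255 possible pairs, so the 109 missing pairs reported above imply 146 observed ones.

```r
library(dplyr)
library(tidyr)

# Hypothetical observed team-season pairs
observed <- tibble(
  team   = c("A", "A", "B"),
  season = c(2022, 2023, 2023)
)

# Every possible pairing of the observed teams and seasons
all_pairs <- expand_grid(
  team   = unique(observed$team),
  season = unique(observed$season)
)

# Pairs in the full grid that never occur in the observed data
missing_pairs <- anti_join(all_pairs, observed, by = c("team", "season"))

missing_pairs
```

Here the grid has four pairs and only team B in 2022 is missing.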

Most and least common combinations

match_sorted <- match_counts |> 
  arrange(desc(match_count))

# most common
head(match_sorted, 10)

## # A tibble: 10 × 3
##    team                season match_count
##    <chr>                <dbl>       <int>
##  1 Mumbai Indians        2013          19
##  2 Chennai Super Kings   2012          18
##  3 Chennai Super Kings   2013          18
##  4 Delhi Daredevils      2012          18
##  5 Rajasthan Royals      2013          18
##  6 Chennai Super Kings   2015          17
##  7 Chennai Super Kings   2019          17
##  8 Delhi Daredevils      2020          17
##  9 Gujarat Titans        2023          17
## 10 Kings XI Punjab       2014          17

# least common (slice_min() keeps ties at the cutoff, so more than 10 rows are returned)
slice_min(match_counts, match_count, n = 10)

## # A tibble: 73 × 3
##    team                        season match_count
##    <chr>                        <dbl>       <int>
##  1 Gujarat Titans                2024          12
##  2 Kolkata Knight Riders         2008          13
##  3 Kolkata Knight Riders         2009          13
##  4 Kolkata Knight Riders         2015          13
##  5 Mumbai Indians                2009          13
##  6 Rajasthan Royals              2009          13
##  7 Rajasthan Royals              2011          13
##  8 Royal Challengers Bangalore   2017          13
##  9 Chennai Super Kings           2009          14
## 10 Chennai Super Kings           2020          14
## # ℹ 63 more rows
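
The output above has 73 rows even though `n = 10` was requested, because `slice_min()` keeps ties by default: every combination tied with the tenth-smallest value (14 matches) is included. The sketch below shows the difference on made-up counts (`toy_counts` is hypothetical).

```r
library(dplyr)

# Hypothetical match counts with a three-way tie at 13
toy_counts <- tibble(
  team        = c("A", "B", "C", "D", "E"),
  match_count = c(12, 13, 13, 13, 14)
)

# Default behaviour: all rows tied with the 2nd-smallest value are kept
with_ties <- slice_min(toy_counts, match_count, n = 2)

# with_ties = FALSE returns exactly n rows
without_ties <- slice_min(toy_counts, match_count, n = 2, with_ties = FALSE)

nrow(with_ties)
nrow(without_ties)
```

Keeping ties is arguably the safer default here, since cutting at exactly 10 rows would pick arbitrarily among combinations with the same match count.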

Several team–season combinations do not appear in the data because certain franchises were introduced later (e.g., Gujarat Titans, Lucknow Super Giants) or dissolved earlier (e.g., Deccan Chargers, Kochi Tuskers Kerala).
The most common combinations mostly involve long‑standing teams such as Mumbai Indians, Chennai Super Kings, and Delhi Daredevils, and the top five all fall in the 2012 and 2013 seasons.
The least common combinations cluster in early IPL seasons (2008–2011) but also include later ones, such as Kolkata Knight Riders in 2015, Royal Challengers Bangalore in 2017, and Gujarat Titans in 2024 (the minimum, at 12 matches); these may reflect format changes or matches missing from the data.
Missing combinations reveal when teams did not exist or were inactive. Together, these patterns provide a clear view of team participation in the IPL across years.
Further question: Do newly introduced teams show better performance patterns compared to long‑standing franchises?

match_counts |>
  ggplot() +
  geom_tile(aes(x = factor(season), y = team, fill = match_count)) +
  scale_fill_viridis_c() +
  labs(
    title = "Team Participation Across IPL Seasons (Match-Level)",
    x = "Season",
    y = "Team",
    fill = "Matches Played"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

The heatmap displays each IPL season along the x‑axis and each team along the y‑axis, with color intensity representing how many matches a team played in that season.
Blank tiles mark missing team–season combinations, which correspond to franchises that had not yet been introduced or had been discontinued.

H510_Week3_Datadive

Mayank Gupta

2026-02-02