Introduction

This data dive explores the IPL Player Performance Dataset by running summary statistics, generating visualizations, and extracting actionable insights. It covers:

Numeric summaries of key performance variables
Categorical summaries of players and venues
Three novel analytical questions
Visualizations that highlight trends, correlations, and interactions

Each section includes an insight, its significance, and further questions.

library(tidyverse)  # provides read_csv(), the dplyr verbs, and ggplot2

ipl <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")

## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): player, team, match_outcome, opposition_team, venue
## dbl  (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ipl |>
  glimpse()

## Rows: 24,044
## Columns: 22
## $ match_id        <dbl> 598027, 335982, 501244, 1304112, 1426269, 980987, 7339…
## $ player          <chr> "CH Gayle", "BB McCullum", "CH Gayle", "Q de Kock", "S…
## $ team            <chr> "Royal Challengers Bangalore", "Kolkata Knight Riders"…
## $ runs            <dbl> 175, 158, 107, 140, 109, 129, 83, 79, 89, 133, 75, 128…
## $ balls_faced     <dbl> 69, 77, 49, 71, 64, 53, 42, 35, 49, 61, 49, 62, 65, 58…
## $ fours           <dbl> 13, 10, 10, 10, 13, 10, 7, 3, 9, 19, 8, 7, 16, 11, 15,…
## $ sixes           <dbl> 17, 13, 9, 10, 6, 12, 7, 10, 6, 4, 5, 13, 8, 5, 7, 8, …
## $ wickets         <dbl> 1, 0, 3, 0, 2, 0, 4, 4, 3, 0, 4, 0, 0, 2, 0, 0, 0, 0, …
## $ overs_bowled    <dbl> 1, 0, 4, 0, 4, 0, 4, 4, 4, 0, 4, 0, 0, 4, 0, 0, 0, 0, …
## $ balls_bowled    <dbl> 6, 0, 24, 0, 26, 0, 24, 24, 25, 0, 25, 0, 0, 24, 0, 0,…
## $ runs_conceded   <dbl> 5, 0, 21, 0, 27, 0, 35, 30, 17, 0, 28, 0, 0, 25, 0, 0,…
## $ catches         <dbl> 0, 1, 0, 1, 1, 2, 0, 0, 1, 1, 0, 0, 0, 2, 0, 2, 0, 0, …
## $ run_outs        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ maiden          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stumps          <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ match_outcome   <chr> "win", "win", "win", "win", "loss", "win", "loss", "lo…
## $ opposition_team <chr> "Pune Warriors", "Royal Challengers Bangalore", "Kings…
## $ strike_rate     <dbl> 253.62, 205.19, 218.37, 197.18, 170.31, 243.40, 197.62…
## $ economy         <dbl> 5.00, 0.00, 5.25, 0.00, 6.23, 0.00, 8.75, 7.50, 4.08, …
## $ fantasy_points  <dbl> 386, 314, 306, 286, 285, 283, 283, 281, 270, 267, 267,…
## $ venue           <chr> "M Chinnaswamy Stadium", "M Chinnaswamy Stadium", "M C…
## $ date            <date> 2013-04-23, 2008-04-18, 2011-05-06, 2022-05-18, 2024-…

Numeric summary

Runs

This table summarizes the distribution of runs scored across all observations.

ipl |>
  summarise(
    min = min(runs, na.rm = TRUE),
    q1 = quantile(runs, 0.25, na.rm = TRUE),
    median = median(runs, na.rm = TRUE),
    mean = mean(runs, na.rm = TRUE),
    q3 = quantile(runs, 0.75, na.rm = TRUE),
    max = max(runs, na.rm = TRUE)
  )

## # A tibble: 1 × 6
##     min    q1 median  mean    q3   max
##   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1     0     0      5  13.8    21   175

Runs are heavily right-skewed: the mean (13.8) is well above the median (5). Because both the minimum and q1 are zero, at least a quarter of observations record no runs at all, so most innings produce low scores or ducks. Scores of 100+ runs are rare, standout performances.
This reflects typical T20 batting behavior and helps classify batters by their scoring efficiency.

We can further analyse which players score consistently and produce frequent high scores.
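One way to approach the consistency question is to compute, per player, the share of innings at or above a chosen score. The sketch below uses a small made-up table rather than `ipl`, and the 30-run cut-off (`high_score`) is an assumption chosen only for illustration.

```r
library(dplyr)

# Toy data, made up for illustration; the real analysis would start from `ipl`.
toy <- tribble(
  ~player, ~runs,
  "A",     45,
  "A",     60,
  "A",      5,
  "A",     32,
  "B",      0,
  "B",    102,
  "B",      3,
  "B",      8
)

high_score <- 30  # hypothetical cut-off for a "high" score

consistency <- toy |>
  group_by(player) |>
  summarise(
    innings = n(),
    median_runs = median(runs, na.rm = TRUE),
    # mean() of a logical vector gives the proportion of TRUE values
    high_score_share = mean(runs >= high_score, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(high_score_share))

consistency
```

In this toy table, player B has the single biggest innings but player A reaches the cut-off far more often, which is the distinction between a one-off score and consistency.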

Wickets

This table summarizes the distribution of wickets taken across all observations.

ipl |>
  summarise(
    min = min(wickets, na.rm = TRUE),
    q1 = quantile(wickets, 0.25, na.rm = TRUE),
    median = median(wickets, na.rm = TRUE),
    mean = mean(wickets, na.rm = TRUE),
    q3 = quantile(wickets, 0.75, na.rm = TRUE),
    max = max(wickets, na.rm = TRUE)
  )

## # A tibble: 1 × 6
##     min    q1 median  mean    q3   max
##   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1     0     0      0 0.478     1     6

Most bowlers take 0 or 1 wickets per match, which is expected in T20 cricket because each bowler bowls at most 4 overs. The distribution is again right-skewed: the median is 0 and q3 is 1, so 75% of observations have one wicket or fewer, and the maximum of 6 wickets represents a standout performance.
This helps evaluate players' bowling performance and supports comparing economy rate with wickets.
We can further analyse whether players with lower economy rates take more wickets, and whether certain venues are more bowler-friendly.

Categorical summary

Player

ipl |>
  count(player, sort = TRUE)

## # A tibble: 751 × 2
##    player         n
##    <chr>      <int>
##  1 RG Sharma    255
##  2 MS Dhoni     251
##  3 KD Karthik   249
##  4 V Kohli      248
##  5 RA Jadeja    238
##  6 S Dhawan     222
##  7 R Ashwin     211
##  8 SK Raina     201
##  9 RV Uthappa   200
## 10 AT Rayudu    193
## # ℹ 741 more rows

There are 751 unique players, and the 200+ appearances of the top players reflect fairly long careers.
This helps identify which players have enough data for deeper analysis.
We can further analyse which players have played for one team across all seasons and which have played for the most different teams.
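The one-team question can be sketched with `n_distinct()` on the team column. The table below is made up for illustration. One caveat, stated as an assumption: if the real data records a renamed franchise under separate names, those would count as separate teams here.

```r
library(dplyr)

# Toy data, made up for illustration; the real analysis would start from `ipl`.
toy <- tribble(
  ~player, ~team,
  "A", "Team X",
  "A", "Team X",
  "B", "Team X",
  "B", "Team Y",
  "B", "Team Z"
)

teams_per_player <- toy |>
  group_by(player) |>
  summarise(
    appearances = n(),
    n_teams = n_distinct(team),  # number of different teams per player
    .groups = "drop"
  ) |>
  arrange(desc(n_teams))

# Players who appear for exactly one team
one_team_players <- teams_per_player |>
  filter(n_teams == 1)

teams_per_player
```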

Venue

ipl |>
  count(venue, sort = TRUE)

## # A tibble: 58 × 2
##    venue                                          n
##    <chr>                                      <int>
##  1 Eden Gardens                                1652
##  2 Wankhede Stadium                            1582
##  3 M Chinnaswamy Stadium                       1388
##  4 Feroz Shah Kotla                            1291
##  5 Rajiv Gandhi International Stadium, Uppal   1059
##  6 MA Chidambaram Stadium, Chepauk             1037
##  7 Dubai International Cricket Stadium         1014
##  8 Wankhede Stadium, Mumbai                    1007
##  9 Sawai Mansingh Stadium                      1005
## 10 Punjab Cricket Association Stadium, Mohali   751
## # ℹ 48 more rows

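There are 58 unique venue names in the result above. Some grounds appear under more than one name (for example, "Wankhede Stadium" and "Wankhede Stadium, Mumbai"), so the number of distinct grounds is likely lower than 58.
The counts are player-match rows rather than matches, but they still show that some venues host more matches than others.
We can further analyse which venues are high-scoring and which are low-scoring, and which teams have the most match wins at each venue.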
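Because each row is one player in one match, counting wins directly would count the same team win once per player. A minimal sketch of the wins-per-venue idea first reduces the data to one row per team per match. The data is made up, and the sketch assumes `match_outcome` uses the value "win" as seen in the `glimpse()` output.

```r
library(dplyr)

# Toy data, made up for illustration; the real analysis would start from `ipl`.
toy <- tribble(
  ~match_id, ~player, ~team,    ~venue,     ~match_outcome,
  1,         "P1",    "Team X", "Ground 1", "win",
  1,         "P2",    "Team X", "Ground 1", "win",
  1,         "P3",    "Team Y", "Ground 1", "loss",
  2,         "P1",    "Team X", "Ground 1", "loss",
  2,         "P4",    "Team Z", "Ground 1", "win",
  3,         "P3",    "Team Y", "Ground 2", "win",
  3,         "P4",    "Team Z", "Ground 2", "loss"
)

wins_by_venue <- toy |>
  # Collapse player rows to one row per team per match
  distinct(match_id, team, venue, match_outcome) |>
  filter(match_outcome == "win") |>
  count(venue, team, name = "wins", sort = TRUE)

wins_by_venue
```

Without the `distinct()` step, Team X would be credited with two wins at Ground 1 for a single match.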

Questions

Q1. Which venues consistently produce higher run totals?

Aggregating runs by venue reveals which grounds are batting-friendly versus bowler-friendly.

This helps teams plan strategies based on historical scoring patterns.
Do specific teams perform better at certain venues?

Q2. Do bowlers with lower economy rates tend to take more wickets?

This explores whether low economy correlates with wicket-taking ability.

This helps teams distinguish bowlers who mainly contain runs from those who take wickets.
Which bowling types are better suited to different venues?
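A rough way to examine Q2 is a rank correlation between `economy` and `wickets`, restricted to rows where the player actually bowled. The `glimpse()` output shows `economy` recorded as 0 when `overs_bowled` is 0, so leaving those rows in would mix non-bowlers with very economical bowlers. The sketch below uses made-up data, and the choice of Spearman correlation is an assumption rather than the method used in this report.

```r
library(dplyr)

# Toy data, made up for illustration; the real analysis would start from `ipl`.
toy <- tribble(
  ~player, ~balls_bowled, ~economy, ~wickets,
  "A",     24,             6.0,     3,
  "B",     24,             7.5,     2,
  "C",     24,             9.0,     1,
  "D",     24,            10.5,     0,
  "E",      0,             0.0,     0,
  "F",      0,             0.0,     0
)

# Keep only rows where the player bowled at least one ball
bowling <- toy |>
  filter(balls_bowled > 0)

# Rank correlation among bowlers only
cor_bowlers <- cor(bowling$economy, bowling$wickets, method = "spearman")

# Same statistic with non-bowlers left in, for comparison
cor_all <- cor(toy$economy, toy$wickets, method = "spearman")

c(bowlers_only = cor_bowlers, all_rows = cor_all)
```

In the toy table the bowlers-only correlation is negative (lower economy goes with more wickets), while including the non-bowling rows pulls the estimate upward, which is why the filter matters.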

Q3. Which players achieve boundaries most efficiently based on balls-per-boundary?

Balls per boundary highlights aggressive, high-impact batters.

This helps teams identify batters who can score quickly and accelerate the run rate.

How does boundary efficiency vary across venues?
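Balls per boundary can be computed as total balls faced divided by total fours plus sixes, where lower values mean more frequent boundaries. The sketch below uses made-up data, and the minimum of 50 balls faced (`min_balls`) is a hypothetical threshold to keep tiny samples out of the ranking.

```r
library(dplyr)

# Toy data, made up for illustration; the real analysis would start from `ipl`.
toy <- tribble(
  ~player, ~balls_faced, ~fours, ~sixes,
  "A",     30,           4,      2,
  "A",     30,           3,      3,
  "B",     40,           2,      0,
  "B",     20,           1,      1,
  "C",      2,           1,      0
)

min_balls <- 50  # hypothetical minimum sample size

balls_per_boundary <- toy |>
  group_by(player) |>
  summarise(
    balls = sum(balls_faced, na.rm = TRUE),
    boundaries = sum(fours + sixes, na.rm = TRUE),
    .groups = "drop"
  ) |>
  # Drop small samples and avoid dividing by zero
  filter(balls >= min_balls, boundaries > 0) |>
  mutate(bpb = balls / boundaries) |>
  arrange(bpb)

balls_per_boundary
```

Player C hits a boundary every 2 balls in the toy data but has faced only 2 balls, so the threshold removes that row from the ranking.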

A1. Aggregating runs by venue to reveal which grounds are batting-friendly

innings_totals <- ipl |> 
  group_by(match_id, venue, team) |> 
  summarise(innings_runs = sum(runs, na.rm = TRUE), .groups = "drop")

avg_runs_per_venue <- innings_totals |> 
  group_by(venue) |> 
  summarise(avg_innings_runs = mean(innings_runs), .groups = "drop") |> 
  arrange(desc(avg_innings_runs))

avg_runs_per_venue <- avg_runs_per_venue |>
  mutate(short_venue = substr(venue, 1, 25))
head(avg_runs_per_venue |> select(short_venue, avg_innings_runs), 10)

## # A tibble: 10 × 2
##    short_venue                 avg_innings_runs
##    <chr>                                  <dbl>
##  1 "Dr. Y.S. Rajasekhara Redd"             190.
##  2 "Himachal Pradesh Cricket "             182 
##  3 "Arun Jaitley Stadium, Del"             182.
##  4 "M Chinnaswamy Stadium, Be"             180 
##  5 "Rajiv Gandhi Internationa"             180.
##  6 "Eden Gardens, Kolkata"                 179.
##  7 "Punjab Cricket Associatio"             176.
##  8 "Punjab Cricket Associatio"             167.
##  9 "Narendra Modi Stadium, Ah"             165.
## 10 "Wankhede Stadium, Mumbai"              164.

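The table ranks the top 10 IPL venues from highest to lowest average runs scored per innings. Innings totals here are sums of individual batters' runs, so they exclude any extras not credited to a batter. Venue names are truncated to 25 characters for display, which is why rows 7 and 8 look identical: they are distinct venue names that share the same first 25 characters.
This reveals which grounds are more batting-friendly, which helps teams decide on team combination and batter type.

Further analysis: which teams perform best at high-scoring venues?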
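A venue average built on only a handful of innings can rank highly by chance. One possible refinement, sketched here on made-up data, keeps the innings count next to the average and applies a minimum. The value of `min_innings` is a hypothetical threshold.

```r
library(dplyr)

# Toy data, made up for illustration; the real analysis would start from `ipl`.
toy <- tribble(
  ~match_id, ~venue,     ~team, ~runs,
  1,         "Ground 1", "X",    80,
  1,         "Ground 1", "X",    70,
  1,         "Ground 1", "Y",    60,
  1,         "Ground 1", "Y",    60,
  2,         "Ground 1", "X",    90,
  2,         "Ground 1", "Z",   100,
  3,         "Ground 2", "Y",   200,
  3,         "Ground 2", "Z",   150
)

min_innings <- 3  # hypothetical minimum number of innings per venue

venue_avg <- toy |>
  # One row per team innings: sum the batters' runs
  group_by(match_id, venue, team) |>
  summarise(innings_runs = sum(runs, na.rm = TRUE), .groups = "drop") |>
  # Average per venue, keeping the sample size alongside it
  group_by(venue) |>
  summarise(
    n_innings = n(),
    avg_innings_runs = mean(innings_runs),
    .groups = "drop"
  ) |>
  filter(n_innings >= min_innings) |>
  arrange(desc(avg_innings_runs))

venue_avg
```

Ground 2 has the higher average in the toy data but only two innings, so it is filtered out rather than ranked first.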

Visual Summaries

Visual 1: Histogram of Runs (distribution)

This histogram shows how the innings scores of the IPL's top run-scorer are distributed across 10-run bins.

top_scorer <- ipl |> 
  group_by(player) |> 
  summarise(total_runs = sum(runs, na.rm = TRUE)) |>
  arrange(desc(total_runs)) |> 
  slice(1) |> 
  pull(player)

top_player_data <- ipl |> 
  filter(player == top_scorer)

mean_runs_top <- mean(top_player_data$runs, na.rm = TRUE)

top_player_data |> 
  ggplot() +
  geom_histogram(
    mapping = aes(x = runs),
    binwidth = 10,
    color = "white",
    fill = "blue"
  ) +
  geom_vline(xintercept = mean_runs_top, color = "orange", linewidth = 1) +
  annotate(
    "text",
    x = mean_runs_top + 5,
    y = 5,
    label = "Average",
    color = "orange"
  ) +
  scale_x_continuous(breaks = seq(0, 160, by = 10)) +
  theme_classic() +
  labs(
    title = paste("Run Distribution for", top_scorer),
    x = "Runs Scored per Innings",
    y = "Count of Innings"
  )

Most of the batter's innings fall in the 10–40 run range, with occasional high scores extending beyond 60.
This pattern reflects a consistent scoring base with bursts of big innings that elevate his overall impact.

Further question: does the batter's scoring distribution shift in wins versus losses?
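A first look at the wins-versus-losses question could compare summary statistics of one player's runs by `match_outcome` before drawing any plots. The sketch below uses made-up data for a hypothetical player "A" and assumes the outcome values "win" and "loss" seen in the `glimpse()` output.

```r
library(dplyr)

# Toy data, made up for illustration; the real analysis would start from `ipl`.
toy <- tribble(
  ~player, ~runs, ~match_outcome,
  "A",     60,    "win",
  "A",     40,    "win",
  "A",     50,    "win",
  "A",     10,    "loss",
  "A",     30,    "loss"
)

runs_by_outcome <- toy |>
  filter(player == "A") |>
  group_by(match_outcome) |>
  summarise(
    innings = n(),
    mean_runs = mean(runs, na.rm = TRUE),
    median_runs = median(runs, na.rm = TRUE),
    .groups = "drop"
  )

runs_by_outcome
```

A gap between the two rows would suggest a shift, though on real data the number of innings in each group should be checked before reading much into it.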

Visual 2: Scatter Plot: Runs vs Career Strike Rate (Top 10 Players)

This plot shows how the top 10 run-scorers balance consistency with aggressiveness in their batting.

player_summary <- ipl |> 
  group_by(player) |> 
  summarise(
    total_runs = sum(runs, na.rm = TRUE),
    total_balls = sum(balls_faced, na.rm = TRUE),
    career_sr = (total_runs / total_balls) * 100
  ) |> 
  arrange(desc(total_runs)) |> 
  slice_head(n = 10)

player_summary |> 
  ggplot() +
  geom_point(aes(x = career_sr, y = total_runs, color = player,
             size = career_sr), alpha = 1) +
  theme_minimal() +
  labs(
    title = "Runs vs Career Strike Rate (Top 10 Run-Scorers)",
    x = "Career Strike Rate",
    y = "Runs"
  )

Among the top 10 run-scorers, players with higher strike rates appear as larger bubbles, revealing which batters combine volume with explosiveness.
For example, AB de Villiers stands out by bubble size: he is the only player with more than 5,000 runs and a strike rate above 145.
The visualization helps distinguish high-volume accumulators from high-impact aggressive players.
Further question: do these players maintain similar strike rates across different venues?
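The venue question can be sketched by reusing the career strike rate formula (total runs divided by total balls faced, times 100) at the player-and-venue level. The data below is made up, and `min_balls` is a hypothetical threshold to avoid strike rates based on a few deliveries.

```r
library(dplyr)

# Toy data, made up for illustration; the real analysis would start from `ipl`.
toy <- tribble(
  ~player, ~venue,     ~runs, ~balls_faced,
  "A",     "Ground 1", 30,    20,
  "A",     "Ground 1", 50,    30,
  "A",     "Ground 2", 20,    20,
  "A",     "Ground 2", 10,    20,
  "A",     "Ground 3",  6,     2
)

min_balls <- 30  # hypothetical minimum balls faced per venue

sr_by_venue <- toy |>
  group_by(player, venue) |>
  summarise(
    total_runs = sum(runs, na.rm = TRUE),
    total_balls = sum(balls_faced, na.rm = TRUE),
    .groups = "drop"
  ) |>
  filter(total_balls >= min_balls) |>
  # Aggregate strike rate, not the mean of per-innings strike rates
  mutate(venue_sr = total_runs / total_balls * 100) |>
  arrange(desc(venue_sr))

sr_by_venue
```

Summing runs and balls before dividing weights each innings by its length, which matches how `career_sr` is computed for the scatter plot.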

H510 Week 1 Data Dive - IPL

Mayank Gupta

2026-01-26