This Datadive explores the IPL Player Performance Dataset by computing summary statistics, generating visualizations, and extracting actionable insights.
Numeric summaries of key performance variables
Categorical summaries of players and venues
Three novel analytical questions
Visualizations that highlight trends, correlations, and interactions
Each section includes an insight, its significance, and further questions
library(tidyverse) # provides read_csv(), glimpse(), the dplyr verbs, and ggplot2
ipl <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")
## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): player, team, match_outcome, opposition_team, venue
## dbl (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ipl|>
glimpse()
## Rows: 24,044
## Columns: 22
## $ match_id <dbl> 598027, 335982, 501244, 1304112, 1426269, 980987, 7339…
## $ player <chr> "CH Gayle", "BB McCullum", "CH Gayle", "Q de Kock", "S…
## $ team <chr> "Royal Challengers Bangalore", "Kolkata Knight Riders"…
## $ runs <dbl> 175, 158, 107, 140, 109, 129, 83, 79, 89, 133, 75, 128…
## $ balls_faced <dbl> 69, 77, 49, 71, 64, 53, 42, 35, 49, 61, 49, 62, 65, 58…
## $ fours <dbl> 13, 10, 10, 10, 13, 10, 7, 3, 9, 19, 8, 7, 16, 11, 15,…
## $ sixes <dbl> 17, 13, 9, 10, 6, 12, 7, 10, 6, 4, 5, 13, 8, 5, 7, 8, …
## $ wickets <dbl> 1, 0, 3, 0, 2, 0, 4, 4, 3, 0, 4, 0, 0, 2, 0, 0, 0, 0, …
## $ overs_bowled <dbl> 1, 0, 4, 0, 4, 0, 4, 4, 4, 0, 4, 0, 0, 4, 0, 0, 0, 0, …
## $ balls_bowled <dbl> 6, 0, 24, 0, 26, 0, 24, 24, 25, 0, 25, 0, 0, 24, 0, 0,…
## $ runs_conceded <dbl> 5, 0, 21, 0, 27, 0, 35, 30, 17, 0, 28, 0, 0, 25, 0, 0,…
## $ catches <dbl> 0, 1, 0, 1, 1, 2, 0, 0, 1, 1, 0, 0, 0, 2, 0, 2, 0, 0, …
## $ run_outs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ maiden <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stumps <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ match_outcome <chr> "win", "win", "win", "win", "loss", "win", "loss", "lo…
## $ opposition_team <chr> "Pune Warriors", "Royal Challengers Bangalore", "Kings…
## $ strike_rate <dbl> 253.62, 205.19, 218.37, 197.18, 170.31, 243.40, 197.62…
## $ economy <dbl> 5.00, 0.00, 5.25, 0.00, 6.23, 0.00, 8.75, 7.50, 4.08, …
## $ fantasy_points <dbl> 386, 314, 306, 286, 285, 283, 283, 281, 270, 267, 267,…
## $ venue <chr> "M Chinnaswamy Stadium", "M Chinnaswamy Stadium", "M C…
## $ date <date> 2013-04-23, 2008-04-18, 2011-05-06, 2022-05-18, 2024-…
This summarizes the distribution of runs scored across all observations.
ipl |>
summarise(
min = min(runs, na.rm = TRUE),
q1 = quantile(runs, 0.25, na.rm = TRUE),
median = median(runs, na.rm = TRUE),
mean = mean(runs, na.rm = TRUE),
q3 = quantile(runs, 0.75, na.rm = TRUE),
max = max(runs, na.rm = TRUE)
)
## # A tibble: 1 × 6
## min q1 median mean q3 max
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 5 13.8 21 175
Runs are heavily right-skewed: the mean (13.8) is well above the median (5), and both the minimum and the first quartile are zero. At least a quarter of observations therefore record no runs at all, and most innings produce low scores. Scores of 100 or more are rare, standout performances.
This reflects typical T20 batting behavior and helps classify batters by their scoring output.
Further question: which players score consistently, with frequent high scores?
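One way to start on this question is to count how often each player reaches a chosen threshold. The sketch below uses a small invented tibble (`toy`) in place of `ipl`, and the 50-run cutoff is an assumption made for illustration. It also restricts ducks to innings where `balls_faced > 0`, on the assumption that a zero in `runs` with no balls faced means the player did not bat.

```r
library(dplyr)

# Toy stand-in for `ipl`; the values are invented for illustration only.
toy <- tibble(
  player      = c("A", "A", "A", "B", "B", "B"),
  runs        = c(0, 55, 72, 0, 4, 101),
  balls_faced = c(3, 30, 41, 0, 6, 58)
)

# Keep only innings where the player actually faced a ball (assumed rule).
batted <- toy |>
  filter(balls_faced > 0)

# Share of batted innings that ended on zero runs.
duck_share <- mean(batted$runs == 0)

# Innings of 50 or more per player, and the share of that player's innings.
big_scores <- batted |>
  group_by(player) |>
  summarise(
    innings     = n(),
    fifty_plus  = sum(runs >= 50),
    fifty_share = fifty_plus / innings,
    .groups = "drop"
  ) |>
  arrange(desc(fifty_share))
```

On the real data, a minimum number of innings per player would probably be needed before comparing `fifty_share`, since players with only a handful of appearances can show extreme shares.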
This summarizes the distribution of wickets taken across all observations.
ipl |>
summarise(
min = min(wickets, na.rm = TRUE),
q1 = quantile(wickets, 0.25, na.rm = TRUE),
median = median(wickets, na.rm = TRUE),
mean = mean(wickets, na.rm = TRUE),
q3 = quantile(wickets, 0.75, na.rm = TRUE),
max = max(wickets, na.rm = TRUE)
)
## # A tibble: 1 × 6
## min q1 median mean q3 max
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 0 0.478 1 6
Most player-match observations record 0 or 1 wickets, which is expected in T20 cricket because a bowler bowls at most 4 overs. The distribution is again right-skewed: q3 is 1, so at least 75% of observations have one wicket or fewer, and the maximum of 6 wickets represents a standout performance.
This helps evaluate players' bowling performance and sets up a comparison of economy rate against wickets.
Further questions: do players with a lower economy rate take more wickets? Are certain venues more bowler-friendly?
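A first look at the economy question could be a simple correlation. The `glimpse()` output shows `economy` as 0.00 whenever `overs_bowled` is 0, so rows where the player did not bowl should be dropped first, or they would look like very economical spells. The sketch below uses an invented tibble (`toy`) rather than the real data.

```r
library(dplyr)

# Toy stand-in for `ipl`; the values are invented for illustration only.
toy <- tibble(
  overs_bowled = c(0, 4, 4, 3, 4, 0),
  economy      = c(0, 6.0, 9.5, 7.0, 5.5, 0),
  wickets      = c(0, 2, 0, 1, 3, 0)
)

# Keep only rows where the player actually bowled.
bowling <- toy |>
  filter(overs_bowled > 0)

# Pearson correlation between economy rate and wickets taken.
econ_wkt_cor <- cor(bowling$economy, bowling$wickets)
```

A negative value would suggest that cheaper spells tend to come with more wickets. Because wickets are small counts, a rank-based measure such as `cor(..., method = "spearman")` may be a reasonable alternative.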
ipl|>
count(player, sort=TRUE)
## # A tibble: 751 × 2
## player n
## <chr> <int>
## 1 RG Sharma 255
## 2 MS Dhoni 251
## 3 KD Karthik 249
## 4 V Kohli 248
## 5 RA Jadeja 238
## 6 S Dhawan 222
## 7 R Ashwin 211
## 8 SK Raina 201
## 9 RV Uthappa 200
## 10 AT Rayudu 193
## # ℹ 741 more rows
The result above shows 751 unique players; the 200+ appearances of the leading names reflect long careers.
This helps identify which players have enough data for deeper analysis.
Further question: which players have played for a single team across all seasons, and who has played for the most different teams?
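The team question can be approached by counting distinct `team` values per player. The sketch below uses an invented tibble (`toy`) in place of `ipl`. It treats every distinct team label as a separate team, which is an assumption: a franchise that changed its name would be counted twice.

```r
library(dplyr)

# Toy stand-in for `ipl`; the values are invented for illustration only.
toy <- tibble(
  player = c("A", "A", "A", "B", "B", "C"),
  team   = c("X", "X", "X", "X", "Y", "Z")
)

# Appearances and number of distinct team labels per player.
teams_per_player <- toy |>
  group_by(player) |>
  summarise(
    appearances = n(),
    n_teams     = n_distinct(team),
    .groups = "drop"
  ) |>
  arrange(desc(n_teams))

# Players who appear under exactly one team label.
one_team_players <- teams_per_player |>
  filter(n_teams == 1)
```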
ipl|>
count(venue, sort=TRUE)
## # A tibble: 58 × 2
## venue n
## <chr> <int>
## 1 Eden Gardens 1652
## 2 Wankhede Stadium 1582
## 3 M Chinnaswamy Stadium 1388
## 4 Feroz Shah Kotla 1291
## 5 Rajiv Gandhi International Stadium, Uppal 1059
## 6 MA Chidambaram Stadium, Chepauk 1037
## 7 Dubai International Cricket Stadium 1014
## 8 Wankhede Stadium, Mumbai 1007
## 9 Sawai Mansingh Stadium 1005
## 10 Punjab Cricket Association Stadium, Mohali 751
## # ℹ 48 more rows
There are 58 distinct venue labels in the result above, though some grounds appear under more than one label (for example, "Wankhede Stadium" and "Wankhede Stadium, Mumbai").
Some venues host more matches than others.
This helps identify which venues are high-scoring and which are low-scoring.
It also helps identify which teams have the most wins at each venue.
This helps teams build strategies based on historical scoring patterns.
Do specific teams perform better at certain venues?
This helps teams choose the bowling type to field.
Which bowling type is better suited to each venue?
This helps teams identify batters who can score at a fair clip and accelerate the run rate.
How does boundary efficiency vary across venues?
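Any per-venue comparison depends on each ground having a single label, and the venue counts above list both "Wankhede Stadium" and "Wankhede Stadium, Mumbai". A minimal sketch of one possible cleanup follows. It assumes that everything after the first comma is a city suffix, which will not hold for every label, and `venue_clean` is a hypothetical column name.

```r
library(dplyr)

# Toy stand-in for the `venue` column; the labels mimic the pattern seen above.
toy <- tibble(
  venue = c(
    "Wankhede Stadium",
    "Wankhede Stadium, Mumbai",
    "Eden Gardens",
    "Eden Gardens, Kolkata",
    "Sawai Mansingh Stadium"
  )
)

# Drop everything from the first comma onward (assumed to be a city suffix),
# then count rows per cleaned label.
venue_counts <- toy |>
  mutate(venue_clean = trimws(sub(",.*$", "", venue))) |>
  count(venue_clean, sort = TRUE)
```

The merged labels should be inspected by hand before they are used, because a comma-based rule can also merge or split names in ways that are not intended.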
innings_totals <- ipl |>
group_by(match_id, venue, team) |>
summarise(innings_runs = sum(runs, na.rm = TRUE), .groups = "drop")
avg_runs_per_venue <- innings_totals |>
group_by(venue) |>
summarise(avg_innings_runs = mean(innings_runs), .groups = "drop") |>
arrange(desc(avg_innings_runs))
avg_runs_per_venue <- avg_runs_per_venue |>
mutate(short_venue = substr(venue, 1, 25))
head(avg_runs_per_venue |> select(short_venue, avg_innings_runs), 10)
## # A tibble: 10 × 2
## short_venue avg_innings_runs
## <chr> <dbl>
## 1 "Dr. Y.S. Rajasekhara Redd" 190.
## 2 "Himachal Pradesh Cricket " 182
## 3 "Arun Jaitley Stadium, Del" 182.
## 4 "M Chinnaswamy Stadium, Be" 180
## 5 "Rajiv Gandhi Internationa" 180.
## 6 "Eden Gardens, Kolkata" 179.
## 7 "Punjab Cricket Associatio" 176.
## 8 "Punjab Cricket Associatio" 167.
## 9 "Narendra Modi Stadium, Ah" 165.
## 10 "Wankhede Stadium, Mumbai" 164.
The table lists the 10 IPL venues with the highest average runs per innings, in descending order. Because venue names are truncated to 25 characters, rows 7 and 8 display the same label even though they are separate venue entries.
This reveals which grounds are more batting-friendly, which helps teams decide their team combination and batter type.
Further analysis: which teams perform best at high-scoring venues?
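To follow up on team performance by venue, the player-level rows first have to be collapsed to one row per team per match, otherwise each match is counted once per player. The sketch below does this on an invented tibble (`toy`) and assumes `match_outcome` takes only the values "win" and "loss"; any other value would simply count as a non-win here.

```r
library(dplyr)

# Toy stand-in for `ipl`: three matches, two player rows per team per match.
toy <- tibble(
  match_id      = rep(c(1, 2, 3), each = 4),
  team          = rep(c("X", "Y", "X", "Z", "X", "Y"), each = 2),
  venue         = rep(c("G1", "G1", "G2"), each = 4),
  match_outcome = rep(c("win", "loss", "win", "loss", "loss", "win"), each = 2)
)

# One row per team per match, then matches played and win rate by venue.
team_venue <- toy |>
  distinct(match_id, team, venue, match_outcome) |>
  group_by(venue, team) |>
  summarise(
    matches  = n(),
    win_rate = mean(match_outcome == "win"),
    .groups = "drop"
  )
```

Joining `team_venue` to the per-venue scoring averages would then allow the comparison to be limited to the high-scoring grounds.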
This histogram shows how the innings scores of the IPL's top run-scorer are distributed across 10-run bins.
top_scorer <- ipl |>
group_by(player) |>
summarise(total_runs = sum(runs)) |>
arrange(desc(total_runs)) |>
slice(1) |>
pull(player)
top_player_data <- ipl |>
filter(player == top_scorer)
mean_runs_top <- mean(top_player_data$runs)
top_player_data |>
ggplot() +
geom_histogram(
mapping = aes(x = runs),
binwidth = 10,
color = "white",
fill = "blue"
) +
geom_vline(xintercept = mean_runs_top, color = "orange", linewidth = 1) +
annotate(
"text",
x = mean_runs_top + 5,
y = 5,
label = "Average",
color = "orange"
) +
scale_x_continuous(breaks = seq(0, 160, by = 10)) +
theme_classic() +
labs(
title = paste("Run Distribution for", top_scorer),
x = "Runs Scored per Innings",
y = "Count of Innings"
)
Most of the batter's innings fall in the 10–40 run range, with occasional high scores extending beyond 60.
This pattern reflects a consistent scoring base with bursts of big innings that elevate his overall impact.
Further question: does the batter's scoring distribution shift between wins and losses?
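A simple starting point for the wins-versus-losses question is to summarise the same player's runs separately for each value of `match_outcome`. The sketch below uses an invented tibble (`toy`) in place of the top scorer's rows and assumes only "win" and "loss" occur.

```r
library(dplyr)

# Toy stand-in for one player's innings; the values are invented.
toy <- tibble(
  runs          = c(10, 45, 80, 5, 20, 35),
  match_outcome = c("win", "win", "win", "loss", "loss", "loss")
)

# Innings count, median, and mean runs for each match outcome.
by_outcome <- toy |>
  group_by(match_outcome) |>
  summarise(
    innings     = n(),
    median_runs = median(runs),
    mean_runs   = mean(runs),
    .groups = "drop"
  )
```

For a visual comparison, the existing histogram could be split by adding `facet_wrap(~ match_outcome)` to the ggplot call.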
This plot shows how the top 10 run-scorers balance consistency with aggressiveness in their batting.
player_summary <- ipl |>
group_by(player) |>
summarise(
total_runs = sum(runs, na.rm = TRUE),
total_balls = sum(balls_faced, na.rm = TRUE),
career_sr = (total_runs / total_balls) * 100
) |>
arrange(desc(total_runs)) |>
slice_head(n = 10)
player_summary |>
ggplot() +
geom_point(aes(x = career_sr, y = total_runs, color = player,
size = career_sr), alpha = 1) +
theme_minimal() +
labs(
title = "Runs vs Career Strike Rate (Top 10 Run-Scorers)",
x = "Career Strike Rate",
y = "Runs"
)
Among the top 10 run-scorers, players with higher strike rates appear as larger bubbles further to the right, revealing which batters combine volume with explosiveness.
The visualization helps distinguish high-volume accumulators from high-impact aggressive players.
Further question: do these players maintain similar strike rates across different venues?
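The venue question can be explored by recomputing strike rate per player and venue from summed runs and balls, in the same way as `career_sr` above. The sketch below runs on an invented tibble (`toy`), and the 30-ball minimum is an assumed cutoff to keep small samples from dominating.

```r
library(dplyr)

# Toy stand-in for `ipl`; the values are invented for illustration only.
toy <- tibble(
  player      = c("A", "A", "A", "A", "A"),
  venue       = c("G1", "G1", "G2", "G2", "G3"),
  runs        = c(30, 50, 10, 20, 12),
  balls_faced = c(20, 30, 10, 20, 6)
)

# Strike rate per player and venue, computed from totals rather than by
# averaging per-innings strike rates.
sr_by_venue <- toy |>
  group_by(player, venue) |>
  summarise(
    balls       = sum(balls_faced),
    strike_rate = 100 * sum(runs) / balls,
    .groups = "drop"
  ) |>
  filter(balls >= 30) # assumed minimum sample size
```

Here venue G3 is dropped because only 6 balls were faced there, which is the kind of small sample the cutoff is meant to exclude.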