This Data Dive explores the IPL Player Performance Dataset by generating multiple random subsamples and examining how statistical summaries change across them.
Creating multiple random samples with replacement
Scrutinizing each subsample to observe how summary statistics differ
Identifying anomalies that appear in one subsample but not in others
Detecting aspects of the data that remain consistent across all subsamples
Comparing how conclusions change as the sample size increases (10%, 25%, 75%, etc.)
Considering how this investigation influences the way conclusions should be drawn in future analyses
Each section includes insights, their significance, and further questions.
library(tidyverse)   # readr, dplyr, tibble
library(lubridate)   # year()

ipl_raw <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")
## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): player, team, match_outcome, opposition_team, venue
## dbl (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note on data preparation: the dataset includes only 5 matches from the 2025 season, which is incomplete and would distort the calculations. To avoid this, all rows from 2025 were filtered out, so the analysis below uses a clean dataset containing complete seasons only.
IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = year(date)
  ) |>
  filter(season < 2025)
nrow(IPL)
## [1] 23925
length(unique(IPL$match_id))
## [1] 1095
The population consists of all IPL player-innings from 2008–2024. Each row in the dataset represents one player’s performance in a single match. Defining the population clearly is essential for meaningful sampling.
Generating 3 random subsamples, each with a sample size of 25% of the data.
set.seed(123)

# Sample size: 25% of the data
sample_frac <- 0.25

# Number of subsamples
n_samples <- 3

IPL_samples <- tibble()

for (sample_i in 1:n_samples) {
  IPL_i <- IPL |>
    sample_n(size = sample_frac * nrow(IPL), replace = TRUE) |>
    mutate(sample_num = sample_i)
  IPL_samples <- bind_rows(IPL_samples, IPL_i)
}
IPL_samples
## # A tibble: 17,943 × 24
## match_id player team runs balls_faced fours sixes wickets overs_bowled
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1136574 Ishan Kish… Mumb… 0 1 0 0 0 0
## 2 419113 SE Bond Kolk… 1 2 0 0 0 4
## 3 1136584 DJ Bravo Chen… 14 7 1 1 2 4
## 4 1304069 D Brevis Mumb… 49 25 4 5 0 0
## 5 1304101 MR Marsh Delh… 25 20 3 1 1 3
## 6 419153 R Ashwin Chen… 0 0 0 0 1 4
## 7 419133 SB Jakati Chen… 0 0 0 0 2 4
## 8 598016 P Awana King… 0 0 0 0 2 4
## 9 734023 MK Pandey Kolk… 18 28 0 0 0 0
## 10 1359487 B Sai Sudh… Guja… 53 39 3 2 0 0
## # ℹ 17,933 more rows
## # ℹ 15 more variables: balls_bowled <dbl>, runs_conceded <dbl>, catches <dbl>,
## # run_outs <dbl>, maiden <dbl>, stumps <dbl>, match_outcome <chr>,
## # opposition_team <chr>, strike_rate <dbl>, economy <dbl>,
## # fantasy_points <dbl>, venue <chr>, date <date>, season <dbl>,
## # sample_num <int>
IPL_samples |> count(sample_num)
## # A tibble: 3 × 2
## sample_num n
## <int> <int>
## 1 1 5981
## 2 2 5981
## 3 3 5981
Each subsample contains a slightly different mix of player-innings, even though all 3 subsamples come from the same dataset.
This highlights the idea of sampling variability: randomness introduces differences that can influence conclusions.
Further questions:
Which variables show the most fluctuation across samples?
Would using a larger sample size reduce this variability?
Calculating the mean and standard deviation of key variables: runs, wickets, strike_rate and economy.
# Summarize each subsample
sample_summary <- IPL_samples |>
  group_by(sample_num) |>
  summarise(
    n_rows = n(),
    Avg_runs = mean(runs, na.rm = TRUE),
    sd_runs = sd(runs, na.rm = TRUE),
    avg_wickets = mean(wickets, na.rm = TRUE),
    avg_strike_rate = mean(strike_rate, na.rm = TRUE),
    avg_economy = mean(economy, na.rm = TRUE)
  )
sample_summary
## # A tibble: 3 × 7
## sample_num n_rows Avg_runs sd_runs avg_wickets avg_strike_rate avg_economy
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 5981 14.2 20.2 0.455 77.3 4.14
## 2 2 5981 13.5 19.8 0.482 74.8 4.29
## 3 3 5981 13.4 19.6 0.460 72.9 4.34
Each subsample produced slightly different averages for runs, wickets, strike rate, and economy.
Strike rate fluctuated more than runs, indicating it is more sensitive to which observations are included.
No single subsample perfectly represents the true population; the summaries are estimates that depend on the sample drawn.
This demonstrates that conclusions based on a single sample can sometimes be misleading.
It also highlights the importance of checking stability across multiple samples before making strong claims; one quick way to quantify that stability is sketched below.
Further question: how much do these summaries converge as the sample size increases?
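As a minimal sketch (using the sample_summary table above, and taking spread to be simply the difference between the largest and smallest subsample mean), the between-sample spread of each average can be computed directly:

# Spread (max minus min) of each average across the three subsamples
sample_summary |>
  summarise(
    spread_avg_runs = max(Avg_runs) - min(Avg_runs),
    spread_avg_strike_rate = max(avg_strike_rate) - min(avg_strike_rate),
    spread_avg_economy = max(avg_economy) - min(avg_economy)
  )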
Considering an anomalous innings to be any performance in which a player took 5 or more wickets, a rare and high-impact event in T20 cricket known as a five-wicket haul.
Counted the number of anomalies and calculated the anomaly rate for each of the 3 subsamples.
Observed how the frequency of 5‑wicket hauls changed across samples of equal size.
anomaly_summary <- IPL_samples |>
  mutate(
    is_anomaly = wickets >= 5
  ) |>
  group_by(sample_num) |>
  summarise(
    n_innings = n(),
    n_anomaly = sum(is_anomaly, na.rm = TRUE),
    anomaly_rate = n_anomaly / n_innings
  )
anomaly_summary
## # A tibble: 3 × 4
## sample_num n_innings n_anomaly anomaly_rate
## <int> <int> <int> <dbl>
## 1 1 5981 6 0.00100
## 2 2 5981 14 0.00234
## 3 3 5981 1 0.000167
The number of 5-wicket hauls varied substantially across samples: one sample had 6, another had 14, and another had only 1.
This large variation occurred even though all samples came from the same dataset and each subsample contained the same number of innings (5,981).
This demonstrates that rare events such as five-wicket hauls fluctuate far more than averages, making them unreliable indicators when only a single sample is used.
Further question: how large must a sample be before the anomaly rate becomes stable?
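A rough way to approach this question is a standard-error calculation (a minimal sketch, assuming five-wicket hauls behave like independent Bernoulli events occurring at the full-data rate p, so the standard error of the rate is sqrt(p(1 - p)/n)):

# Expected sampling spread of the anomaly rate at different sample sizes,
# assuming 5-wicket hauls occur independently at the full-data rate p
p_full <- mean(IPL$wickets >= 5, na.rm = TRUE)

tibble(n = c(2392, 5981, 17943, nrow(IPL))) |>
  mutate(se_rate = sqrt(p_full * (1 - p_full) / n))

The standard error shrinks only with the square root of n, which is consistent with rare-event counts remaining noticeably variable even in fairly large samples.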
IPL_samples |>
  group_by(sample_num) |>
  summarise(n_teams = n_distinct(team),
            n_venues = n_distinct(venue))
## # A tibble: 3 × 3
## sample_num n_teams n_venues
## <int> <int> <int>
## 1 1 19 58
## 2 2 19 58
## 3 3 19 58
All three subsamples contained the same number of distinct teams and venues, showing that random sampling preserved team and venue diversity.
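The same structural check can be extended to seasons (a minimal sketch; whether every season from 2008–2024 appears in each subsample depends on the particular random draw):

# Check whether each subsample also covers the full range of seasons
IPL_samples |>
  group_by(sample_num) |>
  summarise(n_seasons = n_distinct(season))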
Drawing 3 random samples at each of three sizes: 10%, 25%, and 75% of the dataset.
For each subsample, calculating summary statistics of the key variables and the anomaly rate, where an anomaly is a five-wicket haul.
Comparing how much the summary statistics vary within each sample size.
set.seed(123)

# Sample sizes
sample_fracs <- c(0.10, 0.25, 0.75)
n_samples <- 3

IPL_multi <- tibble()

# Generate 3 samples for each sample size
for (f in sample_fracs) {
  for (sample_i in 1:n_samples) {
    IPL_i <- IPL |>
      sample_n(size = f * nrow(IPL), replace = TRUE) |>
      mutate(
        sample_num = sample_i,
        sample_frac = f
      )
    IPL_multi <- bind_rows(IPL_multi, IPL_i)
  }
}
# Summaries for each sample size and repetition
summary_multi <- IPL_multi |>
  group_by(sample_frac, sample_num) |>
  summarise(
    n_innings = n(),
    avg_runs = mean(runs, na.rm = TRUE),
    sd_runs = sd(runs, na.rm = TRUE),
    avg_wickets = mean(wickets, na.rm = TRUE),
    avg_strike_rate = mean(strike_rate, na.rm = TRUE),
    n_5w = sum(wickets >= 5, na.rm = TRUE),
    anomaly_rate = n_5w / n_innings,
    .groups = "drop"
  )
summary_multi
## # A tibble: 9 × 9
## sample_frac sample_num n_innings avg_runs sd_runs avg_wickets avg_strike_rate
## <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.1 1 2392 13.5 19.4 0.475 77.2
## 2 0.1 2 2392 14.7 20.9 0.447 76.1
## 3 0.1 3 2392 14.1 19.7 0.442 77.8
## 4 0.25 1 5981 13.5 19.8 0.479 74.9
## 5 0.25 2 5981 13.4 19.8 0.476 73.2
## 6 0.25 3 5981 14.1 20.1 0.481 75.6
## 7 0.75 1 17943 13.7 20.1 0.484 75.2
## 8 0.75 2 17943 13.8 19.9 0.481 75.3
## 9 0.75 3 17943 13.6 19.5 0.469 75.9
## # ℹ 2 more variables: n_5w <int>, anomaly_rate <dbl>
Smaller samples (10%) show the most variability, while larger samples (75%) produce much more stable summaries. The 25% samples fall in the middle.
10% samples showed the highest variability across the three repetitions.
Average runs ranged from 13.5 to 14.7, the standard deviation of runs from 19.4 to 20.9, and the average strike rate varied from 76.06 to 77.75.
This shows that small samples are more variable and sensitive to random chance.
25% samples were more stable, but still showed noticeable variation.
Average runs stayed between 13.36 and 14.14, and the standard deviation of runs between 19.77 and 20.08.
Rare events (5‑wicket hauls) still fluctuated significantly.
75% samples were the most stable and consistent.
Average runs stayed tightly between 13.61 and 13.80, the standard deviation of runs between 19.46 and 20.06, and the average strike rate around 75.22–75.89.
Even anomaly counts became more consistent (8, 24, 24).
As sample size increased, all metrics converged toward the population values, and variability across the three repetitions decreased dramatically.
Demonstrates the fundamental statistical principle that larger samples produce more reliable and stable estimates.
Further question: how do team-level or venue-level summaries behave across different sample sizes?
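One way to start on that question (a minimal sketch using the IPL_multi samples created above; the exact values depend on the random draws) is to compute a team-level average for each sample size and repetition, and then see how much each team's average moves across the three repetitions:

# Team-level average runs per innings, by sample size and repetition
team_summary <- IPL_multi |>
  group_by(sample_frac, sample_num, team) |>
  summarise(avg_runs = mean(runs, na.rm = TRUE), .groups = "drop")

# Spread of each team's average across the three repetitions, by sample size
team_summary |>
  group_by(sample_frac, team) |>
  summarise(spread_avg_runs = max(avg_runs) - min(avg_runs), .groups = "drop") |>
  arrange(desc(spread_avg_runs))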
Always consider sample size before trusting a result: avoid drawing strong conclusions from small or single samples, and use larger samples whenever possible.
Check consistency across multiple samples, especially when interpreting rare events, since stable averages do not guarantee stable anomaly rates.
Understand that sampling variability is unavoidable and avoid overgeneralizing patterns from limited data.
Use structural checks (teams, venues, seasons) to ensure that samples represent the full population before interpreting results.
In future analysis, conclusions should be framed with uncertainty, acknowledging that different samples may lead to slightly different interpretations unless the sample is sufficiently large.
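As one concrete way to attach that uncertainty to an estimate, a simple bootstrap percentile interval for the mean runs per innings could be reported instead of a single number (a minimal sketch, not part of the analysis above):

# Bootstrap the mean runs per innings to express the estimate with uncertainty
set.seed(123)
boot_means <- replicate(1000, mean(sample(IPL$runs, replace = TRUE), na.rm = TRUE))
quantile(boot_means, probs = c(0.025, 0.975))   # 95% percentile interval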