This Data Dive explores the IPL Player Performance Dataset by generating multiple random subsamples and examining how statistical summaries change across them.
Creating multiple random samples with replacement
Scrutinizing each subsample to observe how summary statistics differ
Identifying anomalies that appear in one subsample but not in others
Detecting aspects of the data that remain consistent across all subsamples
Comparing how conclusions change as the sample size increases (10%, 25%, 75%, etc.)
Considering how this investigation influences the way conclusions should be drawn in future analyses
Each section includes insights, their significance, and further questions.
library(tidyverse)   # readr, dplyr, tibble
library(lubridate)   # year()

ipl_raw <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")
## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): player, team, match_outcome, opposition_team, venue
## dbl (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note on data preparation: the dataset includes only 5 matches from the 2025 season, which is incomplete and would distort the calculations. To avoid this, all rows from 2025 were filtered out, so the analysis below uses a clean dataset containing complete seasons only.
IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = year(date)
  ) |>
  filter(season < 2025)
nrow(IPL)
## [1] 23925
length(unique(IPL$match_id))
## [1] 1095
The population consists of all IPL player-innings from 2008–2024. Each row in the dataset represents one player’s performance in a single match. Defining the population clearly is essential for meaningful sampling.
Generating 3 random subsamples, each with a sample size of 25% of the data.
set.seed(123)

# Sample size: 25% of the data
sample_frac <- 0.25

# Number of subsamples
n_samples <- 3

IPL_samples <- tibble()

for (sample_i in 1:n_samples) {
  IPL_i <- IPL |>
    sample_n(size = sample_frac * nrow(IPL), replace = TRUE) |>
    mutate(sample_num = sample_i)
  IPL_samples <- bind_rows(IPL_samples, IPL_i)
}
IPL_samples
## # A tibble: 17,943 × 24
## match_id player team runs balls_faced fours sixes wickets overs_bowled
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1136574 Ishan Kish… Mumb… 0 1 0 0 0 0
## 2 419113 SE Bond Kolk… 1 2 0 0 0 4
## 3 1136584 DJ Bravo Chen… 14 7 1 1 2 4
## 4 1304069 D Brevis Mumb… 49 25 4 5 0 0
## 5 1304101 MR Marsh Delh… 25 20 3 1 1 3
## 6 419153 R Ashwin Chen… 0 0 0 0 1 4
## 7 419133 SB Jakati Chen… 0 0 0 0 2 4
## 8 598016 P Awana King… 0 0 0 0 2 4
## 9 734023 MK Pandey Kolk… 18 28 0 0 0 0
## 10 1359487 B Sai Sudh… Guja… 53 39 3 2 0 0
## # ℹ 17,933 more rows
## # ℹ 15 more variables: balls_bowled <dbl>, runs_conceded <dbl>, catches <dbl>,
## # run_outs <dbl>, maiden <dbl>, stumps <dbl>, match_outcome <chr>,
## # opposition_team <chr>, strike_rate <dbl>, economy <dbl>,
## # fantasy_points <dbl>, venue <chr>, date <date>, season <dbl>,
## # sample_num <int>
IPL_samples |> count(sample_num)
## # A tibble: 3 × 2
## sample_num n
## <int> <int>
## 1 1 5981
## 2 2 5981
## 3 3 5981
Each subsample contains a slightly different mix of player-innings, even though all 3 subsamples come from the same dataset.
This highlights the idea of sampling variability: randomness introduces differences that can influence conclusions.
Further questions:
Which variables show the most fluctuation across samples?
Would using a larger sample size reduce this variability?
Calculating the mean and standard deviation of key variables: runs, wickets, strike_rate and economy.
# Summarize each subsample
sample_summary <- IPL_samples |>
  group_by(sample_num) |>
  summarise(
    n_rows = n(),
    Avg_runs = mean(runs, na.rm = TRUE),
    sd_runs = sd(runs, na.rm = TRUE),
    avg_wickets = mean(wickets, na.rm = TRUE),
    avg_strike_rate = mean(strike_rate, na.rm = TRUE),
    avg_economy = mean(economy, na.rm = TRUE)
  )
sample_summary
## # A tibble: 3 × 7
## sample_num n_rows Avg_runs sd_runs avg_wickets avg_strike_rate avg_economy
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 5981 14.2 20.2 0.455 77.3 4.14
## 2 2 5981 13.5 19.8 0.482 74.8 4.29
## 3 3 5981 13.4 19.6 0.460 72.9 4.34
Each subsample produced slightly different averages for runs, wickets, strike rate, and economy.
Strike rate fluctuated more than runs, indicating it is more sensitive to which observations are included.
No single subsample perfectly represents the true population; the summaries are estimates that depend on the sample drawn.
This demonstrates that conclusions based on a single sample can sometimes be misleading.
It also highlights the importance of checking stability across multiple samples before making strong claims; one quick way to quantify that stability is sketched below.
Further question: how much do these summaries converge as the sample size increases?
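As a minimal sketch (using the sample_summary table above, and taking spread to be simply the difference between the largest and smallest subsample mean), the between-sample spread of each average can be computed directly:

# Spread (max minus min) of each average across the three subsamples
sample_summary |>
  summarise(
    spread_avg_runs = max(Avg_runs) - min(Avg_runs),
    spread_avg_strike_rate = max(avg_strike_rate) - min(avg_strike_rate),
    spread_avg_economy = max(avg_economy) - min(avg_economy)
  )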
Considering an anomalous innings to be any performance in which a player took 5 or more wickets, a rare and high-impact event in T20 cricket known as a five-wicket haul.
Counted the number of anomalies and calculated the anomaly rate for each of the 3 subsamples.
Observed how the frequency of 5‑wicket hauls changed across samples of equal size.
anomaly_summary <- IPL_samples |>
  mutate(
    is_anomaly = wickets >= 5
  ) |>
  group_by(sample_num) |>
  summarise(
    n_innings = n(),
    n_anomaly = sum(is_anomaly, na.rm = TRUE),
    anomaly_rate = n_anomaly / n_innings
  )
anomaly_summary
## # A tibble: 3 × 4
## sample_num n_innings n_anomaly anomaly_rate
## <int> <int> <int> <dbl>
## 1 1 5981 6 0.00100
## 2 2 5981 14 0.00234
## 3 3 5981 1 0.000167
The number of 5-wicket hauls varied substantially across samples: one sample had 6, another had 14, and another had only 1.
This large variation occurred even though all samples came from the same dataset and each subsample contained the same number of innings (5,981).
This demonstrates that rare events such as five-wicket hauls fluctuate far more than averages, making them unreliable indicators when only a single sample is used.
Further question: how large must a sample be before the anomaly rate becomes stable?
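A rough way to approach this question is a standard-error calculation (a minimal sketch, assuming five-wicket hauls behave like independent Bernoulli events occurring at the full-data rate p, so the standard error of the rate is sqrt(p(1 - p)/n)):

# Expected sampling spread of the anomaly rate at different sample sizes,
# assuming 5-wicket hauls occur independently at the full-data rate p
p_full <- mean(IPL$wickets >= 5, na.rm = TRUE)

tibble(n = c(2392, 5981, 17943, nrow(IPL))) |>
  mutate(se_rate = sqrt(p_full * (1 - p_full) / n))

The standard error shrinks only with the square root of n, which is consistent with rare-event counts remaining noticeably variable even in fairly large samples.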
IPL_samples |>
  group_by(sample_num) |>
  summarise(n_teams = n_distinct(team),
            n_venues = n_distinct(venue))
## # A tibble: 3 × 3
## sample_num n_teams n_venues
## <int> <int> <int>
## 1 1 19 58
## 2 2 19 58
## 3 3 19 58
All three subsamples contained the same number of distinct teams and venues, showing that random sampling preserved team and venue diversity.
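The same structural check can be extended to seasons (a minimal sketch; whether every season from 2008–2024 appears in each subsample depends on the particular random draw):

# Check whether each subsample also covers the full range of seasons
IPL_samples |>
  group_by(sample_num) |>
  summarise(n_seasons = n_distinct(season))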
Drawing 3 random samples at each of three sizes: 10%, 25%, and 75% of the dataset.
For each subsample, calculating summary statistics of the key variables and the anomaly rate, where an anomaly is a five-wicket haul.
Comparing how much the summary statistics vary within each sample size.
set.seed(123)

# Sample sizes
sample_fracs <- c(0.10, 0.25, 0.75)
n_samples <- 3

IPL_multi <- tibble()

# Generate 3 samples for each sample size
for (f in sample_fracs) {
  for (sample_i in 1:n_samples) {
    IPL_i <- IPL |>
      sample_n(size = f * nrow(IPL), replace = TRUE) |>
      mutate(
        sample_num = sample_i,
        sample_frac = f
      )
    IPL_multi <- bind_rows(IPL_multi, IPL_i)
  }
}
# Summaries for each sample size and repetition
summary_multi <- IPL_multi |>
  group_by(sample_frac, sample_num) |>
  summarise(
    n_innings = n(),
    avg_runs = mean(runs, na.rm = TRUE),
    sd_runs = sd(runs, na.rm = TRUE),
    avg_wickets = mean(wickets, na.rm = TRUE),
    avg_strike_rate = mean(strike_rate, na.rm = TRUE),
    n_5w = sum(wickets >= 5, na.rm = TRUE),
    anomaly_rate = n_5w / n_innings,
    .groups = "drop"
  )
summary_multi
## # A tibble: 9 × 9
## sample_frac sample_num n_innings avg_runs sd_runs avg_wickets avg_strike_rate
## <dbl> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.1 1 2392 13.5 19.4 0.475 77.2
## 2 0.1 2 2392 14.7 20.9 0.447 76.1
## 3 0.1 3 2392 14.1 19.7 0.442 77.8
## 4 0.25 1 5981 13.5 19.8 0.479 74.9
## 5 0.25 2 5981 13.4 19.8 0.476 73.2
## 6 0.25 3 5981 14.1 20.1 0.481 75.6
## 7 0.75 1 17943 13.7 20.1 0.484 75.2
## 8 0.75 2 17943 13.8 19.9 0.481 75.3
## 9 0.75 3 17943 13.6 19.5 0.469 75.9
## # ℹ 2 more variables: n_5w <int>, anomaly_rate <dbl>
Smaller samples (10%) show the most variability, while larger samples (75%) produce much more stable summaries. The 25% samples fall in the middle.
10% samples showed the highest variability across the three repetitions.
Average runs ranged from 13.5 to 14.7, the standard deviation of runs from 19.4 to 20.9, and the average strike rate varied from 76.06 to 77.75.
This shows that small samples are more variable and sensitive to random chance.
25% samples were more stable, but still showed noticeable variation.
Average runs stayed between 13.36 and 14.14, and the standard deviation of runs between 19.77 and 20.08.
Rare events (5‑wicket hauls) still fluctuated significantly.
75% samples were the most stable and consistent.
Average runs stayed tightly between 13.61 and 13.80, the standard deviation of runs between 19.46 and 20.06, and the average strike rate around 75.22–75.89.
Even anomaly counts became more consistent (8, 24, 24).
As sample size increased, all metrics converged toward the population values, and variability across the three repetitions decreased dramatically.
Demonstrates the fundamental statistical principle that larger samples produce more reliable and stable estimates.
Further question: how do team-level or venue-level summaries behave across different sample sizes?
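One way to start on that question (a minimal sketch using the IPL_multi samples created above; the exact values depend on the random draws) is to compute a team-level average for each sample size and repetition, and then see how much each team's average moves across the three repetitions:

# Team-level average runs per innings, by sample size and repetition
team_summary <- IPL_multi |>
  group_by(sample_frac, sample_num, team) |>
  summarise(avg_runs = mean(runs, na.rm = TRUE), .groups = "drop")

# Spread of each team's average across the three repetitions, by sample size
team_summary |>
  group_by(sample_frac, team) |>
  summarise(spread_avg_runs = max(avg_runs) - min(avg_runs), .groups = "drop") |>
  arrange(desc(spread_avg_runs))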
Always consider sample size before trusting a result: avoid drawing strong conclusions from small or single samples, and use larger samples whenever possible.
Check consistency across multiple samples, especially when interpreting rare events, since stable averages do not guarantee stable anomaly rates.
Understand that sampling variability is unavoidable and avoid overgeneralizing patterns from limited data.
Use structural checks (teams, venues, seasons) to ensure that samples represent the full population before interpreting results.
In future analysis, conclusions should be framed with uncertainty, acknowledging that different samples may lead to slightly different interpretations unless the sample is sufficiently large.
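As one concrete way to attach that uncertainty to an estimate, a simple bootstrap percentile interval for the mean runs per innings could be reported instead of a single number (a minimal sketch, not part of the analysis above):

# Bootstrap the mean runs per innings to express the estimate with uncertainty
set.seed(123)
boot_means <- replicate(1000, mean(sample(IPL$runs, replace = TRUE), na.rm = TRUE))
quantile(boot_means, probs = c(0.025, 0.975))   # 95% percentile interval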