Introduction

This Data Dive explores the IPL Player Performance Dataset, focusing on why data documentation matters and why an analysis should keep referring back to it.

It covers:

A list of two columns (or values) in the data that are unclear until the documentation is read.
One element of the data that remains unclear even after reading the documentation.
Two visualizations using a column affected by the issue in the previous item.
For two categorical columns:
- explicitly missing rows
- implicitly missing rows
- empty groups
For one continuous column:
- a definition of an outlier

Columns which are unclear until reading the documentation

Stumps

The stumps column is deceptively simple but actually quite ambiguous without documentation. At first glance, it is easy to assume that this value indicates whether a player was bowled out, since “hitting the stumps” is a common mode of dismissal in cricket. However, this interpretation is incorrect. The column does not represent how a player got out. Instead, it records the number of stumpings performed by the player, which is a completely different fielding action.

A stumping is a dismissal executed only by the wicketkeeper. This means the stumps column reflects how many stumpings a player (specifically the wicketkeeper) completed in that match, not whether the batter was dismissed via the stumps. Because only wicketkeepers can perform stumpings, this value will be 0 for almost every player in the dataset.

Stumpings are a rare but important wicket-keeping statistic for tracking keeper performance across matches, so the user needs to understand that the column refers to stumpings performed, not to a dismissal type.

Without reading the documentation, one might:

misinterpret stumps as indicating whether a batter was bowled
treat it as a player-level dismissal mode rather than a wicket-keeper action
incorrectly analyze batting dismissals using this column
misunderstand why almost all players have a value of 0

Because the meaning of “stumps” is not obvious from the column name alone, it can easily be misread unless the documentation clarifies that it represents stumpings performed by the wicketkeeper, not how a batter got out.
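A quick sanity check on this reading is to tabulate the column: if stumps counts stumpings made by the player, nearly every row should be 0. The sketch below uses a small made-up data frame (toy_innings and its values are illustrative assumptions, not rows from the IPL file); the same two steps could be pointed at the real data.

```r
library(dplyr)

# Toy data for illustration only: five player-match rows, one wicketkeeper.
toy_innings <- data.frame(
  player = c("Batter A", "Bowler B", "Keeper C", "Batter D", "Bowler E"),
  stumps = c(0, 0, 2, 0, 0)
)

# Frequency of each distinct stumps value
stumps_counts <- toy_innings |>
  count(stumps)

# Share of rows in which the player made no stumpings
share_zero <- mean(toy_innings$stumps == 0)

stumps_counts
share_zero
```

A share of zeros close to 1 is consistent with a keeper-only fielding statistic and would be hard to reconcile with a dismissal-mode flag.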

fantasy_points

The fantasy_points column contains a single numeric value for each player innings, but the raw data does not explain how these points are calculated. Fantasy scoring systems typically combine multiple performance columns, such as runs, wickets, strike rate, economy, and catches, but none of this is visible from the column itself.

Fantasy points summarize a player’s overall impact in a match into one number. A single numeric column is convenient, but only after the calculation rules are understood.

Without knowing the scoring system, we might:

assume fantasy points reflect batting performance only
misinterpret high fantasy scores as “good innings” even if they came from bowling
build incorrect models or visualizations based on misunderstood scoring logic

Because the scoring formula is not transparent, fantasy points can easily be misused or misinterpreted.
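To make the risk concrete, here is a minimal sketch of a composite score. The weights and the function hypothetical_points are invented for illustration; the dataset's actual scoring rules come from its documentation and may differ.

```r
# Hypothetical weights for illustration only; NOT the dataset's real formula.
hypothetical_points <- function(runs, wickets, catches) {
  1 * runs + 25 * wickets + 8 * catches
}

# A solid batting innings with no bowling or fielding contribution
batting_innings <- hypothetical_points(runs = 60, wickets = 0, catches = 0)

# A low-scoring batter who took three wickets and a catch
bowling_innings <- hypothetical_points(runs = 2, wickets = 3, catches = 1)

batting_innings  # 60
bowling_innings  # 85
```

Under these assumed weights the bowling-led performance outscores the batting one, which is why a high fantasy score cannot be read as a "good innings" without knowing the formula.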

Element that is unclear even after reading the documentation

Venue

The venue column remains a little unclear, even after reviewing the dataset documentation. The documentation lists the stadium names used in the IPL, but it does not explain how venue names were standardized across seasons. As a result, the dataset contains multiple variations of what appears to be the same stadium. Examples include:

“Punjab Cricket Association Stadium, Mohali” vs “Punjab Cricket Association IS Bindra Stadium”
“Subrata Roy Sahara Stadium” vs “Maharashtra Cricket Association Stadium”
“MA Chidambaram Stadium, Chepauk” vs “MA Chidambaram Stadium”

The documentation does not clarify whether these differences represent:

true distinct venues
renamed stadiums
abbreviations
inconsistent data entry

This ambiguity makes it difficult to group matches by venue or analyze venue-specific performance trends. For example, if the same stadium appears under two slightly different names, the data gets split into separate categories, which can distort summaries, averages, and visualizations.

Even after reading the documentation, it is still unclear which venue names should be merged and in which city each stadium is located.
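One way to surface candidate duplicates is to reduce each name to a rough key and look for keys shared by more than one spelling. The sketch below is an assumption-laden heuristic, not the author's method: it keeps only the text before the first comma, replaces punctuation with spaces, and lowercases. The venue strings are taken from examples in this report; venue_key is a helper column introduced here.

```r
library(dplyr)
library(stringr)

venues <- data.frame(
  venue = c(
    "MA Chidambaram Stadium, Chepauk",
    "MA Chidambaram Stadium",
    "M Chinnaswamy Stadium",
    "M.Chinnaswamy Stadium",
    "Eden Gardens"
  )
)

candidate_duplicates <- venues |>
  mutate(
    venue_key = venue |>
      str_remove(",.*$") |>                   # drop everything from the first comma
      str_replace_all("[[:punct:]]", " ") |>  # punctuation becomes a space
      str_squish() |>                         # collapse repeated whitespace
      str_to_lower()
  ) |>
  group_by(venue_key) |>
  filter(n() > 1) |>                          # keep keys with several spellings
  ungroup() |>
  arrange(venue_key)

candidate_duplicates
```

A key like this only catches spelling and suffix variants. Renamed stadiums, such as the two Pune names listed above, share no text and still need a manual decision.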

Visualizations highlighting the unclear Venue issue

1: Number of Matches by Venue (Bar Chart)

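library(tidyverse)  # read_csv(), dplyr verbs, stringr, tidyr, ggplot2
library(lubridate)  # year()

ipl_raw <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")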

## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): player, team, match_outcome, opposition_team, venue
## dbl  (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note on data preparation: The dataset includes only 5 matches from 2025, so that season is incomplete and would distort season-level calculations. To avoid this, all rows from 2025 are filtered out, and the visualizations below use a cleaned dataset containing complete seasons only.

IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = year(date)
  ) |>
  filter(season < 2025)

matches_by_venue <- IPL |>
  distinct(match_id, venue) |>
  count(venue, name = "match_count")

# Top 25 venues by match count
matches_top25 <- matches_by_venue |>
  slice_max(order_by = match_count, n = 25)

matches_top25 |>
  ggplot() +
  geom_col(
    mapping = aes(
      x = reorder(venue, match_count),
      y = match_count
    ),
    fill = "steelblue"
  ) +
  geom_text(
    mapping = aes(
      x = venue,
      y = match_count,
      label = match_count
    ),
    hjust = -0.2,
    size = 3
    
  ) +
  coord_flip() +
  labs(
    title = "Top 25 Venues by Matches",
    x = "Venue",
    y = "Number of Matches"
  ) +
  theme_minimal()

This visualization makes the venue inconsistency immediately visible. For example, if “M Chinnaswamy Stadium,” “M.Chinnaswamy Stadium,” and “M Chinnaswamy Stadium, Bengaluru” appear as separate categories, they show up as three different bars even though they refer to the same stadium. This artificially splits the match counts and can mislead any venue‑based analysis. Without standardizing venue names, conclusions about match frequency, stadium popularity, or home‑ground effects would be unreliable.

2: Number of Matches by City (using Venue - City Mapping)

I standardized the venue column by mapping each stadium name to the city in which it is located. The venue column contains many inconsistencies, such as different spellings, punctuation differences, and multiple naming formats for the same stadium. These inconsistencies distort any venue-level analysis.

To reduce these negative consequences, I created a venue → city mapping that consolidates all variations of a stadium name into a single, correct city label. This mapping allows me to aggregate matches at the city level even if the venue names differ in the raw data.

venue_city_map <- c(
  "M Chinnaswamy Stadium" = "Bengaluru",
  "M.Chinnaswamy Stadium" = "Bengaluru",
  "M Chinnaswamy Stadium, Bengaluru" = "Bengaluru",
  "Dr DY Patil Sports Academy, Mumbai" = "Mumbai",
  "Dr DY Patil Sports Academy" = "Mumbai",
  "Eden Gardens, Kolkata" = "Kolkata",
  "Eden Gardens" = "Kolkata",
  "Wankhede Stadium, Mumbai" = "Mumbai",
  "Wankhede Stadium" = "Mumbai",
  "Rajiv Gandhi International Stadium, Uppal" = "Hyderabad",
  "Rajiv Gandhi International Stadium, Uppal, Hyderabad" = "Hyderabad",
  "Rajiv Gandhi International Stadium" = "Hyderabad",
  "Feroz Shah Kotla" = "Delhi",
  "Arun Jaitley Stadium" = "Delhi",
  "Arun Jaitley Stadium, Delhi" = "Delhi",
  "Dubai International Cricket Stadium" = "Other",
  "Sheikh Zayed Stadium" = "Other",
  "Sharjah Cricket Stadium" = "Other",
  "Zayed Cricket Stadium, Abu Dhabi" = "Other",
  "SuperSport Park" = "Other",
  "Kingsmead" = "Other",
  "St George's Park" = "Other",
  "Newlands" = "Other",
  "Buffalo Park" = "Other",
  "OUTsurance Oval" = "Other",
  "New Wanderers Stadium" = "Other",
  "De Beers Diamond Oval" = "Other",
  "MA Chidambaram Stadium, Chepauk" = "Chennai",
  "MA Chidambaram Stadium, Chepauk, Chennai" = "Chennai",
  "MA Chidambaram Stadium" = "Chennai",
  "Brabourne Stadium" = "Mumbai",
  "Brabourne Stadium, Mumbai" = "Mumbai",
  "Narendra Modi Stadium, Ahmedabad" = "Ahmedabad",
  "Sardar Patel Stadium, Motera" = "Ahmedabad",
  "Himachal Pradesh Cricket Association Stadium" = "Dharamshala",
  "Himachal Pradesh Cricket Association Stadium, Dharamsala" = "Dharamshala",
  "Punjab Cricket Association Stadium, Mohali" = "Mohali",
  "Punjab Cricket Association IS Bindra Stadium" = "Mohali",
  "Punjab Cricket Association IS Bindra Stadium, Mohali" = "Mohali",
  "Punjab Cricket Association IS Bindra Stadium, Mohali, Chandigarh" = "Mohali",
  "Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium" = "Vizag",
  "Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium, Visakhapatnam" = "Vizag",
  "Maharashtra Cricket Association Stadium" = "Pune",
  "Maharashtra Cricket Association Stadium, Pune" = "Pune",
  "Sawai Mansingh Stadium, Jaipur" = "Jaipur",
  "Sawai Mansingh Stadium" = "Jaipur",
  "Barabati Stadium" = "Cuttack",
  "Green Park" = "Kanpur",
  "Holkar Cricket Stadium" = "Indore",
  "JSCA International Stadium Complex" = "Ranchi",
  "Barsapara Cricket Stadium, Guwahati" = "Guwahati",
  "Nehru Stadium" = "Kochi",
  "Saurashtra Cricket Association Stadium" = "Rajkot",
  "Subrata Roy Sahara Stadium" = "Pune",
  "Shaheed Veer Narayan Singh International Stadium" = "Raipur",
  "Vidarbha Cricket Association Stadium, Jamtha" = "Nagpur",
  "Maharaja Yadavindra Singh International Cricket Stadium, Mullanpur" = "Chandigarh",
"Bharat Ratna Shri Atal Bihari Vajpayee Ekana Cricket Stadium, Lucknow" = "Lucknow"
  )

# Applying venue → city mapping
IPL_city <- IPL |>
  mutate(
    venue_city = recode(venue, !!!venue_city_map)
  )
# Creating match-level dataset 
matches_city_df <- IPL_city |>
  distinct(match_id, .keep_all = TRUE)

# Count matches by city
matches_by_city <- matches_city_df |>
  group_by(venue_city) |>
  summarise(match_count = n()) |>
  arrange(desc(match_count))

matches_by_city |>
  ggplot() +
  geom_col(
    mapping = aes(
      x = reorder(venue_city, match_count),
      y = match_count
    ),
    fill = "steelblue"
  ) +
  geom_text(
    mapping = aes(
      x = venue_city,
      y = match_count,
      label = match_count
    ),
    hjust = -0.2,
    size = 3
  ) +
  coord_flip() +
  labs(
    title = "Number of IPL Matches by City (after Venue Mapping)",
    x = "City",
    y = "Number of Matches"
  ) +
  theme_minimal()

By consolidating all variations of a stadium name into a single city label, the chart provides a clearer and more accurate picture of where matches have been played. After mapping, the inconsistencies in venue names no longer split the counts, so city-level comparisons are more meaningful and hosting patterns and trends can be analyzed more reliably.

Failing to standardize venue names before analysis introduces significant risks, such as incorrect city-level insights and misleading geographic patterns.
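A mapping like this is only as complete as its keys. dplyr's recode() leaves values without a match unchanged, so an unmapped stadium name would show up in the city chart as if it were a city. A minimal check, sketched here with toy stand-ins (toy_map and toy_venues are illustrative, not the real objects), compares the names in the data against the names of the mapping:

```r
# Toy stand-ins for the real mapping and venue column; values are illustrative.
toy_map <- c(
  "Eden Gardens" = "Kolkata",
  "Wankhede Stadium" = "Mumbai"
)
toy_venues <- c("Eden Gardens", "Wankhede Stadium", "Wankhede Stadium, Mumbai")

# Venue names that appear in the data but have no key in the mapping
unmapped <- setdiff(unique(toy_venues), names(toy_map))
unmapped
```

Here the comma-suffixed variant is reported as unmapped. An empty result on the real data would indicate that every venue name has a city assigned.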

Missing Rows in Categorical Columns

Explicitly Missing Rows (date and team)

ipl_raw |> summarise(explicit_missing_date = sum(is.na(date)))

## # A tibble: 1 × 1
##   explicit_missing_date
##                   <int>
## 1                     0

ipl_raw |> summarise(explicit_missing_team = sum(is.na(team)))

## # A tibble: 1 × 1
##   explicit_missing_team
##                   <int>
## 1                     0

ipl_raw |> filter(is.na(venue) | is.na(team))

## # A tibble: 0 × 22
## # ℹ 22 variables: match_id <dbl>, player <chr>, team <chr>, runs <dbl>,
## #   balls_faced <dbl>, fours <dbl>, sixes <dbl>, wickets <dbl>,
## #   overs_bowled <dbl>, balls_bowled <dbl>, runs_conceded <dbl>, catches <dbl>,
## #   run_outs <dbl>, maiden <dbl>, stumps <dbl>, match_outcome <chr>,
## #   opposition_team <chr>, strike_rate <dbl>, economy <dbl>,
## #   fantasy_points <dbl>, venue <chr>, date <date>

I checked the date and team columns in the raw IPL dataset, and both returned 0 explicitly missing rows; a combined filter on venue and team also returned no rows. This means none of these columns contains NA values (read_csv also reads blank entries as NA by default, so blanks would have been counted). Every row in the dataset has a valid date and team recorded. Because of this, there is no evidence of data loss or incomplete entries for these two categorical variables at the explicit level.
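The same check can be run on every column in one pass instead of one column at a time. The sketch below uses a small made-up data frame (toy and its values are assumptions for illustration) and counts NA values per column with across():

```r
library(dplyr)

# Toy data for illustration only: one NA in each column.
toy <- data.frame(
  team  = c("Mumbai Indians", NA, "Delhi Capitals"),
  venue = c("Eden Gardens", "Wankhede Stadium", NA),
  runs  = c(10, 25, NA)
)

# One row, one column per variable, each holding that variable's NA count
na_counts <- toy |>
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Applied to the raw data, a row of zeros would confirm that no column has explicitly missing values.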

Implicitly Missing Rows (date and team)

expected_teams <- c(
  "Mumbai Indians", "Chennai Super Kings", "Royal Challengers Bengaluru",
  "Delhi Capitals", "Kolkata Knight Riders", "Sunrisers Hyderabad",
  "Rajasthan Royals", "Punjab Kings", "Gujarat Titans", "Lucknow Super Giants",
  "Deccan Chargers", "Gujarat Lions", "Kochi Tuskers Kerala", "Pune Warriors",
  "Rising Pune Supergiant"
)

actual_teams <- unique(ipl_raw$team)

setdiff(expected_teams, actual_teams)

## character(0)

When comparing the expected list of IPL franchises with the actual team categories in the dataset, the result returned character(0). This means that no team categories are implicitly missing: every franchise in the expected list appears at least once in the data.

ipl_raw |>
  mutate(season = year(date)) |>
  distinct(match_id, season) |>
  count(season) |>
  print(n = Inf)

## # A tibble: 18 × 2
##    season     n
##     <dbl> <int>
##  1   2008    58
##  2   2009    57
##  3   2010    60
##  4   2011    73
##  5   2012    74
##  6   2013    76
##  7   2014    60
##  8   2015    59
##  9   2016    60
## 10   2017    59
## 11   2018    60
## 12   2019    60
## 13   2020    60
## 14   2021    60
## 15   2022    74
## 16   2023    74
## 17   2024    71
## 18   2025     5

I created a season column from the date column and counted matches per season. The results show that all seasons from 2008 to 2024 have a normal number of matches (between 57 and 76), but the 2025 season contains only 5 matches. Because every other season has at least 57 matches, the extremely low count for 2025 indicates that this season is incomplete. Although there are no NA values in the date column, the missing matches create implicit missingness, because the category “2025” exists but is missing a large portion of its expected data.
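Spotting the short season by eye works for 18 rows, but the same idea can be written as a rule. The sketch below re-enters the per-season counts from the output above and flags any season with fewer than half the median number of matches. The 0.5 cutoff is an arbitrary assumption chosen for illustration, not a standard threshold.

```r
# Match counts per season, copied from the output above
season_counts <- data.frame(
  season = 2008:2025,
  n = c(58, 57, 60, 73, 74, 76, 60, 59, 60, 59, 60, 60, 60, 60, 74, 74, 71, 5)
)

# Assumed rule: a season is suspicious if it has < 50% of the median count
threshold <- 0.5 * median(season_counts$n)

incomplete <- season_counts[season_counts$n < threshold, ]
incomplete
```

With a median of 60 matches the cutoff is 30, and only 2025 falls below it.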

Empty Groups

ipl_raw_clean <- ipl_raw |>
  mutate(
    team = case_when(
      str_detect(team, "Delhi") ~ "Delhi Daredevils",
      str_detect(team, "Punjab") ~ "Kings XI Punjab",
      str_detect(team, "Rising Pune") ~ "Rising Pune Supergiant",
      str_detect(team, "Royal Challengers") ~ "Royal Challengers Bangalore",
      TRUE ~ team
    )
  )

all_combos <- expand_grid(
  team = unique(ipl_raw_clean$team),
  season = unique(year(ipl_raw_clean$date))
)

actual_groups <- ipl_raw_clean |>
  mutate(season = year(date)) |>
  distinct(team, season)

empty_groups <- anti_join(all_combos, actual_groups, by = c("team", "season"))
empty_groups

## # A tibble: 114 × 2
##    team                 season
##    <chr>                 <dbl>
##  1 Lucknow Super Giants   2013
##  2 Lucknow Super Giants   2008
##  3 Lucknow Super Giants   2011
##  4 Lucknow Super Giants   2016
##  5 Lucknow Super Giants   2014
##  6 Lucknow Super Giants   2015
##  7 Lucknow Super Giants   2012
##  8 Lucknow Super Giants   2018
##  9 Lucknow Super Giants   2017
## 10 Lucknow Super Giants   2020
## # ℹ 104 more rows

For empty groups, I first standardized team names to remove inconsistencies (e.g., “Royal Challengers Bangalore” vs. “Royal Challengers Bengaluru”). I then generated all possible team × season combinations and compared them with the combinations that appear in the dataset. The resulting empty groups represent team–season pairs where the team exists in the dataset and the season exists in the dataset, but the team has no observations for that season.

These are not missing values; they simply reflect which teams participated in which IPL seasons. For example, Lucknow Super Giants appear as empty groups for seasons before 2022 because the franchise did not exist then, and older teams like Deccan Chargers or Kochi Tuskers Kerala appear only in the seasons in which they participated. Empty groups therefore indicate valid category combinations with zero rows due to team participation history, not due to missing data.
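An alternative to the anti_join approach is to keep the empty groups in the summary table with an explicit count of 0. The sketch below uses tidyr's complete() on a toy data frame (toy and the team names are made up for illustration) to fill in the combinations that never occur:

```r
library(dplyr)
library(tidyr)

# Toy data for illustration only: Team B has no rows for 2021.
toy <- data.frame(
  team   = c("Team A", "Team A", "Team B"),
  season = c(2021, 2022, 2022)
)

group_counts <- toy |>
  count(team, season) |>
  complete(team, season, fill = list(n = 0))  # add absent pairs with n = 0

group_counts
```

Rows with n equal to 0 are the empty groups. Keeping them in the table makes it harder to mistake an absent team-season pair for one that was merely filtered out.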

Outlier Definition for a Continuous Column

I selected the economy column, which measures the number of runs a player concedes per over. Economy rate is a continuous variable and is known to be right‑skewed in cricket, because a few extremely expensive spells can inflate the upper tail of the distribution.

For the economy rate variable, an outlier represents a bowling performance that is unusually economical or unusually expensive compared to the typical range of career economy rates. To identify such extreme values, I use the interquartile range (IQR). I calculate the first quartile (Q1) and third quartile (Q3) of the economy rate distribution, then compute the IQR as Q3 - Q1. Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are flagged.

Any player whose career economy rate falls outside the whiskers of the boxplot is considered an outlier. These whiskers are based on the IQR rule and capture the expected spread of the data, so values beyond them represent truly exceptional long‑term performance patterns rather than one‑off match fluctuations.
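As a worked example of the rule, the sketch below applies the 1.5 × IQR fences to eight made-up economy rates (the vector economy is illustrative, not taken from the dataset). It uses R's default quantile method, the same one the analysis code relies on.

```r
# Toy economy rates for illustration only
economy <- c(6.5, 7.0, 7.2, 7.5, 7.8, 8.0, 8.2, 12.5)

q1  <- unname(quantile(economy, 0.25))  # 7.15
q3  <- unname(quantile(economy, 0.75))  # 8.05
iqr <- q3 - q1                          # 0.90

lower <- q1 - 1.5 * iqr                 # 5.80
upper <- q3 + 1.5 * iqr                 # 9.40

# Values outside the fences are flagged as outliers
is_outlier <- economy < lower | economy > upper
economy[is_outlier]
```

Only 12.5 lies outside the fences of 5.8 and 9.4, so it is the single flagged value.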

# Computing career totals per bowler
career_stats <- ipl_raw_clean |>
  group_by(player) |>
  summarise(
    total_wkts = sum(wickets, na.rm = TRUE),
    total_runs = sum(runs_conceded, na.rm = TRUE),
    total_overs = sum(overs_bowled, na.rm = TRUE)
  ) |>
  mutate(
    career_economy = total_runs / total_overs
  ) |>
  drop_na(career_economy)

# Selecting top 100 players by wickets
top100 <- career_stats |>
  arrange(desc(total_wkts)) |>
  slice_head(n = 100)

# Computing IQR thresholds
iqr_vals <- top100 |>
  summarise(
    Q1 = quantile(career_economy, 0.25),
    Q3 = quantile(career_economy, 0.75)
  ) |>
  mutate(
    IQR = Q3 - Q1,
    lower = Q1 - 1.5 * IQR,
    upper = Q3 + 1.5 * IQR
  )

lower_thr <- iqr_vals$lower
upper_thr <- iqr_vals$upper

# Flagging outliers
top100 <- top100 |>
  mutate(
    outlier = career_economy < lower_thr | career_economy > upper_thr
  )

# Boxplot
ggplot(top100) +
  geom_boxplot(aes(x = "", y = career_economy), fill = "lightblue") +
  geom_jitter(
    aes(x = "", y = career_economy, color = outlier),
    width = 0.15,
    alpha = 0.7
  ) +
  scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +
  labs(
    title = "Career Economy Rate for Top 100 Wicket-Takers",
    x = "",
    y = "Career Economy Rate",
    color = "Outlier"
  ) +
  theme_minimal()

The boxplot shows the distribution of career economy rates for the top 100 wicket-taking bowlers in the dataset. The central box represents the middle 50% of bowlers, with the median economy rate lying near the center of the box. The relatively narrow interquartile range indicates that most top bowlers maintain economy rates within a fairly consistent band. The whiskers extend to the minimum and maximum values that fall within the IQR rule, and in this case, no points fall outside the whiskers, meaning there are no statistical outliers in the career economy rates of these top performers. This suggests that among bowlers with substantial careers, economy rates tend to cluster tightly, and extreme long-term performances (either exceptionally economical or unusually expensive) are rare. Overall, the distribution reflects a stable and balanced set of bowling performances across the top wicket-takers.

Further question: Do certain venues produce more high-economy outliers?

Week5_Datadive

Mayank Gupta

2026-02-10