This Data Dive explores the IPL Player Performance Dataset to show why data documentation matters and why analysts should refer to it.
A list of two columns (or values) in the data that are unclear until the documentation is read.
One element of the data that is unclear even after reading the documentation.
Two visualizations using a column of data that is affected by the issue above.
two categorical columns
explicitly missing rows
implicitly missing rows
empty groups
one continuous column
outlier
The stumps column is deceptively simple but actually quite ambiguous without documentation. At first glance, it is easy to assume that this value indicates whether a player was bowled out, since “hitting the stumps” is a common mode of dismissal in cricket. However, this interpretation is incorrect. The column does not represent how a player got out. Instead, it records the number of stumpings performed by the player, which is a completely different fielding action.
A stumping is a dismissal executed only by the wicketkeeper. This means the stumps column reflects how many stumpings a player (specifically the wicketkeeper) completed in that match, not whether the batter was dismissed via the stumps. Because only wicketkeepers can perform stumpings, this value will be 0 for almost every player in the dataset.
Stumpings are a rare but important wicketkeeping statistic for tracking keeper performance across matches, so the user needs to understand that the column refers to stumpings performed, not dismissal type.
Without reading the documentation, one might:
misinterpret stumps as indicating whether a batter was bowled
treat it as a player‑level dismissal mode rather than a wicket‑keeper action
incorrectly analyze batting dismissals using this column
misunderstand why almost all players have a value of 0
Because the meaning of “stumps” is not obvious from the column name alone, it can easily be misread unless the documentation clarifies that it represents stumpings performed by the wicketkeeper, not how a batter got out.
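One quick way to sanity-check this reading of stumps is to count how many rows carry a non-zero value. The sketch below uses a small made-up tibble (the player labels and values are hypothetical, not taken from the dataset); on the real data the same summarise() call could be applied to ipl_raw.

```r
library(dplyr)
library(tibble)

# Toy rows standing in for the real data; all values are invented.
toy <- tibble(
  player = c("Keeper A", "Batter B", "Bowler C", "Keeper D", "Batter E"),
  stumps = c(1, 0, 0, 2, 0)
)

# Count the rows with at least one stumping and the share of zeros.
stumps_summary <- toy |>
  summarise(
    rows_total   = n(),
    rows_nonzero = sum(stumps > 0),
    share_zero   = mean(stumps == 0)
  )

stumps_summary
```

If stumps really records stumpings performed, the share of zeros in the full dataset should be very high, and the non-zero rows should belong to wicketkeepers.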
The fantasy_points column contains a single numeric value for each player innings, but the dataset does not explain how these points are calculated. Fantasy scoring systems typically combine multiple performance columns such as runs, wickets, strike rate, economy, and catches, but none of this is visible from the raw data.
Fantasy points summarize a player’s overall impact in a match into one number. A single numeric column is convenient, but only after the calculation rules are understood.
Without knowing the scoring system, we might:
assume fantasy points reflect batting performance only
misinterpret high fantasy scores as “good innings” even if they came from bowling
build incorrect models or visualizations based on misunderstood scoring logic
Because the scoring formula is not transparent, fantasy points can easily be misused or misinterpreted.
The venue column remains a little unclear, even after reviewing the dataset documentation. The documentation lists the stadium names used in the IPL, but it does not explain how venue names were standardized across seasons. As a result, the dataset contains multiple variations of what appears to be the same stadium. Examples include:
“Punjab Cricket Association Stadium, Mohali” vs “Punjab Cricket Association IS Bindra Stadium”
“Subrata Roy Sahara Stadium” vs “Maharashtra Cricket Association Stadium”
“MA Chidambaram Stadium, Chepauk” vs “MA Chidambaram Stadium”
The documentation does not clarify whether these differences represent:
true distinct venues
renamed stadiums
abbreviations
inconsistent data entry
This ambiguity makes it difficult to group matches by venue or analyze venue-specific performance trends. For example, if the same stadium appears under two slightly different names, the data gets split into separate categories, which can distort summaries, averages, and visualizations.
Even after reading the documentation, it is still unclear which venue names should be merged and in which city each stadium is located.
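One way to surface candidate duplicates is to reduce each venue name to a rough comparison key and look for keys shared by more than one spelling. The sketch below is only a heuristic under stated assumptions: the normalization steps (dropping text after the first comma, replacing punctuation, lower-casing) and the name venue_key are illustrative choices, and the example rows are a small hand-typed sample.

```r
library(dplyr)
library(stringr)
library(tibble)

# A small hand-typed sample of venue spellings.
venues <- tibble(
  venue = c(
    "M Chinnaswamy Stadium",
    "M.Chinnaswamy Stadium",
    "M Chinnaswamy Stadium, Bengaluru",
    "Eden Gardens",
    "Eden Gardens, Kolkata"
  )
)

# Build a rough comparison key for each spelling.
venue_keys <- venues |>
  mutate(
    venue_key = venue |>
      str_remove(",.*$") |>                   # drop everything after the first comma
      str_replace_all("[[:punct:]]", " ") |>  # turn remaining punctuation into spaces
      str_squish() |>                         # collapse repeated whitespace
      str_to_lower()
  )

# Keys shared by more than one raw spelling are candidates for merging.
candidates <- venue_keys |>
  group_by(venue_key) |>
  summarise(n_spellings = n_distinct(venue), .groups = "drop") |>
  filter(n_spellings > 1)

candidates
```

A key-based check like this only catches spelling and formatting variants. Renamed stadiums would not share a key, so they still need documentation or outside knowledge.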
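library(tidyverse)   # read_csv(), dplyr verbs, ggplot2, stringr, tidyr
library(lubridate)   # year()
ipl_raw <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")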
## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): player, team, match_outcome, opposition_team, venue
## dbl (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note: Data Preparation: The dataset includes only 5 matches from 2025, so that season is incomplete and would distort season-level calculations. To avoid this, I filtered out all rows from 2025 and used a clean dataset containing complete seasons only for further analysis.
IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = year(date)
  ) |>
  filter(season < 2025)
matches_by_venue <- IPL |>
  distinct(match_id, venue) |>
  count(venue, name = "match_count")
# Top 25 venues by match count
matches_top25 <- matches_by_venue |>
  slice_max(order_by = match_count, n = 25)
matches_top25 |>
  ggplot(aes(x = reorder(venue, match_count), y = match_count)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = match_count), hjust = -0.2, size = 3) +
  coord_flip() +
  labs(
    title = "Top 25 Venues by Matches",
    x = "Venue",
    y = "Number of Matches"
  ) +
  theme_minimal()
This visualization makes the venue inconsistency immediately visible. For example, if “M Chinnaswamy Stadium,” “M.Chinnaswamy Stadium,” and “M Chinnaswamy Stadium, Bengaluru” appear as separate categories, they show up as three different bars even though they refer to the same stadium. This artificially splits the match counts and can mislead any venue‑based analysis. Without standardizing venue names, conclusions about match frequency, stadium popularity, or home‑ground effects would be unreliable.
I standardized the venue column by mapping each stadium name to the city in which it is located. The venue column contains many inconsistencies such as different spellings, punctuation differences, and multiple naming formats for the same stadium. These inconsistencies distort any venue level analysis.
To reduce these negative consequences, I created a venue → city mapping that consolidates all variations of a stadium name into a single city label. Venues outside India (in the UAE and South Africa) are grouped under a single "Other" label. This mapping allows me to aggregate matches at the city level even if the venue names differ in the raw data.
venue_city_map <- c(
"M Chinnaswamy Stadium" = "Bengaluru",
"M.Chinnaswamy Stadium" = "Bengaluru",
"M Chinnaswamy Stadium, Bengaluru" = "Bengaluru",
"Dr DY Patil Sports Academy, Mumbai" = "Mumbai",
"Dr DY Patil Sports Academy" = "Mumbai",
"Eden Gardens, Kolkata" = "Kolkata",
"Eden Gardens" = "Kolkata",
"Wankhede Stadium, Mumbai" = "Mumbai",
"Wankhede Stadium" = "Mumbai",
"Rajiv Gandhi International Stadium, Uppal" = "Hyderabad",
"Rajiv Gandhi International Stadium, Uppal, Hyderabad" = "Hyderabad",
"Rajiv Gandhi International Stadium" = "Hyderabad",
"Feroz Shah Kotla" = "Delhi",
"Arun Jaitley Stadium" = "Delhi",
"Arun Jaitley Stadium, Delhi" = "Delhi",
"Dubai International Cricket Stadium" = "Other",
"Sheikh Zayed Stadium" = "Other",
"Sharjah Cricket Stadium" = "Other",
"Zayed Cricket Stadium, Abu Dhabi" = "Other",
"SuperSport Park" = "Other",
"Kingsmead" = "Other",
"St George's Park" = "Other",
"Newlands" = "Other",
"Buffalo Park" = "Other",
"OUTsurance Oval" = "Other",
"New Wanderers Stadium" = "Other",
"De Beers Diamond Oval" = "Other",
"MA Chidambaram Stadium, Chepauk" = "Chennai",
"MA Chidambaram Stadium, Chepauk, Chennai" = "Chennai",
"MA Chidambaram Stadium" = "Chennai",
"Brabourne Stadium" = "Mumbai",
"Brabourne Stadium, Mumbai" = "Mumbai",
"Narendra Modi Stadium, Ahmedabad" = "Ahmedabad",
"Sardar Patel Stadium, Motera" = "Ahmedabad",
"Himachal Pradesh Cricket Association Stadium" = "Dharamshala",
"Himachal Pradesh Cricket Association Stadium, Dharamsala" = "Dharamshala",
"Punjab Cricket Association Stadium, Mohali" = "Mohali",
"Punjab Cricket Association IS Bindra Stadium" = "Mohali",
"Punjab Cricket Association IS Bindra Stadium, Mohali" = "Mohali",
"Punjab Cricket Association IS Bindra Stadium, Mohali, Chandigarh" = "Mohali",
"Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium" = "Vizag",
"Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium, Visakhapatnam" = "Vizag",
"Maharashtra Cricket Association Stadium" = "Pune",
"Maharashtra Cricket Association Stadium, Pune" = "Pune",
"Sawai Mansingh Stadium, Jaipur" = "Jaipur",
"Sawai Mansingh Stadium" = "Jaipur",
"Barabati Stadium" = "Cuttack",
"Green Park" = "Kanpur",
"Holkar Cricket Stadium" = "Indore",
"JSCA International Stadium Complex" = "Ranchi",
"Barsapara Cricket Stadium, Guwahati" = "Guwahati",
"Nehru Stadium" = "Kochi",
"Saurashtra Cricket Association Stadium" = "Rajkot",
"Subrata Roy Sahara Stadium" = "Pune",
"Shaheed Veer Narayan Singh International Stadium" = "Raipur",
"Vidarbha Cricket Association Stadium, Jamtha" = "Nagpur",
"Maharaja Yadavindra Singh International Cricket Stadium, Mullanpur" = "Chandigarh",
"Bharat Ratna Shri Atal Bihari Vajpayee Ekana Cricket Stadium, Lucknow" = "Lucknow"
)
# Applying venue → city mapping
IPL_city <- IPL |>
  mutate(
    venue_city = recode(venue, !!!venue_city_map)
  )
# Creating match-level dataset
matches_city_df <- IPL_city |>
  distinct(match_id, .keep_all = TRUE)
# Count matches by city
matches_by_city <- matches_city_df |>
  group_by(venue_city) |>
  summarise(match_count = n()) |>
  arrange(desc(match_count))
matches_by_city |>
  ggplot(aes(x = reorder(venue_city, match_count), y = match_count)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = match_count), hjust = -0.2, size = 3) +
  coord_flip() +
  labs(
    title = "Number of IPL Matches by City (after Venue Mapping)",
    x = "City",
    y = "Number of Matches"
  ) +
  theme_minimal()
By consolidating all variations of a stadium name into a single city label, the chart provides a clearer and more accurate picture of where matches have been played. After mapping, the inconsistencies in venue names no longer split the counts, so the cleaned visualization makes city‑level comparisons more meaningful and supports more reliable analysis of hosting patterns and trends.
Failing to standardize venue names before analysis introduces significant risks like incorrect city‑level insights and misleading geographic patterns.
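Because recode() leaves values without a match unchanged, any venue missing from the mapping would carry its raw stadium name into the city column. A minimal sketch of a coverage check is shown below; toy_map and toy_matches are small hypothetical stand-ins for venue_city_map and the match-level data.

```r
library(dplyr)
library(tibble)

# Hypothetical, deliberately incomplete mapping.
toy_map <- c(
  "Eden Gardens" = "Kolkata",
  "Eden Gardens, Kolkata" = "Kolkata",
  "Wankhede Stadium" = "Mumbai"
)

# Hypothetical match-level rows.
toy_matches <- tibble(
  match_id = 1:4,
  venue = c(
    "Eden Gardens",
    "Eden Gardens, Kolkata",
    "Wankhede Stadium",
    "Wankhede Stadium, Mumbai"
  )
)

# Venues that appear in the data but have no entry in the mapping.
unmapped <- toy_matches |>
  filter(!venue %in% names(toy_map)) |>
  distinct(venue)

unmapped
```

An empty result means every venue in the data has a city label. Any rows returned are names that still need an entry in the mapping.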
ipl_raw |> summarise(explicit_missing_date = sum(is.na(date)))
## # A tibble: 1 × 1
## explicit_missing_date
## <int>
## 1 0
ipl_raw |> summarise(explicit_missing_team = sum(is.na(team)))
## # A tibble: 1 × 1
## explicit_missing_team
## <int>
## 1 0
ipl_raw |> filter(is.na(venue) | is.na(team))
## # A tibble: 0 × 22
## # ℹ 22 variables: match_id <dbl>, player <chr>, team <chr>, runs <dbl>,
## # balls_faced <dbl>, fours <dbl>, sixes <dbl>, wickets <dbl>,
## # overs_bowled <dbl>, balls_bowled <dbl>, runs_conceded <dbl>, catches <dbl>,
## # run_outs <dbl>, maiden <dbl>, stumps <dbl>, match_outcome <chr>,
## # opposition_team <chr>, strike_rate <dbl>, economy <dbl>,
## # fantasy_points <dbl>, venue <chr>, date <date>
I checked the date and team columns in the raw IPL dataset, and both returned 0 explicitly missing rows. A combined filter on venue and team also returned zero rows. This means none of these columns contains NA values (read_csv() reads empty fields as NA by default, so blank entries would have been counted too). Every row in the dataset has a valid date and team recorded. Because of this, there is no evidence of data loss or incomplete entries for these two categorical variables at the explicit level.
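The same check can be run across every column at once, and extended to blank strings in case a file is read in a way that does not convert them to NA. The sketch below uses a made-up tibble; the `__` separator in the generated column names is an arbitrary choice so that real column names containing a single underscore (such as match_id) would still split cleanly.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy data with both NA values and blank strings; all values are invented.
toy <- tibble(
  team  = c("Mumbai Indians", NA, "", "Punjab Kings"),
  venue = c("Eden Gardens", "Wankhede Stadium", NA, "  "),
  runs  = c(10, NA, 3, 0)
)

# For every column, count NA values and non-NA blank strings.
missing_summary <- toy |>
  summarise(across(
    everything(),
    list(
      na    = ~ sum(is.na(.x)),
      blank = ~ sum(!is.na(.x) & trimws(as.character(.x)) == "")
    ),
    .names = "{.col}__{.fn}"
  )) |>
  pivot_longer(
    everything(),
    names_to = c("column", ".value"),
    names_sep = "__"
  )

missing_summary
```

The result has one row per column, with separate counts for NA values and blank strings.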
expected_teams <- c(
"Mumbai Indians", "Chennai Super Kings", "Royal Challengers Bengaluru",
"Delhi Capitals", "Kolkata Knight Riders", "Sunrisers Hyderabad",
"Rajasthan Royals", "Punjab Kings", "Gujarat Titans", "Lucknow Super Giants",
"Deccan Chargers", "Gujarat Lions", "Kochi Tuskers Kerala", "Pune Warriors",
"Rising Pune Supergiant"
)
actual_teams <- unique(ipl_raw$team)
setdiff(expected_teams, actual_teams)
## character(0)
When comparing the expected list of IPL franchises with the actual team categories in the dataset, the result is character(0). This means that no team in the expected list is implicitly missing: every franchise named in expected_teams appears at least once in the data. The check runs in one direction only, so it does not reveal team names that are present in the data but absent from the expected list.
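Running setdiff() in both directions separates two different problems: expected categories that never appear, and categories in the data that were not anticipated (for example, older names of the same franchise). The vectors below are short hypothetical examples, not the full team lists from the dataset.

```r
# Hypothetical expected and observed team names.
expected <- c("Mumbai Indians", "Delhi Capitals", "Punjab Kings")
actual <- c(
  "Mumbai Indians", "Delhi Capitals", "Delhi Daredevils",
  "Punjab Kings", "Kings XI Punjab"
)

# Expected teams that never appear in the data (implicitly missing categories).
missing_from_data <- setdiff(expected, actual)

# Names in the data that are not on the expected list (possible aliases).
unexpected_in_data <- setdiff(actual, expected)

missing_from_data
unexpected_in_data
```

An empty first result together with a non-empty second result suggests that nothing is missing but that some team names may need to be standardized before grouping.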
ipl_raw |>
  mutate(season = year(date)) |>
  distinct(match_id, season) |>
  count(season) |>
  print(n = Inf)
## # A tibble: 18 × 2
## season n
## <dbl> <int>
## 1 2008 58
## 2 2009 57
## 3 2010 60
## 4 2011 73
## 5 2012 74
## 6 2013 76
## 7 2014 60
## 8 2015 59
## 9 2016 60
## 10 2017 59
## 11 2018 60
## 12 2019 60
## 13 2020 60
## 14 2021 60
## 15 2022 74
## 16 2023 74
## 17 2024 71
## 18 2025 5
I created a season column from the date column and counted matches per season. The results show that every season from 2008 to 2024 has between 57 and 76 matches, but the 2025 season contains only 5. Against that range, the extremely low count for 2025 indicates that this season is incomplete. Although there are no NA values in the date column, the missing matches create implicit missingness, because the category “2025” exists but is missing a large portion of its expected data.
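The visual check above can also be expressed as a rule. The sketch below re-enters the per-season counts from the table and flags any season whose count falls below half of the median; the 50% cutoff is an arbitrary assumption chosen for illustration, not a standard threshold.

```r
library(dplyr)
library(tibble)

# Match counts per season, copied from the table above.
season_counts <- tibble(
  season = 2008:2025,
  n = c(58, 57, 60, 73, 74, 76, 60, 59, 60, 59, 60, 60, 60, 60, 74, 74, 71, 5)
)

# Flag seasons with far fewer matches than the typical season.
flagged <- season_counts |>
  mutate(
    median_n   = median(n),
    suspicious = n < 0.5 * median_n  # cutoff is an illustrative assumption
  ) |>
  filter(suspicious)

flagged
```

With these counts only the 2025 season is flagged, which matches the conclusion drawn from the table.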
ipl_raw_clean <- ipl_raw |>
  mutate(
    team = case_when(
      str_detect(team, "Delhi") ~ "Delhi Daredevils",
      str_detect(team, "Punjab") ~ "Kings XI Punjab",
      str_detect(team, "Rising Pune") ~ "Rising Pune Supergiant",
      str_detect(team, "Royal Challengers") ~ "Royal Challengers Bangalore",
      TRUE ~ team
    )
  )
all_combos <- expand_grid(
  team = unique(ipl_raw_clean$team),
  season = unique(year(ipl_raw_clean$date))
)
actual_groups <- ipl_raw_clean |>
  mutate(season = year(date)) |>
  distinct(team, season)
empty_groups <- anti_join(all_combos, actual_groups, by = c("team", "season"))
empty_groups
## # A tibble: 114 × 2
## team season
## <chr> <dbl>
## 1 Lucknow Super Giants 2013
## 2 Lucknow Super Giants 2008
## 3 Lucknow Super Giants 2011
## 4 Lucknow Super Giants 2016
## 5 Lucknow Super Giants 2014
## 6 Lucknow Super Giants 2015
## 7 Lucknow Super Giants 2012
## 8 Lucknow Super Giants 2018
## 9 Lucknow Super Giants 2017
## 10 Lucknow Super Giants 2020
## # ℹ 104 more rows
For empty groups, I first standardized team names to remove inconsistencies (e.g., “Royal Challengers Bangalore” vs. “Royal Challengers Bengaluru”). I then generated all possible team × season combinations and compared them with the combinations that appear in the dataset. The resulting empty groups represent team–season pairs where the team exists in the dataset and the season exists in the dataset, but the team has no observations for that season.
These are not missing values; they simply reflect which teams participated in each IPL season. For example, Lucknow Super Giants appear as empty groups for seasons before 2022 because the franchise did not exist then, and older teams like Deccan Chargers or Kochi Tuskers Kerala appear only in the seasons in which they participated. Empty groups therefore indicate valid category combinations with zero rows due to team participation history, not due to missing data.
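An alternative to the anti-join is to make the empty groups explicit as zero counts with tidyr::complete(), which keeps them visible in later summaries and plots. The sketch below uses made-up team and season values; n_rows is a hypothetical column name.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy team-season rows; "Team B" has no rows for 2021.
toy <- tibble(
  team   = c("Team A", "Team A", "Team B"),
  season = c(2021, 2022, 2022)
)

# Count rows per team-season, then add the missing combinations as zeros.
counts <- toy |>
  count(team, season, name = "n_rows") |>
  complete(team, season, fill = list(n_rows = 0L))

counts
```

Rows with n_rows equal to 0 correspond to the empty groups found by the anti-join approach.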
I selected the economy column, which measures the number of runs a player concedes per over. Economy rate is a continuous variable and is known to be right‑skewed in cricket, because a few extremely expensive spells can inflate the upper tail of the distribution.
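When aggregating economy across matches, it can be safer to work from balls rather than overs. This is a hedged sketch that assumes overs might be stored in cricket notation (where 3.4 would mean 3 overs and 4 balls, which cannot be summed as a decimal); whether the dataset's overs_bowled column actually uses that notation is not stated in the documentation. The rows below are invented.

```r
library(dplyr)
library(tibble)

# Toy bowling spells; all values are invented.
spells <- tibble(
  player        = c("Bowler A", "Bowler A", "Bowler B"),
  runs_conceded = c(24, 30, 40),
  balls_bowled  = c(24, 22, 24)
)

# Career economy = total runs conceded per six balls bowled.
career <- spells |>
  group_by(player) |>
  summarise(
    total_runs  = sum(runs_conceded),
    total_balls = sum(balls_bowled),
    .groups = "drop"
  ) |>
  mutate(career_economy = total_runs / (total_balls / 6))

career
```

If overs_bowled is already a true decimal, dividing total runs by total overs gives the same result, so the balls-based version mainly guards against the notation issue.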
For the economy rate variable, an outlier represents a bowling performance that is unusually economical or unusually expensive compared to the typical range of career economy rates. To identify such extreme values, I use the interquartile range (IQR) rule: I calculate the first quartile (Q1) and third quartile (Q3) of the economy rate distribution, compute the IQR as Q3 - Q1, and flag values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
Any player whose career economy rate falls outside the whiskers of the boxplot is considered an outlier. These whiskers are based on the IQR rule and capture the expected spread of the data, so values beyond them represent truly exceptional long‑term performance patterns rather than one‑off match fluctuations.
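To make the IQR rule concrete, here is a small worked sketch on nine made-up economy rates (the numbers are illustrative only, not taken from the dataset). It uses base R's quantile() with its default method.

```r
# Nine invented career economy rates, one of them deliberately extreme.
economy <- c(6.5, 7.0, 7.2, 7.5, 7.8, 8.0, 8.3, 8.6, 12.5)

# Quartiles and interquartile range.
q1  <- quantile(economy, 0.25, names = FALSE)
q3  <- quantile(economy, 0.75, names = FALSE)
iqr <- q3 - q1

# Thresholds from the 1.5 * IQR rule.
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr

# Values outside the thresholds are treated as outliers.
outliers <- economy[economy < lower | economy > upper]

c(q1 = q1, q3 = q3, iqr = iqr, lower = lower, upper = upper)
outliers
```

Here Q1 is 7.2 and Q3 is 8.3, so the IQR is 1.1 and the thresholds are 5.55 and 9.95; only the value 12.5 falls outside them.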
# Computing career totals per bowler
career_stats <- ipl_raw_clean |>
  group_by(player) |>
  summarise(
    total_wkts = sum(wickets, na.rm = TRUE),
    total_runs = sum(runs_conceded, na.rm = TRUE),
    total_overs = sum(overs_bowled, na.rm = TRUE)
  ) |>
  mutate(
    career_economy = total_runs / total_overs
  ) |>
  drop_na(career_economy)
# Selecting top 100 players by wickets
top100 <- career_stats |>
  arrange(desc(total_wkts)) |>
  slice_head(n = 100)
# Computing IQR thresholds
iqr_vals <- top100 |>
  summarise(
    Q1 = quantile(career_economy, 0.25),
    Q3 = quantile(career_economy, 0.75)
  ) |>
  mutate(
    IQR = Q3 - Q1,
    lower = Q1 - 1.5 * IQR,
    upper = Q3 + 1.5 * IQR
  )
lower_thr <- iqr_vals$lower
upper_thr <- iqr_vals$upper
# Flagging outliers
top100 <- top100 |>
  mutate(
    outlier = career_economy < lower_thr | career_economy > upper_thr
  )
# Boxplot
ggplot(top100) +
  geom_boxplot(aes(x = "", y = career_economy), fill = "lightblue") +
  geom_jitter(
    aes(x = "", y = career_economy, color = outlier),
    width = 0.15,
    alpha = 0.7
  ) +
  scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red")) +
  labs(
    title = "Career Economy Rate for Top 100 Wicket-Takers",
    x = "",
    y = "Career Economy Rate",
    color = "Outlier"
  ) +
  theme_minimal()
The boxplot shows the distribution of career economy rates for the top 100 wicket‑taking bowlers in the dataset. The central box represents the middle 50% of bowlers, with the median economy rate lying near the center of the box. The relatively narrow interquartile range indicates that most top bowlers maintain economy rates within a fairly consistent band. The whiskers extend to the minimum and maximum values that fall within the IQR rule, and in this case no points fall outside the whiskers, meaning there are no statistical outliers in the career economy rates of these top performers. This suggests that among bowlers with substantial careers, economy rates tend to cluster tightly, and extreme long‑term performances, whether exceptionally economical or unusually expensive, are rare. Overall, the distribution reflects a stable and balanced set of bowling performances across the top wicket‑takers.
Further question: Do certain venues produce more high‑economy outliers?