Introduction

This Data Dive explores the IPL Player Performance Dataset with two analyses:

One-way ANOVA: to compare average team innings scores across venue cities.
Simple linear regression: to model runs as a function of boundary hitting.

library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(lubridate)  # year()

ipl_raw <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")

## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): player, team, match_outcome, opposition_team, venue
## dbl  (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note on data preparation: The dataset includes only 5 matches from 2025, so that season is incomplete and would distort the calculations. To avoid this, I filtered out all 2025 rows and used only complete seasons for the rest of the analysis.

IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = year(date)
  ) |>
  filter(season < 2025)

Variable selection

Response variable (continuous):

innings_runs — the total number of runs scored by a team in a single innings.

Runs per innings are calculated by summing runs within each match_id and team.

Explanatory variable (categorical):

venue_city — the city in which the match was played.

Venue names are first mapped to cities to reduce duplication and standardize locations.

One-way ANOVA

The goal of this test is to determine whether the average innings score (runs) differs across major IPL cities.

Hypothesis

Null Hypothesis (H₀): The mean team innings runs are equal across all selected venue cities.

\[ \LARGE \mu_{\text{innings_runs, City}_1} = \mu_{\text{innings_runs, City}_2} = \dots = \mu_{\text{innings_runs, City}_k} \]

Alternative Hypothesis (H₁): At least one venue city has a different mean for team innings runs.

\[\LARGE \mu_{\text{innings_runs, City}_j} \neq \mu_{\text{innings_runs, City}_m} \quad \text{for at least one pair } (j, m) \]

#Creating the venue → city mapping
venue_city_map <- c(
  "M Chinnaswamy Stadium" = "Bengaluru",
  "M.Chinnaswamy Stadium" = "Bengaluru",
  "M Chinnaswamy Stadium, Bengaluru" = "Bengaluru",
  "Dr DY Patil Sports Academy, Mumbai" = "Mumbai",
  "Dr DY Patil Sports Academy" = "Mumbai",
  "Eden Gardens, Kolkata" = "Kolkata",
  "Eden Gardens" = "Kolkata",
  "Wankhede Stadium, Mumbai" = "Mumbai",
  "Wankhede Stadium" = "Mumbai",
  "Rajiv Gandhi International Stadium, Uppal" = "Hyderabad",
  "Rajiv Gandhi International Stadium, Uppal, Hyderabad" = "Hyderabad",
  "Rajiv Gandhi International Stadium" = "Hyderabad",
  "Feroz Shah Kotla" = "Delhi",
  "Arun Jaitley Stadium" = "Delhi",
  "Arun Jaitley Stadium, Delhi" = "Delhi",
  "Dubai International Cricket Stadium" = "Other",
  "Sheikh Zayed Stadium" = "Other",
  "Sharjah Cricket Stadium" = "Other",
  "Zayed Cricket Stadium, Abu Dhabi" = "Other",
  "SuperSport Park" = "Other",
  "Kingsmead" = "Other",
  "St George's Park" = "Other",
  "Newlands" = "Other",
  "Buffalo Park" = "Other",
  "OUTsurance Oval" = "Other",
  "New Wanderers Stadium" = "Other",
  "De Beers Diamond Oval" = "Other",
  "MA Chidambaram Stadium, Chepauk" = "Chennai",
  "MA Chidambaram Stadium, Chepauk, Chennai" = "Chennai",
  "MA Chidambaram Stadium" = "Chennai",
  "Brabourne Stadium" = "Mumbai",
  "Brabourne Stadium, Mumbai" = "Mumbai",
  "Narendra Modi Stadium, Ahmedabad" = "Ahmedabad",
  "Sardar Patel Stadium, Motera" = "Ahmedabad",
  "Himachal Pradesh Cricket Association Stadium" = "Dharamshala",
  "Himachal Pradesh Cricket Association Stadium, Dharamsala" = "Dharamshala",
  "Punjab Cricket Association Stadium, Mohali" = "Mohali",
  "Punjab Cricket Association IS Bindra Stadium" = "Mohali",
  "Punjab Cricket Association IS Bindra Stadium, Mohali" = "Mohali",
  "Punjab Cricket Association IS Bindra Stadium, Mohali, Chandigarh" = "Mohali",
  "Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium" = "Vizag",
  "Dr. Y.S. Rajasekhara Reddy ACA-VDCA Cricket Stadium, Visakhapatnam" = "Vizag",
  "Maharashtra Cricket Association Stadium" = "Pune",
  "Maharashtra Cricket Association Stadium, Pune" = "Pune",
  "Sawai Mansingh Stadium, Jaipur" = "Jaipur",
  "Sawai Mansingh Stadium" = "Jaipur",
  "Barabati Stadium" = "Cuttack",
  "Green Park" = "Kanpur",
  "Holkar Cricket Stadium" = "Indore",
  "JSCA International Stadium Complex" = "Ranchi",
  "Barsapara Cricket Stadium, Guwahati" = "Guwahati",
  "Nehru Stadium" = "Kochi",
  "Saurashtra Cricket Association Stadium" = "Rajkot",
  "Subrata Roy Sahara Stadium" = "Pune",
  "Shaheed Veer Narayan Singh International Stadium" = "Raipur",
  "Vidarbha Cricket Association Stadium, Jamtha" = "Nagpur",
  "Maharaja Yadavindra Singh International Cricket Stadium, Mullanpur" = "Chandigarh"
)

# Apply the mapping; as.character() drops the names carried over from the lookup vector
IPL <- IPL |>
  mutate(
    venue_city = as.character(venue_city_map[venue])
  )

# Filter out neutral venues (venues missing from the map become NA and are also dropped here)
IPL <- IPL |>
  filter(venue_city != "Other")

# Create innings-level totals
innings_totals <- IPL |>
  group_by(match_id, team, venue_city) |>
  summarise(
    innings_runs = sum(runs, na.rm = TRUE),
    .groups = "drop"
  )
head(innings_totals)

## # A tibble: 6 × 4
##   match_id team                        venue_city innings_runs
##      <dbl> <chr>                       <chr>             <dbl>
## 1   335982 Kolkata Knight Riders       Bengaluru           205
## 2   335982 Royal Challengers Bangalore Bengaluru            63
## 3   335983 Chennai Super Kings         Mohali              234
## 4   335983 Kings XI Punjab             Mohali              196
## 5   335984 Delhi Daredevils            Delhi               122
## 6   335984 Rajasthan Royals            Delhi               122

# Count number of innings per city
city_counts <- innings_totals |>
  count(venue_city, name = "n_innings") |>
  arrange(desc(n_innings))

city_counts

## # A tibble: 22 × 2
##    venue_city n_innings
##    <chr>          <int>
##  1 Mumbai           364
##  2 Bengaluru        188
##  3 Kolkata          186
##  4 Delhi            180
##  5 Chennai          170
##  6 Hyderabad        154
##  7 Mohali           122
##  8 Jaipur           114
##  9 Pune             102
## 10 Ahmedabad         72
## # ℹ 12 more rows

IPL |> select(venue, venue_city) |> head(20)

## # A tibble: 20 × 2
##    venue                                     venue_city
##    <chr>                                     <chr>     
##  1 M Chinnaswamy Stadium                     Bengaluru 
##  2 M Chinnaswamy Stadium                     Bengaluru 
##  3 M Chinnaswamy Stadium                     Bengaluru 
##  4 Dr DY Patil Sports Academy, Mumbai        Mumbai    
##  5 Eden Gardens, Kolkata                     Kolkata   
##  6 M Chinnaswamy Stadium                     Bengaluru 
##  7 M Chinnaswamy Stadium                     Bengaluru 
##  8 Wankhede Stadium, Mumbai                  Mumbai    
##  9 Wankhede Stadium                          Mumbai    
## 10 Wankhede Stadium                          Mumbai    
## 11 Rajiv Gandhi International Stadium, Uppal Hyderabad 
## 12 Feroz Shah Kotla                          Delhi     
## 13 Wankhede Stadium, Mumbai                  Mumbai    
## 14 Rajiv Gandhi International Stadium, Uppal Hyderabad 
## 15 Arun Jaitley Stadium                      Delhi     
## 16 Rajiv Gandhi International Stadium, Uppal Hyderabad 
## 17 MA Chidambaram Stadium, Chepauk           Chennai   
## 18 Brabourne Stadium                         Mumbai    
## 19 Wankhede Stadium                          Mumbai    
## 20 MA Chidambaram Stadium, Chepauk, Chennai  Chennai

# Select top 10 cities by number of innings
top_cities <- city_counts |>
  slice_max(n_innings, n = 10) |>
  pull(venue_city)

top_cities

##  [1] "Mumbai"    "Bengaluru" "Kolkata"   "Delhi"     "Chennai"   "Hyderabad"
##  [7] "Mohali"    "Jaipur"    "Pune"      "Ahmedabad"

#Filter innings data to only top cities
innings_top <- innings_totals |>
  filter(venue_city %in% top_cities)

head(innings_top)

## # A tibble: 6 × 4
##   match_id team                        venue_city innings_runs
##      <dbl> <chr>                       <chr>             <dbl>
## 1   335982 Kolkata Knight Riders       Bengaluru           205
## 2   335982 Royal Challengers Bangalore Bengaluru            63
## 3   335983 Chennai Super Kings         Mohali              234
## 4   335983 Kings XI Punjab             Mohali              196
## 5   335984 Delhi Daredevils            Delhi               122
## 6   335984 Rajasthan Royals            Delhi               122

To prepare the data for the ANOVA test, I first mapped each stadium to its city using a custom venue_city_map, so that all venues were consistently categorized at the city level, and removed the neutral venues labeled “Other”. Next, I grouped the data by match_id, team, and venue_city and summed runs to get the total for each innings. Finally, I kept the ten cities with the most innings for the ANOVA.

ANOVA

I fit a one-way ANOVA model to test whether mean team innings runs differ across venue cities. The response variable is \(innings\_runs\), and the predictor is the categorical variable \(venue\_city\). The ANOVA table reports the F-statistic and the corresponding p-value.

A statistically significant p-value (typically \(p < 0.05\)) would indicate that at least one city has a different mean innings total compared to the others, providing evidence against the null hypothesis of equal means across all venues.

# Fit the one-way ANOVA model
anova_model <- aov(innings_runs ~ venue_city, data = innings_top)

summary(anova_model)

##               Df  Sum Sq Mean Sq F value Pr(>F)  
## venue_city     9   17428    1936   1.799 0.0638 .
## Residuals   1642 1767044    1076                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA result: \(F = 1.799\) on \(9\) and \(1642\) degrees of freedom, \(p = 0.0638\)

Since the p‑value is greater than the \(0.05\) significance level, we fail to reject the null hypothesis that all ten cities have the same mean innings runs. Although the city averages differ slightly, these differences are small relative to the large match‑to‑match variability within each venue. In practical terms, cities differ a little, but not enough for the differences to be statistically detectable, and the analysis does not provide evidence that venue city has a significant effect on innings runs.
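
The F-statistic reported above can be reproduced by hand from the sums of squares in the ANOVA table, which also gives a rough sense of effect size. The following is a worked check using only the table values. The effect-size measure \(\eta^2\) is not part of the original output and is added here only as one possible way to summarize the result.

\[ MS_{\text{between}} = \frac{17428}{9} \approx 1936.4, \qquad MS_{\text{within}} = \frac{1767044}{1642} \approx 1076.2 \]

\[ F = \frac{MS_{\text{between}}}{MS_{\text{within}}} \approx \frac{1936.4}{1076.2} \approx 1.799 \]

\[ \eta^2 = \frac{SS_{\text{between}}}{SS_{\text{between}} + SS_{\text{within}}} = \frac{17428}{17428 + 1767044} \approx 0.0098 \]

Under this measure, venue city accounts for roughly 1% of the variation in innings runs, which is consistent with the small between-city differences described above.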

Visualization

innings_top |>
  ggplot() +
  geom_boxplot(
    mapping = aes(
      x = venue_city,
      y = innings_runs,
      fill = venue_city
    ),
    outlier.alpha = 0.4
  ) +
  labs(
    title = "Distribution of Innings Runs Across 10 IPL Venue Cities",
    x = "Venue City",
    y = "Innings Runs"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

The boxplot shows that the distribution of innings runs is broadly similar across all ten IPL venue cities. While the medians vary slightly from city to city, the overall spread, interquartile ranges, and presence of outliers are largely comparable. No city displays a distinctly higher or lower scoring pattern, and the heavy overlap in distributions indicates that the differences between cities are small relative to the large match‑to‑match variability within each venue. This visual pattern aligns with the ANOVA result, reinforcing that cities differ a little, but not enough for the differences to be statistically detectable.

Further question: Would the results change if we analyzed only recent IPL seasons?

Linear Regression

To explore the relationship between batting performance and boundary‑hitting, I fit a simple linear regression model with \(runs\) as the response variable and \(boundaries\) (\(fours + sixes\)) as the predictor.

# create boundaries
IPL <- IPL |>
  mutate(boundaries = fours + sixes)

# fit the regression model
model_runs <- lm(runs ~ boundaries, data = IPL)

summary(model_runs)

## 
## Call:
## lm(formula = runs ~ boundaries, data = IPL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.035  -1.786  -1.786   1.339  35.214 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.78571    0.04310   41.44   <2e-16 ***
## boundaries   6.62495    0.01242  533.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.169 on 19949 degrees of freedom
## Multiple R-squared:  0.9345, Adjusted R-squared:  0.9345 
## F-statistic: 2.846e+05 on 1 and 19949 DF,  p-value: < 2.2e-16

1. Intercept

The intercept represents the expected value of the response variable when the predictor is zero. In model_runs, a player with \(0\) boundaries in a match is predicted to score approximately \(1.79\) runs, based on the estimate \(\hat{\beta}_0 = 1.7857\).

2. Slope (Coefficient of the Predictor)

The slope quantifies how much the response variable changes for a one‑unit increase in the predictor; here, each additional boundary is associated with an estimated increase of \(6.62\) runs, given by \(\hat{\beta}_1 = 6.62495\).

3. P‑value

The extremely small value \(p < 2 \times 10^{-16}\) indicates that boundaries are a highly significant predictor of runs.

4. Coefficient of Determination \((R^2)\)

The coefficient of determination measures the proportion of variance in the response explained by the model. Here \(R^2 = 0.9345\) means that \(93.45\%\) of the variation in runs is explained by boundaries.

5. Residual Standard Error (RSE)

The residual standard error estimates the typical size of prediction errors. An RSE of \(5.169\) indicates that the model’s predicted runs typically differ from actual runs by about \(5\) runs.
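
As a worked illustration of the estimates above, the fitted line can be used to predict runs for a given boundary count. The value of 5 boundaries is a hypothetical input chosen for illustration, not a figure taken from the dataset.

\[ \widehat{runs} = \hat{\beta}_0 + \hat{\beta}_1 \cdot boundaries = 1.78571 + 6.62495 \times 5 \approx 34.91 \]

The reported statistics can also be cross-checked against one another. In simple linear regression the F-statistic equals the squared t-statistic of the slope, and \(R^2\) can be recovered from F and the residual degrees of freedom:

\[ t^2 = 533.46^2 \approx 2.846 \times 10^{5} = F, \qquad R^2 = \frac{F}{F + df_{\text{residual}}} = \frac{284600}{284600 + 19949} \approx 0.9345 \]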

The regression results show a strong linear relationship between boundary‑hitting and total runs scored. The slope estimate of \(\hat {\beta }_1=6.62\) indicates that each additional boundary is associated with an average increase of about \(6.6\) runs, while the intercept \(\hat {\beta }_0=1.79\) represents the expected runs for a player who hits no boundaries. Because a boundary is itself worth 4 or 6 runs, a slope above 6 suggests that boundary counts also track runs scored in other ways, for example through longer stays at the crease. The predictor is highly statistically significant, with a p‑value less than \(2\times 10^{-16}\), providing strong evidence that boundaries contribute substantially to run‑scoring. The model explains \(93.45\%\) of the variation in runs \((R^2=0.9345)\), suggesting that boundary‑hitting alone accounts for most of the differences in player scoring. The residual standard error of approximately \(5.17\) runs indicates that the model’s predictions are typically within about five runs of the actual values. Overall, the regression model shows that boundary count is a strong predictor of runs scored in this dataset.

Visualization

IPL |>
  ggplot(aes(x = boundaries, y = runs)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "darkred", linewidth = 1) +
  labs(
    title = "Relationship Between Boundaries and Runs",
    x = "Total Boundaries (Fours + Sixes)",
    y = "Runs Scored"
  ) +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

The scatterplot clearly shows a strong positive linear relationship between total boundaries and runs scored. As the number of boundaries increases, the points rise sharply along the vertical axis, indicating that players who hit more boundaries consistently score more runs. The fitted regression line captures this trend effectively, with most points lying close to the line, reflecting the high \(R^2\) value from the model. The tight clustering around the line suggests that boundary‑hitting is a highly reliable predictor of run‑scoring, with relatively little unexplained variation. Overall, the visualization reinforces the numerical results by showing that boundary‑hitting is strongly and linearly associated with batting performance.

Further question: Does the linear relationship between boundaries and runs remain consistent across different players and teams?

Week8_Datadive

Mayank Gupta

2026-03-03
