Ahlad_Data Discovery

Description:

I have selected a dataset on IPL (Indian Premier League) matches, a popular cricket league in India. It contains the details of the individual matches played from 2008 to 2024. It summarizes each match by providing the teams that played, the toss winner and match winner, the match type, the target runs, and so on. This helps us understand the result of each match and how every team has performed over the years. The dataset and its documentation are available at https://www.kaggle.com/datasets/patrickb1912/ipl-complete-dataset-20082020?select=matches.csv

Although the details are fairly complete, the dataset has some missing values and anomalies. The main purpose of my project is to create a clean database and apply various statistical approaches that help draw conclusions about how teams have performed over time. I also have questions about several columns, which can be analysed and modified to make the data easier for users to work with. In particular, I want to analyse how a team performs against a specific opponent over the years.

I want to relate analyses to each other, for example by looking at the variance of the result margin when two specific teams play each other. There is a lot to investigate in this dataset, such as which team leads on each attribute, and such analyses help when those two teams next meet head to head. Although results depend on players, conditions and so on, I want to know which team has the upper hand in individual aspects when they play each other. It is good to back knowledge about teams with proper analysis. This also supports sports analytics for cricket and helps users see the trends.
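A head-to-head summary of this kind could be sketched as follows. This is only a minimal illustration in base R: the toy rows are made up, the column names `team1` and `team2` are assumed (they are not shown elsewhere in this report), and `head_to_head()` is a hypothetical helper rather than part of any package.

```r
# Toy stand-in for matches.csv; all rows are made up for illustration.
# team1 / team2 are assumed column names for the two sides of a match.
matches <- data.frame(
  team1 = c("A", "B", "A", "C"),
  team2 = c("B", "A", "B", "A"),
  winner = c("A", "A", "B", "C"),
  result = c("runs", "wickets", "runs", "runs"),
  result_margin = c(20, 4, 10, 15),
  stringsAsFactors = FALSE
)

# Hypothetical helper: summarise all matches between teams a and b.
head_to_head <- function(df, a, b) {
  # Keep matches between a and b regardless of which side is listed first
  met <- (df$team1 == a & df$team2 == b) | (df$team1 == b & df$team2 == a)
  games <- df[which(met), ]
  list(
    played = nrow(games),
    wins = table(games$winner),
    # Variance of the margin, restricted to wins by runs so units are not mixed
    margin_var_runs = var(games$result_margin[games$result == "runs"])
  )
}

h2h <- head_to_head(matches, "A", "B")
print(h2h)
```

Restricting the variance to wins by runs is a design choice in this sketch. A margin in wickets is on a different scale and would need its own summary.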

Let's look at some visualizations to understand the dataset better.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data <- read.csv("~/Documents/Rdocs/matches.csv", stringsAsFactors = TRUE)

ggplot(data, aes(x = winner, y = result_margin, color = winner)) +
  geom_point(size = 2) +
  labs(title = "Scatter Plot of winner by result margin",
       x = "winner", y = "result margin") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

## Warning: Removed 19 rows containing missing values or values outside the scale range
## (`geom_point()`).

We can see how the result margins of each team's wins are distributed. However, this visualization makes it hard to compare two teams, because some teams appear in the dataset under more than one name. Such names need to be combined so that the user gets one result per team, for example Punjab Kings and Kings XI Punjab.
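One way to combine such names is a small lookup table applied to the team columns before plotting. The sketch below uses base R on made-up rows. The `team_aliases` table and the `normalise_team()` helper are hypothetical names introduced for illustration, and the table would need one entry per renamed team.

```r
# Toy stand-in for the `winner` column; the rows are made up for illustration.
matches <- data.frame(
  winner = c("Kings XI Punjab", "Punjab Kings", "Mumbai Indians", "Kings XI Punjab"),
  result_margin = c(12, 5, 30, 7),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: old name -> current name. Extend as needed.
team_aliases <- c("Kings XI Punjab" = "Punjab Kings")

# Replace any name found in the lookup table; leave every other name unchanged.
normalise_team <- function(x) {
  x <- as.character(x)  # also handles factor columns (stringsAsFactors = TRUE)
  hit <- x %in% names(team_aliases)
  x[hit] <- unname(team_aliases[x[hit]])
  x
}

matches$winner <- normalise_team(matches$winner)

# Wins per team after merging the two names
wins_per_team <- table(matches$winner)
print(wins_per_team)
```

The same helper would have to be applied to every column that holds a team name, not only `winner`, so that comparisons between columns stay consistent.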

Initially I found some missing data in several attributes, and some columns also need reformatting. I also found relationships between result margin and match type. Let's look at a few hypotheses visually.
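A quick way to see where the missing data sits is to count the NA values per column. The sketch below uses base R on a made-up data frame. On the real data, the same two lines would be applied to the data frame read from matches.csv.

```r
# Toy stand-in for matches.csv; the values are made up for illustration.
matches <- data.frame(
  winner = c("A", NA, "B", "A"),
  result_margin = c(10, NA, 4, NA),
  target_runs = c(150, 160, NA, 170)
)

# Number of missing values per column, largest first
na_counts <- sort(colSums(is.na(matches)), decreasing = TRUE)
print(na_counts)
```

Note that `read.csv()` only treats the string "NA" as missing by default, so blank text fields would not be counted here unless `na.strings` is set when reading the file.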

Hypotheses:

result_margin_result_summary <- data |>
  mutate(result_margin_range = cut(
    result_margin,
    breaks = c(0, 50, 100, 150, 200, Inf),
    # the last bin is open-ended, so label it "200+" rather than "200-300"
    labels = c("0-50", "50-100", "100-150", "150-200", "200+")
  )) |>
  group_by(result_margin_range, result) |>
  summarise(
    Avg_target_runs = mean(target_runs),
    Avg_target_overs = mean(target_overs)
  )

## `summarise()` has grouped output by 'result_margin_range'. You can override
## using the `.groups` argument.

# Use a stacked bar chart for the visualization
result_margin_result_bar <- ggplot(
  result_margin_result_summary,
  aes(x = result_margin_range, y = Avg_target_runs, fill = result)
) +
  geom_bar(stat = "identity") +
  labs(title = "Average target runs for result margin by result",
       x = "result_margin_range",
       y = "Avg_target_runs",
       fill = "result") +
  theme_minimal()

# Display the stacked bar chart
print(result_margin_result_bar)

## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_bar()`).

In this chart, the matches won by wickets are the ones with lower targets, and higher targets mostly go with wins by runs, that is, wins for the side defending the target. This suggests an association between the result type and the target runs.
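The relationship described above can also be checked numerically by averaging the target per result type. The sketch below uses base R on made-up rows. It passes `na.rm = TRUE` because a single missing target would otherwise turn the mean of its whole group into NA.

```r
# Toy stand-in for matches.csv; the values are made up for illustration.
matches <- data.frame(
  result = c("runs", "runs", "wickets", "wickets", "tie"),
  target_runs = c(190, 170, 140, NA, 160)
)

# Mean target per result type; na.rm = TRUE skips missing targets
avg_target <- tapply(matches$target_runs, matches$result, mean, na.rm = TRUE)
print(avg_target)
```

On the real data, a clearly higher mean for `runs` than for `wickets` would be consistent with the chart, although it would not by itself show that a higher target causes the win.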

Let's look at another hypothesis.

match_type_season_summary <- data |>
  group_by(match_type,season) |>
  summarise(
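    # result_margin has missing values, so drop them before averaging
    Avg_result_margin = mean(result_margin, na.rm = TRUE),
    Avg_target_overs = mean(target_overs, na.rm = TRUE),
    Median_target_runs = median(target_runs, na.rm = TRUE)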
  )

## `summarise()` has grouped output by 'match_type'. You can override using the
## `.groups` argument.

# Here I am checking the average result margin of matches over seasons by match type.
match_type_season_densitymap <- ggplot(match_type_season_summary, aes(x = season, y = match_type, fill = Avg_result_margin)) +
  geom_tile() +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Average Result Margin of matches over seasons by match_type",
       x = "season",
       y = "match_type",
       fill = "Avg_result_margin") +
  theme_minimal()

# Display the heatmap
print(match_type_season_densitymap)

Finals have been closer encounters: their average result margin is lower than that of league matches. One possible explanation is that the pressure on players is higher in finals and everyone tries to perform well, so these matches are decided by a difference of only a few runs.
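One caveat with this comparison is that it averages `result_margin` over all results at once. Assuming the margin is counted in runs for wins by runs and in wickets for wins by wickets, the two kinds of win are better compared separately. The sketch below shows one way to do that in base R, on made-up rows.

```r
# Toy stand-in for matches.csv; the values are made up for illustration.
matches <- data.frame(
  match_type = c("League", "League", "League", "Final", "Final"),
  result = c("runs", "runs", "wickets", "runs", "wickets"),
  result_margin = c(40, 20, 6, 8, 3)
)

# Mean margin per match type, computed separately for each result type.
# The formula interface drops rows with a missing margin by default.
margin_by_type <- aggregate(
  result_margin ~ match_type + result,
  data = matches,
  FUN = mean
)
print(margin_by_type)
```

If finals still show the smaller mean margin within both `runs` and `wickets` on the real data, that would support the hypothesis more directly than the combined average.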

This is the dataset I discovered. I want to continue exploring it and build a sports analytics model from it that helps me analyse IPL matches better.