---
title: "Ahlad-Data Discovery Stage 1"
output:
  html_notebook: default
  pdf_document: default
---

### AIM: What factors influence match outcomes in cricket, and how do toss decisions and venues impact the performance of teams?

#### Description:

I have selected a dataset of Indian Premier League (IPL) matches; the IPL is a popular cricket league in India. The dataset covers individual matches played from 2008 to 2024 and summarizes each match with the teams involved, the toss winner and match winner, the match type, the target runs, and similar fields. This makes it possible to examine match results and how each team has performed over the years. The dataset and its documentation are available at https://www.kaggle.com/datasets/patrickb1912/ipl-complete-dataset-20082020?select=matches.csv

#### Goal: 

The goals of this project are to analyze cricket match data to identify patterns and trends in match outcomes, to understand the impact of toss decisions on match results, to investigate whether specific venues give certain teams an advantage, and finally to use these insights to inform strategies for teams or league organizers.

#### Visualizations : 

The distribution of matches over time shows how the tournament has grown. A steady increase in matches could indicate an expanded schedule, while dips might reflect unusual events or changes in format. The matches-per-season chart, drawn once the data is loaded, provides this historical context for the rest of the dataset.

```{r}
library(dplyr)
library(ggplot2)

data <- read.csv("~/Documents/Rdocs/matches.csv", stringsAsFactors = TRUE)

head(data)
```

```{r}
# Count matches in each season; as.data.frame() on a table yields Var1 and Freq columns.
matches_per_season <- table(data$season)
matches_per_season_df <- as.data.frame(matches_per_season)

ggplot(matches_per_season_df, aes(x = Var1, y = Freq)) +
  geom_col(fill = "steelblue") +
  theme_minimal() +
  labs(title = "Number of Matches per Season", x = "Season", y = "Number of Matches") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

```

Toss decisions often play a strategic role in cricket. Examining the proportion of toss winners who choose to bat or field shows which strategy teams prefer. This is particularly interesting when paired with win rates, as it sets the stage for investigating whether one strategy is more successful than the other. This visualization provides an overview of teams' preferences, laying the foundation for a deeper analysis of how toss decisions affect match outcomes.

```{r}
toss_decisions <- table(data$toss_decision)
toss_decisions_df <- as.data.frame(toss_decisions)

ggplot(toss_decisions_df, aes(x = "", y = Freq, fill = Var1)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Toss Decisions: Bat vs. Field", x = NULL, y = NULL) +
  theme_void() +
  scale_fill_manual(values = c("orange", "skyblue")) +
  guides(fill = guide_legend(title = "Toss Decision"))

```


#### To-Do List:

1) Handle missing values and standardize the date format for time-based analysis.

2) Analyze the impact of toss decisions on match outcomes using statistics and visualizations.

3) Study venue-wise performance to identify any home-ground advantage.

4) Use statistical methods to test hypotheses about toss decisions and venue effects on match outcome.

5) Create advanced visualizations to clearly communicate findings.
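
As a starting point for item 1, the cleaning step could look roughly like the sketch below. The helper name `clean_matches` is hypothetical, and the sketch assumes the data has a `date` column stored as "YYYY-MM-DD" text and that a missing `winner` marks a match without a result; both assumptions should be checked against the dataset documentation.

```{r}
library(dplyr)

# Hypothetical helper (not part of the original notebook).
# Assumption: `date` is "YYYY-MM-DD" text; change `date_format` if it is not.
# Assumption: a missing `winner` means the match had no result.
clean_matches <- function(matches, date_format = "%Y-%m-%d") {
  matches %>%
    mutate(
      # Convert factors to character so team names compare safely across columns.
      across(where(is.factor), as.character),
      date = as.Date(date, format = date_format)
    ) %>%
    # Drop matches with no recorded winner.
    filter(!is.na(winner))
}

# Usage in this notebook (assumes `data` has been loaded as above):
# data_clean <- clean_matches(data)
```

Dropping no-result matches is only one way to handle missing values; keeping them and treating them as a separate outcome is an equally reasonable choice.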


#### Initial Findings

Two working hypotheses guide this first look at the data. First, teams that choose to field after winning the toss may have a higher probability of winning, since chasing a known target lets them plan their innings.
Second, certain venues may favor home teams because of their familiarity with the pitch and conditions.

Let's compare win rates for teams that chose to bat vs. field after winning the toss.

```{r}
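# Exclude matches without a recorded winner, then compare team names as
# character so that differing factor level sets cannot break the comparison.
toss_impact <- data %>%
  filter(!is.na(winner)) %>%
  mutate(toss_win_and_match_win = ifelse(as.character(toss_winner) == as.character(winner), "Yes", "No")) %>%
  group_by(toss_decision, toss_win_and_match_win) %>%
  summarize(count = n(), .groups = "drop_last") %>%
  mutate(percentage = count / sum(count) * 100)  # share within each toss decision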

ggplot(toss_impact, aes(x = toss_decision, y = percentage, fill = toss_win_and_match_win)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Impact of Toss Decisions on Match Outcomes", x = "Toss Decision", y = "Percentage", fill = "Match Win")

```

As an initial observation, teams that opt to field after winning the toss appear more likely to win than teams that opt to bat.
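
A difference read off a bar chart may be due to chance, so one way to check it is a chi-squared test of independence between the toss decision and whether the toss winner also won the match. The sketch below is only an illustration: the helper name `toss_decision_test` is hypothetical, and it assumes matches with a missing `winner` should be excluded.

```{r}
# Hypothetical helper (not part of the original notebook): cross-tabulates the
# toss decision against whether the toss winner also won the match, then runs
# a chi-squared test of independence on that table.
toss_decision_test <- function(matches) {
  # Assumption: a missing winner means no result, so the match is excluded.
  decided <- matches[!is.na(matches$winner), ]
  toss_winner_won <- as.character(decided$toss_winner) == as.character(decided$winner)
  contingency <- table(
    toss_decision = as.character(decided$toss_decision),
    toss_winner_won = toss_winner_won
  )
  list(table = contingency, test = chisq.test(contingency))
}

# Usage in this notebook (assumes `data` has been loaded as above):
# toss_decision_test(data)$test
```

For a 2x2 table, `chisq.test()` applies Yates' continuity correction by default. A small p-value would suggest that the win rate of the toss winner differs between batting first and fielding first; it would not by itself show why.
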
Now let's look at how many matches each team has won at each venue, as a first step toward studying home-ground advantage.

```{r}

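# Count wins per venue and team, keeping only matches with a recorded winner.
# The counts are ungrouped before the top-10 cut; otherwise the cut is applied
# within each venue and removes almost nothing.
venue_performance <- data %>%
  filter(!is.na(winner)) %>%
  group_by(venue, winner) %>%
  summarize(wins = n(), .groups = "drop") %>%
  slice_max(wins, n = 10)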

ggplot(venue_performance, aes(x = reorder(venue, wins), y = wins, fill = winner)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Venue Performance and Winning Teams", x = "Venue", y = "Number of Wins", fill = "Winning Team")

```

This chart is still rough, but it gives a first view of how each team's wins are distributed across venues.
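
Win counts alone favor teams that simply played more matches at a venue, so a win rate may be a fairer comparison. The sketch below computes wins divided by appearances for each team at each venue. The helper name `venue_win_rates` and the `min_matches` cutoff are hypothetical choices, and because the columns used so far do not identify a home team, this measures venue-wise performance rather than home advantage directly.

```{r}
library(dplyr)

# Hypothetical helper (not part of the original notebook): win rate of each
# team at each venue, keeping only team/venue pairs with at least
# `min_matches` appearances (an arbitrary cutoff chosen for illustration).
venue_win_rates <- function(matches, min_matches = 5) {
  matches <- matches %>%
    mutate(across(c(venue, team1, team2, winner), as.character))

  # One row per team appearance: each match contributes team1 and team2.
  # Matches without a recorded winner count as appearances but not as wins.
  appearances <- bind_rows(
    transmute(matches, venue, team = team1, won = !is.na(winner) & winner == team1),
    transmute(matches, venue, team = team2, won = !is.na(winner) & winner == team2)
  )

  appearances %>%
    group_by(venue, team) %>%
    summarize(played = n(), wins = sum(won), .groups = "drop") %>%
    filter(played >= min_matches) %>%
    mutate(win_rate = wins / played) %>%
    arrange(desc(win_rate))
}

# Usage in this notebook (assumes `data` has been loaded as above):
# head(venue_win_rates(data), 10)
```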

Next, I will continue working through the to-do list above so that the project can meet its aim.
