Ahlad_Week6

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

Let's build two pairs of variables from the dataset.

data <- read.csv("~/Downloads/archive/matches.csv")
# Pair 1: target_runs (explanatory) and result_margin (response);
# win_type labels whether the match was won by runs or by wickets
data <- data %>%
  mutate(win_type = ifelse(result == "runs", "Won by Runs", "Won by Wickets"))
# Pair 2: target_runs (explanatory) and average_run_rate (response),
# where average_run_rate is derived as target_runs / target_overs
data <- data %>%
  mutate(average_run_rate = target_runs / target_overs)

# Preview the columns used in both pairs
head(data %>% select(target_runs, result_margin, win_type, target_overs, average_run_rate))

The win_type column indicates whether a match was won by runs or by wickets. The average_run_rate column is a derived attribute, computed as target_runs divided by target_overs, and gives the average runs per over behind the target. It covers only the defending team; the chasing team's scoring is not included in this calculation. Together, these two pairs let us analyze how matches were won and the defending team's average run rate.
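One caveat on win_type: ifelse() labels every match whose result is not "runs" as won by wickets. If the result column can hold other values (an assumption here, for example a tie), those rows would be mislabeled. A minimal sketch on hypothetical toy data, using dplyr's case_when() to label only the two known outcomes and leave anything else as NA:

```r
library(dplyr)

# Hypothetical toy data; the "tie" value is an assumption for illustration
toy <- data.frame(result = c("runs", "wickets", "tie", "runs"))

toy <- toy %>%
  mutate(win_type = case_when(
    result == "runs"    ~ "Won by Runs",
    result == "wickets" ~ "Won by Wickets",
    TRUE                ~ NA_character_  # anything else stays unlabeled
  ))

print(toy)
```

Whether this matters for matches.csv depends on which values result actually takes; table(data$result) would show them.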

Let's look at some visualizations to understand these variables better.

data <- data %>%
  mutate(win_type = ifelse(result == "runs", "Runs", "Wickets"))
ggplot(data, aes(x = target_runs, y = result_margin)) +
  geom_point(aes(color = win_type), alpha = 0.6) +
  labs(title = "Target Runs vs Result Margin", x = "Target Runs", y = "Result Margin") +
  theme_minimal()

## Warning: Removed 19 rows containing missing values or values outside the scale range
## (`geom_point()`).

The plot shows that most matches had targets in the 130-180 range. Matches won by wickets tend to have lower margins than matches won by runs, which is expected because a wicket margin cannot exceed 10. A few outliers have a result margin above 100, which means those matches were one-sided. Some matches with high targets ended as close contests, while some matches with low targets ended with a large result margin.

data <- data %>%
  mutate(average_run_rate = target_runs / target_overs)
ggplot(data, aes(x = target_runs, y = average_run_rate)) +
  geom_point(alpha = 0.6, color = "blue") +
  labs(title = "Target Runs vs Average Run Rate", x = "Target Runs", y = "Average Run Rate") +
  theme_minimal()

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

This plot shows that the average run rate generally increases as the target runs increase. This is expected, since a higher target means more runs were scored per over. The outliers correspond to matches played over fewer overs, where teams tend to score faster, so the average run rate is higher for a given target. These are the main observations from the two plots of the response variables.

Let's find the correlation coefficients for these pairs. Both variables in each pair are continuous and the relationships look roughly linear in the plots, so we can use the Pearson coefficient.

# Correlation for Pair 1: target_runs vs result_margin
correlation_1 <- cor(data$target_runs, data$result_margin, use = "complete.obs", method = "pearson")
cat("Correlation between Target Runs and Result Margin:", correlation_1, "\n")

## Correlation between Target Runs and Result Margin: 0.3951201

# Correlation for Pair 2: target_runs vs average_run_rate
correlation_2 <- cor(data$target_runs, data$average_run_rate, use = "complete.obs", method = "pearson")
cat("Correlation between Target Runs and Average Run Rate:", correlation_2, "\n")

## Correlation between Target Runs and Average Run Rate: 0.8825475

The coefficient for pair 1 is about 0.40. This is a fairly weak positive correlation: the relationship between target runs and result margin is not strong, and there are cases where a higher target resulted in a lower result margin.

The coefficient for pair 2 is about 0.88, which indicates a strong positive linear relationship between target runs and average run rate: the average run rate rises roughly linearly with the target. Part of this is expected by construction, since average_run_rate is computed from target_runs.

The plots show the same patterns, so these coefficients are broadly consistent with the visualizations, although a single coefficient does not capture the outliers seen in the plots.
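To judge how much weight a coefficient like these can bear, base R's cor.test() reports a confidence interval and p-value alongside the Pearson estimate. The sketch below uses hypothetical toy data standing in for target_runs and result_margin (the values are simulated, not taken from matches.csv):

```r
# Hypothetical toy data; simulated values, not from matches.csv
set.seed(42)
toy <- data.frame(target_runs = round(runif(50, 100, 220)))
toy$result_margin <- round(0.1 * toy$target_runs + rnorm(50, sd = 15))
toy$result_margin[c(3, 17)] <- NA  # mimic matches with a missing margin

# cor.test() drops incomplete pairs, then returns the Pearson estimate
# together with a 95% confidence interval and a p-value
res <- cor.test(toy$target_runs, toy$result_margin, method = "pearson")

# The estimate matches cor() with use = "complete.obs"
r_manual <- cor(toy$target_runs, toy$result_margin, use = "complete.obs")

print(res$estimate)
print(res$conf.int)
```

On the real data, the equivalent call would be cor.test(data$target_runs, data$result_margin).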

Let's build confidence intervals for the response variables from both pairs.

# Confidence Interval for response variable: result_margin (Pair 1)
ci_result_margin <- t.test(data$result_margin, conf.level = 0.95)
cat("95% Confidence Interval for Result Margin:\n", ci_result_margin$conf.int, "\n")

## 95% Confidence Interval for Result Margin:
##  15.95601 18.56257

# Confidence Interval for response variable: average_run_rate (Pair 2)
ci_avg_run_rate <- t.test(data$average_run_rate, conf.level = 0.95)
cat("95% Confidence Interval for Average Run Rate:\n", ci_avg_run_rate$conf.int, "\n")

## 95% Confidence Interval for Average Run Rate:
##  8.304006 8.495184

The confidence interval gives a range that we are 95% confident contains the true population mean of the result margin, here roughly 15.96 to 18.56. This interval describes only the mean margin; on its own it does not show how the margin changes with target runs. Similarly, the interval for the average run rate, roughly 8.30 to 8.50, estimates the true population mean run rate. The interval is narrow, so the mean is estimated precisely, and its values are high, which suggests that teams generally score at a fast pace.
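The interval reported by t.test() is the sample mean plus or minus a t critical value times the standard error. As a rough sketch, the calculation can be reproduced by hand on hypothetical simulated run rates (the numbers below are made up for illustration):

```r
# Hypothetical simulated run rates with two missing values
set.seed(1)
x <- c(rnorm(40, mean = 8.4, sd = 1), NA, NA)

# Manual 95% CI: mean +/- t critical value * standard error
x_obs  <- x[!is.na(x)]
n      <- length(x_obs)
se     <- sd(x_obs) / sqrt(n)
t_crit <- qt(0.975, df = n - 1)
ci_manual <- mean(x_obs) + c(-1, 1) * t_crit * se

# t.test() drops the NAs and returns the same interval
ci_ttest <- t.test(x, conf.level = 0.95)$conf.int

print(ci_manual)
print(as.numeric(ci_ttest))
```

Because the standard error shrinks with the square root of the sample size, a large number of matches gives a narrow interval even when individual values vary a lot.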

These are some of the conclusions I drew from these pairs of variables.