Week 4: Data-Dive

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(patchwork)
Data_set <- "/Users/ba/Documents/IUPUI/Masters/First Sem/Statistics/Dataset/PitchingPost.csv"
Pitching_Data <- read.csv(Data_set)
  1. Let's create 5 random samples from our dataset, sampling with replacement
library(dplyr)

set.seed(123)  # fix the seed so the five samples are reproducible
# Each sample holds half as many rows as the full dataset, rounded down to a whole number
sample_size <- floor(nrow(Pitching_Data) * 0.5)

# Draw five samples of the same size, each with replacement
df_1 <- Pitching_Data %>% sample_n(size = sample_size, replace = TRUE)
df_2 <- Pitching_Data %>% sample_n(size = sample_size, replace = TRUE)
df_3 <- Pitching_Data %>% sample_n(size = sample_size, replace = TRUE)
df_4 <- Pitching_Data %>% sample_n(size = sample_size, replace = TRUE)
df_5 <- Pitching_Data %>% sample_n(size = sample_size, replace = TRUE)
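The five draws above can also be produced in a single step. The following is a minimal sketch, not the code used for this report: `toy_pitching` is a made-up stand-in for `Pitching_Data`, and `slice_sample()` is the newer dplyr verb that supersedes `sample_n()`.

```r
library(dplyr)

# Hypothetical stand-in for Pitching_Data (the real data comes from PitchingPost.csv)
toy_pitching <- data.frame(
  playerID = paste0("p", 1:20),
  R        = c(0, 1, 2, 3, 0, 1, 4, 2, 1, 0, 5, 2, 3, 1, 0, 2, 6, 1, 0, 3)
)

set.seed(123)
sample_size <- floor(nrow(toy_pitching) * 0.5)  # half the rows, rounded down

# Draw five samples with replacement and keep them together in a named list
samples <- lapply(1:5, function(i) {
  slice_sample(toy_pitching, n = sample_size, replace = TRUE)
})
names(samples) <- paste0("sample_", 1:5)

# Every sample has the same number of rows
sapply(samples, nrow)
```

Keeping the samples in one list makes it easy to apply the same summary to all of them with `lapply()` or `sapply()` instead of repeating code per sample.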
  2. In the last Data-Dive we compared three variables, Runs Allowed, Earned Runs, and Earned Run Average, to determine which pitcher is best and how one metric could be more reliable than another. Now let's see how these three metrics look in each of the sub-samples we have created.
    • First, we take these variables one at a time, bin each into performance categories, and compute how probable each category is in every sample, to see whether the probabilities differ among the 5 samples.
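The binning step described above can be sketched as follows. This is an illustration only: the Runs Allowed values and the cut points (0 to 1 runs as "Best", 2 to 3 as "Average", 4 or more as "Worst") are assumptions, not the thresholds used to produce the plots in this report.

```r
# Hypothetical Runs Allowed values for two sub-samples
sample_a <- c(0, 0, 1, 1, 2, 3, 4, 5, 0, 1)
sample_b <- c(0, 1, 1, 2, 2, 3, 3, 6, 0, 0)

# Assumed cut points: 0-1 runs = Best, 2-3 = Average, 4 or more = Worst
to_category <- function(runs) {
  cut(runs,
      breaks = c(-Inf, 1, 3, Inf),
      labels = c("Best", "Average", "Worst"))
}

# Share of rows in each category, i.e. the empirical probability
category_prob <- function(runs) {
  prop.table(table(to_category(runs)))
}

prob_a <- category_prob(sample_a)
prob_b <- category_prob(sample_b)

# One row per sample, one column per category
rbind(sample_a = prob_a, sample_b = prob_b)
```

Because the same `to_category()` function is applied to every sample, any difference between rows reflects sampling variation rather than a change in the bin definitions.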
plot2

plot3

plot4

plot5

Here we can see that, for Runs Allowed, the probabilities of the performance categories (Best, Average, and Worst) are close across all 5 samples, differing by roughly 0.03.

plot7

plot8

plot9

plot10

For Earned Runs as well, the probabilities of the performance categories (Best, Average, and Worst) are close across the samples, differing by roughly 0.03. Of the 5 sub-samples, sample-2 has the lowest probability in the “Best” category and sample-3 has the highest.
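One way to put a number on how close the samples are is the gap between the largest and smallest probability in each category. The sketch below uses a made-up probability table; the values are placeholders chosen for illustration and are not the results behind the plots.

```r
# Hypothetical probabilities (rows = samples, columns = categories); not the real results
probs <- rbind(
  sample_1 = c(Best = 0.50, Average = 0.30, Worst = 0.20),
  sample_2 = c(Best = 0.48, Average = 0.31, Worst = 0.21),
  sample_3 = c(Best = 0.51, Average = 0.29, Worst = 0.20)
)

# Gap between the largest and smallest probability in each category
spread <- apply(probs, 2, function(p) max(p) - min(p))
spread

# Samples with the lowest and highest "Best" probability
lowest_best  <- names(which.min(probs[, "Best"]))
highest_best <- names(which.max(probs[, "Best"]))
c(lowest = lowest_best, highest = highest_best)
```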

plot12

plot13

plot14

plot15

In this comparison of Earned Run Average among the 5 sub-samples, we can see that sample-3, sample-4, and sample-5 give nearly identical results, whereas sample-1 and sample-2 have slightly different values.

  3. Monte Carlo Simulation

Now let's run a Monte Carlo simulation on the variable “Runs Allowed” to see whether the simulated mean comes close to the actual mean of the variable.

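# Draw 1000 values from a normal distribution with the same mean and
# standard deviation as Runs Allowed (R)
samples_Pitching_Data <- rnorm(1000, mean(Pitching_Data$R), sd(Pitching_Data$R))

# Repeat 1000 times: resample 100 values with replacement and record their mean
simulations_Pitching_data <- replicate(1000, mean(sample(samples_Pitching_Data, 100, replace = TRUE)))

hist(simulations_Pitching_data, main = "Histogram of Simulated Means of Runs Allowed")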

print(mean(Pitching_Data$R))
## [1] 1.790667

We can see that the mean of “Runs Allowed” is close to the mean obtained through the Monte Carlo simulation.
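To make the comparison explicit, the simulated mean can be computed and set next to the actual mean. The sketch below is a variation, not the code above: it uses made-up Runs Allowed values in place of `Pitching_Data$R`, and it resamples the observed values directly instead of drawing from a fitted normal distribution, which avoids generating negative run counts.

```r
set.seed(123)

# Hypothetical stand-in for Pitching_Data$R
runs_allowed <- c(0, 1, 2, 3, 0, 1, 4, 2, 1, 0, 5, 2, 3, 1, 0, 2, 6, 1, 0, 3)

# Repeat 1000 times: resample 100 observed values with replacement, record the mean
sim_means <- replicate(1000, mean(sample(runs_allowed, 100, replace = TRUE)))

actual_mean    <- mean(runs_allowed)
simulated_mean <- mean(sim_means)

# Absolute gap between the two means, and the spread of the simulated means
abs(simulated_mean - actual_mean)
sd(sim_means)
```

The standard deviation of `sim_means` estimates how far a single 100-row sample mean typically lands from the actual mean, which gives "close" a concrete scale.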

df_1 <- df_1[is.finite(df_1$ERA), ]

df_1 |>
  ggplot(aes(x="sample-1",y=ERA))+
  stat_boxplot()+
  labs(x="Sample 1",y="ERA Distribution",title="Finding the anomalies in ERA from Sample 1")+
  theme_classic()

df_2 <- df_2[is.finite(df_2$ERA), ]

df_2 |>
  ggplot(aes(x="sample-2",y=ERA))+
  stat_boxplot()+
  labs(x="Sample 2",y="ERA Distribution",title="Finding the anomalies in ERA from Sample 2")+
  theme_classic()

df_3 <- df_3[is.finite(df_3$ERA), ]

df_3 |>
  ggplot(aes(x="sample-3",y=ERA))+
  stat_boxplot()+
  labs(x="Sample 3",y="ERA Distribution",title="Finding the anomalies in ERA from Sample 3")+
  theme_classic()

df_4 <- df_4[is.finite(df_4$ERA), ]

df_4 |>
  ggplot(aes(x="sample-4",y=ERA))+
  stat_boxplot()+
  labs(x="Sample 4",y="ERA Distribution",title="Finding the anomalies in ERA from Sample 4")+
  theme_classic()

df_5 <- df_5[is.finite(df_5$ERA), ]

df_5 |>
  ggplot(aes(x="sample-5",y=ERA))+
  stat_boxplot()+
  labs(x="Sample 5",y="ERA Distribution",title="Finding the anomalies in ERA from Sample 5")+
  theme_classic()

When we compare ERA across all 5 samples, we can see that the sampling worked well, because the outliers are very similar in every sample. The outliers in samples 2 and 3 are very similar to each other but differ slightly from those in samples 1, 4, and 5.
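The visual comparison of outliers can be backed up by listing them per sample. This is a minimal sketch with made-up ERA values. Base R's `boxplot.stats()` flags points more than 1.5 * IQR beyond the hinges; ggplot2 boxplots use a similar 1.5 * IQR rule, so the flagged points should largely match the dots drawn in the plots.

```r
# Hypothetical ERA values for two sub-samples
era_samples <- list(
  sample_1 = c(0, 1.5, 2.25, 3, 3.6, 4.5, 5.4, 27),
  sample_2 = c(0, 1.8, 2.7, 3.38, 4.05, 4.5, 6.75, 54)
)

# Points lying more than 1.5 * IQR beyond the hinges of each sample
outliers <- lapply(era_samples, function(era) boxplot.stats(era)$out)

outliers

# Number of outliers per sample, for a quick side-by-side check
outlier_counts <- sapply(outliers, length)
outlier_counts
```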