Week-5 (Data-Dive)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(patchwork)
Data_set <- "/Users/ba/Documents/IUPUI/Masters/First Sem/Statistics/Dataset/PitchingPost.csv"
Pitching_Data <- read.csv(Data_set)

A list of at least 3 columns (or values) in your data which are unclear until you read the documentation.

round

The specific levels or stages of playoffs are not explicitly defined in the documentation. Without further context, it’s unclear if “round” refers to divisional playoffs, league championship series, or the World Series.

head(Pitching_Data$round)

## [1] "WS"    "NLDS2" "NLCS"  "NLDS1" "NLDS1" "NLDS2"
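
One way to see how many distinct codes a column like round uses is to tabulate it. The sketch below uses only the six values printed above, stored in a stand-in vector (round_sample is a hypothetical name, not part of the original analysis). On the full data the same calls would presumably be applied to Pitching_Data$round.

```r
# Stand-in for Pitching_Data$round: the six values printed by head() above.
round_sample <- c("WS", "NLDS2", "NLCS", "NLDS1", "NLDS1", "NLDS2")

# Distinct codes, in order of first appearance.
round_codes <- unique(round_sample)
print(round_codes)

# Frequency of each code, sorted from most to least common.
round_counts <- sort(table(round_sample), decreasing = TRUE)
print(round_counts)
```

Each distinct code that appears here is one that has to be looked up in the documentation before the column can be interpreted.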

lgID

It’s unclear what the abbreviations AA, AL, and NL stand for without further context. Are they American Association, American League, and National League?

head(Pitching_Data$lgID)

## [1] "NL" "NL" "NL" "NL" "NL" "NL"
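
Once the documentation confirms what the codes mean, a small lookup table can make the labels explicit. This is only a sketch: the expansions below are the readings suggested above and should be checked against the dataset's documentation, and league_names and lg_sample are hypothetical names introduced for illustration.

```r
# Assumed expansions of the league codes; verify against the documentation.
league_names <- c(
  AA = "American Association",
  AL = "American League",
  NL = "National League"
)

# Hypothetical stand-in for Pitching_Data$lgID, with one code ("XX")
# that is deliberately absent from the lookup.
lg_sample <- c("NL", "AL", "AA", "NL", "XX")

# Indexing by name returns NA for any code missing from the lookup,
# which flags values that still need to be documented.
lg_labels <- unname(league_names[lg_sample])
print(lg_labels)
print(unique(lg_sample[is.na(lg_labels)]))
```

Keeping the mapping in one named vector means an undocumented code shows up as NA instead of being silently mislabeled.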

GIDP

The abbreviation “GIDP” might be unclear to those unfamiliar with baseball statistics. Without further explanation, users might not understand that it stands for “Grounded Into Double Plays”.

head(Pitching_Data$GIDP)

## [1] 1 0 1 0 0 0

Why do you think they chose to encode the data the way they did?

The data encoding likely follows standard conventions in sports like baseball. Baseball statistics often use abbreviations and notations that are familiar to analysts and enthusiasts in the baseball community. By encoding the data in this familiar format, the dataset maintains consistency with existing baseball databases, facilitating analysis and comparison across datasets.

What could have happened if you didn’t read the documentation?

Without reading the documentation, I might have misinterpreted the encoded data, which could lead to inaccurate analysis. For example, if I had mistakenly assumed that the abbreviation “ERA” referred to a cricket statistic rather than a baseball statistic, I might have interpreted the data incorrectly or made erroneous comparisons between baseball and cricket statistics. A lack of understanding of the encoding conventions and the meanings of specific variables could lead to flawed conclusions and undermine the reliability of the analysis.

At least one element of your data that is unclear even after reading the documentation.

Despite reading the documentation, the specific calculation method for “ERA” might remain unclear to users unfamiliar with baseball statistics. While the documentation provides the definition of Earned Run Average, it does not detail the formula used to calculate it. Without understanding the calculation method, users might struggle to interpret or analyze the “ERA” values effectively.

To illustrate, compare Earned Run Average (ERA) in baseball with a similar-sounding calculation that people might confuse it with, such as Average Run Rate (ARR) in cricket.

head(Pitching_Data$ERA)

## [1]   1.59   6.75   0.00   1.04  11.25 108.00

In baseball, Earned Run Average (ERA) is calculated by dividing the total number of earned runs allowed by the pitcher by the total number of innings pitched, and then multiplying by nine. In cricket, by contrast, Average Run Rate (ARR) is treated here as a measure of a batsman’s scoring rate, calculated by dividing the total number of runs scored by the total number of innings played.

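earned_runs <- Pitching_Data$ER
# IPouts counts outs recorded; three outs make one inning
innings_pitched <- Pitching_Data$IPouts / 3

# ERA = 9 * earned runs / innings pitched
calculate_era <- function(er, ip) {
  era <- 9 * er / ip
  return(era)
}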

baseball_era <- calculate_era(earned_runs, innings_pitched)
head(baseball_era)

## [1]   1.588235   6.750000   0.000000   1.038462  11.250000 108.000000
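
The value 108 in the sixth position shows how sensitive ERA is to very short appearances. The following is a minimal sketch, using hypothetical inputs rather than rows from the dataset, of how the formula behaves when few or no outs are recorded. The helper era_from_outs is illustrative and not part of the original analysis.

```r
# ERA = 9 * earned runs / innings pitched, with innings = outs / 3.
# Returns NA when no outs were recorded, since the ratio is undefined.
era_from_outs <- function(er, ipouts) {
  ifelse(ipouts > 0, 9 * er / (ipouts / 3), NA_real_)
}

# Hypothetical pitching lines: earned runs and outs recorded.
er     <- c(2, 4, 1, 0)
ipouts <- c(18, 1, 0, 0)

era_example <- era_from_outs(er, ipouts)
print(era_example)
```

With these inputs, the second line (four earned runs, one out) works out to 9 * 4 / (1/3) = 108, and the lines with no outs are returned as NA instead of Inf or NaN.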

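# Deliberately mis-applied formula: reuses earned runs as "runs scored"
# and games (G) divided by 6 as a stand-in for "innings played"
total_runs <- Pitching_Data$ER
innings_played <- Pitching_Data$G / 6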

calculate_arr <- function(runs, innings) {
  arr <- runs / innings
  return(arr)
}

cricket_arr <- calculate_arr(total_runs, innings_played)
head(cricket_arr)

## [1]  6 12  0  6 15 12

data_comparison <- data.frame(Baseball_ERA = baseball_era, Cricket_ARR = cricket_arr)
head(data_comparison)

##   Baseball_ERA Cricket_ARR
## 1     1.588235           6
## 2     6.750000          12
## 3     0.000000           0
## 4     1.038462           6
## 5    11.250000          15
## 6   108.000000          12

Build a visualization which uses a column of data that is affected by the issue you brought up in bullet #2, above. In this visualization, find a way to highlight the issue, and explain what is unclear and why it might be unclear.

Here, let’s visualize how the values would differ if ERA were calculated with a formula similar to the one for ARR.

library(ggplot2)
comparison_data <- data.frame(Baseball_ERA = baseball_era, Cricket_ARR = cricket_arr)

ggplot(comparison_data, aes(x = Baseball_ERA, y = Cricket_ARR)) +
  geom_point(alpha = 0.5) +
  labs(title = "Comparison of Actual and Wrong Calculations",
       x = "Baseball ERA",
       y = "Cricket ARR") +
  theme_minimal()

## Warning: Removed 22 rows containing missing values (`geom_point()`).
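
The warning reports 22 dropped rows but not which ones or why. One way to investigate, sketched here on a hypothetical stand-in data frame (toy_comparison) rather than the real comparison_data, is to count missing values per column before plotting. The values below are made up for illustration.

```r
# Hypothetical stand-in for comparison_data, with a few problem values.
toy_comparison <- data.frame(
  Baseball_ERA = c(1.59, NaN, 0, NA, 11.25),
  Cricket_ARR  = c(6, 12, NA, 6, 15)
)

# Number of missing (NA or NaN) entries in each column.
missing_per_column <- colSums(is.na(toy_comparison))
print(missing_per_column)

# Rows that are complete in both columns and so can be drawn as points.
plottable <- toy_comparison[complete.cases(toy_comparison), ]
print(nrow(plottable))
```

Running the same two checks on the real data frame would show whether the dropped rows come from the ERA column, the ARR column, or both.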

Yes, there are significant risks associated with confusing baseball ERA (Earned Run Average) with cricket ARR (Average Run Rate), especially for users who are not familiar with the context of these statistics. The main risks are data misinterpretation, incorrect analysis, and loss of credibility.

To reduce these negative consequences, the dataset needs clear documentation.