Data Preparation

# Load libraries
library(dplyr)
library(tidyverse)
library(stringr)
library(RCurl)
library(ggplot2)

# Load data
p <- read.csv(file = "people.csv", stringsAsFactors = FALSE)
b <- read.csv(file = "batting.csv", stringsAsFactors = FALSE)

# Join player names to batting records and keep the columns of interest
a <- inner_join(p, b, by = "playerID") %>%
    select(nameGiven, yearID, HR, H) %>%
    group_by(nameGiven, yearID)

Research question

Baseball began testing for steroids and other performance-enhancing drugs (PEDs) in 2003. There is an ongoing debate over whether a player who has tested positive should be elected to the Hall of Fame.

RESEARCH QUESTION: Do the post-2003 statistics of players who tested positive for steroids differ enough from those of players who did not to help predict whether any Hall of Fame players from before 2003 may have used steroids or other performance-enhancing drugs?

Cases

The cases are all Major League players in the joined data: 12,717 in total, counted by unique given name (nameGiven).
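The total of 12,717 comes from length(unique(a$nameGiven)) later in this document. Because different players can share a given name, counting by nameGiven may undercount the cases; counting by playerID (the join key) avoids that. A minimal sketch with made-up rows, where the `toy` data frame and its values are purely illustrative:

```r
# Toy rows standing in for the joined table; all values are invented.
toy <- data.frame(
  playerID  = c("smithjo01", "smithjo01", "smithjo02", "doeja01"),
  nameGiven = c("John", "John", "John", "Jane"),
  yearID    = c(2004L, 2005L, 2004L, 2004L),
  stringsAsFactors = FALSE
)

# Counting by given name merges the two different players named John.
n_by_name <- length(unique(toy$nameGiven))

# Counting by playerID yields one count per distinct player.
n_by_id <- length(unique(toy$playerID))
```

Applying the same idea to the real data would require keeping playerID in the select() call, which the code above does not currently do.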

Data collection

The initial data will come from www.seanlahman.com/baseball-archive/statistics/. Additional sources will need to be retrieved to identify players who have tested positive for steroids.

Type of study

This will be an observational study, as it draws on past baseball statistics.

Data Source

The data was collected from www.seanlahman.com/baseball-archive/statistics/

Dependent Variable

The dependent variable is steroid use and is qualitative

Independent Variable

The independent variables are the baseball statistics, which are quantitative. Whether a player has already tested positive would be a qualitative variable.
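One way the variables could be set up for the comparison is sketched below. It assumes a hypothetical logical column `tested_positive`, which does not exist in the Lahman data and would have to be built from the additional sources, and it treats "post 2003" as yearID >= 2003, which is also an assumption. All values are invented.

```r
# Toy data; tested_positive is a hypothetical flag, not a Lahman column.
toy <- data.frame(
  playerID        = c("a01", "a01", "b01", "b01", "c01"),
  yearID          = c(2002L, 2004L, 2004L, 2005L, 2006L),
  HR              = c(10L, 40L, 12L, 8L, 44L),
  tested_positive = c(TRUE, TRUE, FALSE, FALSE, TRUE)
)

# Keep seasons from the testing era (assumed here to mean 2003 onward).
post <- subset(toy, yearID >= 2003)

# Store steroid status as a qualitative (factor) variable.
post$tested_positive <- factor(post$tested_positive,
                               levels = c(FALSE, TRUE),
                               labels = c("no", "yes"))

# Compare a quantitative statistic (home runs) across the two groups.
mean_hr <- tapply(post$HR, post$tested_positive, mean)
```

The group means in `mean_hr` are only a starting point; whether they differ enough to support prediction is the open question of the study.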

Relevant summary statistics

Both the HR and H histograms (qplots) are right skewed, and the comparison of H to HR is also mostly right skewed. The expectation is that 40 HR / 250 H would mark a Hall of Fame caliber player.
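A quick numeric check for right skew is whether the mean exceeds the median, which is what the summaries in this section show for HR (mean 2.832, median 0) and H (mean 37.01, median 9). A minimal sketch on an invented vector, where `hr` is illustrative and not taken from the data:

```r
# Invented home run counts: many zeros and one large value.
hr <- c(0, 0, 0, 1, 2, 5, 30)

# A positive gap between mean and median suggests a long right tail.
skew_gap <- mean(hr) - median(hr)
is_right_skewed <- skew_gap > 0
```

This is a rule of thumb rather than a formal skewness statistic, so the histograms remain the better evidence.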

glimpse(a)
## Observations: 104,324
## Variables: 4
## $ nameGiven <chr> "David Allan", "David Allan", "David Allan", "David ...
## $ yearID    <int> 2004, 2006, 2007, 2008, 2009, 2010, 2012, 2013, 2015...
## $ HR        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 27, 26, 44, 30, 39, 4...
## $ H         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 131, 189, 200, 198, 196, ...
summary(a$HR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.832   2.000  73.000
summary(a$H)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    9.00   37.01   57.00  262.00
length(unique(a$nameGiven))
## [1] 12717
qplot(a$HR)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(a$H)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

boxplot(a$HR)

plot(a$H~a$HR)

#dotchart(a$H,labels=a$nameGiven)