# Load libraries
library(dplyr)
library(tidyverse)
library(stringr)
library(RCurl)
library(ggplot2)
# Load data
p <- read.csv(file = "people.csv", stringsAsFactors = FALSE)
b <- read.csv(file = "batting.csv", stringsAsFactors = FALSE)
# Join player names to batting records and keep only the columns used below
a <- inner_join(p, b, by = "playerID") %>%
  select(nameGiven, yearID, HR, H) %>%
  group_by(nameGiven, yearID)
Baseball began testing for steroid and performance-enhancing drug (PED) use in 2003. There is a debate over whether a player who has tested positive should be elected to the Hall of Fame.
RESEARCH QUESTION: Do the post-2003 statistics of players who tested positive for steroids differ enough from those of players who did not to be usable for predicting whether any pre-2003 Hall of Fame players may have used steroids or other performance-enhancing drugs?
The cases are all major league players, 12,717 in total (counted as distinct values of nameGiven).
The initial data will come from www.seanlahman.com/baseball-archive/statistics/. Additional sources will need to be retrieved to identify players who have tested positive for steroids.
This will be an observational study, as it takes its data from past baseball statistics.
The data was collected from www.seanlahman.com/baseball-archive/statistics/.
The dependent variable is steroid use, which is qualitative.
The independent variables are the baseball statistics, which are quantitative. Whether a player has already tested positive would be a qualitative (categorical) variable.
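As a rough sketch of how the two groups could be compared once positive tests are known, the snippet below summarizes post-2003 seasons by a flag. Everything in it is an assumption for illustration: the rows are invented, `tested_positive` is a hypothetical column that would have to come from an additional source, and "post 2003" is read here as `yearID > 2003`.

```r
library(dplyr)

# Invented season-level rows; tested_positive is a hypothetical flag,
# not something present in the Lahman files
seasons_toy <- data.frame(
  playerID = c("a01", "a01", "b01", "b01", "c01", "c01"),
  yearID   = c(2002L, 2004L, 2004L, 2005L, 2004L, 2005L),
  HR       = c(20L, 40L, 10L, 12L, 30L, 36L),
  H        = c(150L, 180L, 120L, 130L, 160L, 170L),
  tested_positive = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)
)

# Keep seasons after 2003 and compare average HR and H between the groups
group_summary <- seasons_toy %>%
  filter(yearID > 2003) %>%
  group_by(tested_positive) %>%
  summarise(mean_HR = mean(HR), mean_H = mean(H), n_seasons = n())

print(group_summary)
```

A summary like this only describes the groups. Whether the difference is large enough to predict anything would still need a formal test or model.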
Both the HR and H qplots are right skewed. The comparison of H to HR is also mostly right skewed; the expectation is that 40 HR / 250 H would mark a Hall of Fame caliber player.
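The skew can also be checked numerically rather than read off the plots. The helper below is a minimal sketch using the plain moment-based skewness formula. `skewness_simple` is a name introduced here, not a function from any package, and the input vectors are made-up values.

```r
# Moment-based sample skewness: positive values indicate a longer right tail
skewness_simple <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up counts with many zeros and one large value, similar in shape to HR
right_tailed <- c(0, 0, 0, 1, 10)
symmetric    <- c(1, 2, 3)

skew_right <- skewness_simple(right_tailed)  # greater than zero
skew_sym   <- skewness_simple(symmetric)     # zero for a symmetric sample
```

Applied to the real data, this would be called as `skewness_simple(a$HR)` and `skewness_simple(a$H)`.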
glimpse(a)
## Observations: 104,324
## Variables: 4
## $ nameGiven <chr> "David Allan", "David Allan", "David Allan", "David ...
## $ yearID <int> 2004, 2006, 2007, 2008, 2009, 2010, 2012, 2013, 2015...
## $ HR <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 27, 26, 44, 30, 39, 4...
## $ H <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 131, 189, 200, 198, 196, ...
summary(a$HR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   0.000   0.000   2.832   2.000  73.000
summary(a$H)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00    0.00    9.00   37.01   57.00  262.00
length(unique(a$nameGiven))
## [1] 12717
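Note that the count above is of distinct given names, and two different players can share one. The sketch below uses invented rows to show how counting by `playerID`, the key used in the join, can give a different number. The toy data frames and their IDs are hypothetical stand-ins for people.csv and batting.csv.

```r
library(dplyr)

# Hypothetical stand-ins for people.csv and batting.csv (not real records)
people_toy <- data.frame(
  playerID  = c("smithjo01", "smithjo02", "doeja01"),
  nameGiven = c("John", "John", "Jane"),
  stringsAsFactors = FALSE
)
batting_toy <- data.frame(
  playerID = c("smithjo01", "smithjo02", "doeja01", "doeja01"),
  yearID   = c(2004L, 2005L, 2004L, 2005L),
  HR       = c(3L, 10L, 0L, 1L),
  H        = c(40L, 120L, 5L, 12L),
  stringsAsFactors = FALSE
)

joined_toy <- inner_join(people_toy, batting_toy, by = "playerID")

n_by_name <- n_distinct(joined_toy$nameGiven)  # distinct given names
n_by_id   <- n_distinct(joined_toy$playerID)   # distinct players
```

In this toy case `n_by_name` is 2 while `n_by_id` is 3, because two players share the given name "John". Counting by ID in the real data would require keeping `playerID` in the `select()` call.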
qplot(a$HR)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
qplot(a$H)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
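The `stat_bin()` message asks for an explicit bin width. One way to supply it, sketched here with `ggplot()` and `geom_histogram()` on made-up values, is shown below. The bin width of 5 home runs is an assumption chosen for illustration, not a recommendation derived from the data.

```r
library(ggplot2)

# Made-up home run counts; the real call would use the joined data frame `a`
hr_toy <- data.frame(HR = c(0, 0, 0, 1, 2, 2, 5, 13, 27, 44))

# Explicit binwidth silences the default "bins = 30" message
p_hr <- ggplot(hr_toy, aes(x = HR)) +
  geom_histogram(binwidth = 5, boundary = 0) +
  labs(x = "Home runs per season", y = "Count")

print(p_hr)
```

Setting `boundary = 0` aligns the bin edges at 0, 5, 10, and so on, which keeps the large group of zero-HR seasons in the first bar.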
boxplot(a$HR)
plot(a$H~a$HR)
#dotchart(a$H,labels=a$nameGiven)