DASI Project
For me, the surest sign of spring is the return of baseball, an American pastime that stretches through the warm days of summer. In addition to enjoying televised games and attending local games, I participate in a fantasy league. This is an online, recreational game in which members of my family draft players from across the league and compare their statistics in a categorical, head-to-head competition.
My family plays for fun. However, many Americans have more at stake in fantasy games. A 2012 article on MSN Money estimated the number of fantasy baseball participants at 12.2 million. Large sports media networks such as CBS Sports, ESPN, Yahoo Sports, USA Today, and NBC Sports host fantasy leagues for various sports (American football, baseball, basketball, golf, hockey, and racing). In all, MSN Money projects over $2 billion in yearly “economic impact”. There are leagues that people may pay to join in order to win cash prizes (some payouts run into the thousands of dollars).
Personally, I am content just to beat my sons in the fantasy standings. There are 20 offensive categories that each team tries to win in a weekly matchup. Three of these categories (15%) involve players “stealing bases.” I am wondering: if I select (i.e., draft) players for my team who are very fast, will they lower the overall team offensive production of home runs, RBIs, batting average, etc.? Therefore, my research question is: is a player's speed associated with his overall offensive production?
The baseball data I collected comes from www.rotochamp.com, a fantasy baseball site that publishes past, current, and projected statistics for MLB (Major League Baseball) players. I had to scrape several web pages to gather projections for fielding position players (those who create offensive statistics). I gathered the 2014 “composite” projections into a csv file (see attached page). This citation is for one subset of the data (Position = Out_Fielders): http://www.rotochamp.com/baseball/PlayerRankings.aspx?Position=OF.
Each case or observation has the following form:
names(baseball2014)
## [1] "Pos" "PosRank" "Player" "Team" "AB" "R" "HR"
## [8] "RBI" "SB" "AVG" "OBP" "SLG" "Value" "SBAB"
## [15] "SPD" "OPI"
This table summarizes the variables relevant to this project (not all of the variables in the data set):
| Variable | Description | Type |
| Player | Player’s Name | Categorical |
| AB | Number of “at bats” or opportunities | Numeric-Discrete |
| R | Number of runs scored | Numeric-Discrete |
| HR | Number of home runs | Numeric-Discrete |
| RBI | Number of runs batted in | Numeric-Discrete |
| SB | Number of stolen bases | Numeric-Discrete |
| AVG | Ratio of hits to at bats | Numeric-Continuous |
| OBP | On-base percentage: rate at which the player reaches base per plate appearance | Numeric-Continuous |
| SLG | Slugging percentage: total bases per at bat | Numeric-Continuous |
In order to complete this project, I had to create three additional variables: SBAB, SPD, and OPI.
SBAB is a continuous, numerical ratio of SB / AB (stolen base count to “at bats” or opportunities).
SPD (Speed rating) is a categorical rating based on SBAB quantiles:
| 0 - 20th | Very Slow |
| 21 - 40th | Slow |
| 41 - 60th | Average |
| 61 - 80th | Fast |
| 81 - 99th | Very Fast |
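The quantile-based rating above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `sbab` values are hypothetical (not taken from the rotochamp data), and the use of `quantile()` with `cut()` is my guess at one reasonable way to do the binning, not necessarily how the original SPD column was built.

```r
# Hypothetical SBAB values for ten players (invented for illustration only)
sbab <- c(0.000, 0.004, 0.008, 0.015, 0.021, 0.030, 0.042, 0.055, 0.071, 0.090)

# Quintile cut points: the 0th, 20th, 40th, 60th, 80th, and 100th percentiles
breaks <- quantile(sbab, probs = seq(0, 1, by = 0.2))

# Assign each player to a speed category; include.lowest keeps the minimum value
spd <- cut(sbab, breaks = breaks, include.lowest = TRUE,
           labels = c("Very Slow", "Slow", "Average", "Fast", "Very Fast"))

# Number of players in each speed category
table(spd)
```

One caveat with this approach: if many players share the same SBAB (for example, zero projected steals), the quantile cut points may not be unique and `cut()` will stop with an error, so ties would need separate handling.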
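OPI (Offensive Production Index) is a composite index that I created from the rotochamp data. It multiplies a player's counting stats (R + HR + RBI) by an efficiency term: the average of his rate stats (AVG, OBP, SLG), weighted by the square root of his opportunities (AB).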
OPI = sqrt((R + HR + RBI) * (sqrt (AB) * (AVG + OBP + SLG) / 3))
The square root transformations shape the data into nearly a normal distribution.
One complete observation looks like:
baseball2014[5, ]
## Pos PosRank Player Team AB R HR RBI SB AVG OBP SLG Value
## 5 1B 13 Albert Pujols LAA 484 75 24 83 4 0.277 0.348 0.483 $19.00
## SBAB SPD OPI
## 5 0.008264 Slow 38.46
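As a check on the definitions of SBAB and OPI, the derived values in the observation above can be recomputed from its raw columns. The sketch below copies the projected line for Albert Pujols from that output; the data frame `p` and the helper names `counting` and `efficiency` are my own and are not part of the original data set.

```r
# Projected line for Albert Pujols, copied from the sample observation
p <- data.frame(AB = 484, R = 75, HR = 24, RBI = 83, SB = 4,
                AVG = 0.277, OBP = 0.348, SLG = 0.483)

# SBAB: stolen bases per at bat
p$SBAB <- p$SB / p$AB

# OPI: counting stats times an opportunity-weighted efficiency term, square-rooted
counting   <- p$R + p$HR + p$RBI
efficiency <- sqrt(p$AB) * (p$AVG + p$OBP + p$SLG) / 3
p$OPI <- sqrt(counting * efficiency)

round(p$SBAB, 6)  # 0.008264
round(p$OPI, 2)   # 38.46
```

Both values agree with the SBAB and OPI columns shown for this player.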
This is an observational study; there is no experimental design or treatment. The data consists of statistical projections for actual professional baseball players. Because the data is observational (and projected, at that), the project cannot show causality, only an association. The results should generalize to the professional position players covered by the projections. The inference function will compare the mean OPI across the 5 speed categories.
For starters, I looked at the relationship between speed and each offensive variable (SLG, HR, RBI, and AVG). For brevity, I have included just the summary boxplots below.
This is a purely speculative examination. For the most part, each fantasy team consists of the best players in the league. Team owners look to pick up the top players at each position. Since there are 30 clubs in MLB and the fantasy league consists of 8 teams, I am going to quickly look at how the top 25% of players in each offensive category are distributed across the speed categories. Again, this is merely a rough estimate of how offensive stats are related to speed, and it will help me formulate a hypothesis later.
## Very Slow Slow Average Fast Very Fast
## HR 26 31 26 21 17
## SLG 20 33 27 19 17
## AVG 14 21 25 30 32
## RBI 21 29 30 23 14
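A table like the one above could be produced along these lines. This is a minimal sketch, not the code used for this report: the `toy` data frame and its values are invented, the helper `top_quartile_counts()` is hypothetical, and I am assuming "top 25%" means at or above the 75th percentile of each stat.

```r
# Hypothetical mini data set (values invented for illustration only)
toy <- data.frame(
  SPD = factor(c("Slow", "Fast", "Very Fast", "Slow",
                 "Average", "Very Slow", "Fast", "Average"),
               levels = c("Very Slow", "Slow", "Average", "Fast", "Very Fast")),
  HR  = c(30, 12, 8, 25, 18, 22, 10, 15),
  AVG = c(0.260, 0.295, 0.301, 0.255, 0.270, 0.248, 0.288, 0.275)
)

# Count, per speed category, the players at or above the 75th percentile of a stat
top_quartile_counts <- function(df, stat) {
  cutoff <- quantile(df[[stat]], probs = 0.75, na.rm = TRUE)
  table(df$SPD[df[[stat]] >= cutoff])
}

# One row per offensive stat, one column per speed category
top_counts <- t(sapply(c("HR", "AVG"), function(s) top_quartile_counts(toy, s)))
top_counts
```

With the real data, the same helper would be applied to `baseball2014` for HR, SLG, AVG, and RBI.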
From the above exploration, I can see that each offensive stat has a different relationship with speed. Batting average does not appear to be hurt by speed (the top-quartile counts actually rise from 14 for Very Slow to 32 for Very Fast), whereas HR and SLG are inversely related to speed. Although these relationships are interesting, they are too fragmented to support a clear and concise statement about the relationship of speed with offense. In order to tell the overall impact of speed on offensive production, I am going to need the OPI to SPD analysis of variance below.
Since I have one numeric and one categorical variable, I am going to use ANOVA with the inference function.
My hypothesis for the ANOVA is:
H0: mean(SPD = “Very Slow”) = mean(SPD = “Slow”) = mean(SPD = “Average”) = mean(SPD = “Fast”) = mean(SPD = “Very Fast”)
HA: at least one group mean is different from the others
################ Inference Function for ANOVA of means: ##############
inference(y = baseball2014$OPI, x = baseball2014$SPD, est = "mean", type = "ht",
alternative = "greater", method = "theoretical", eda_plot = FALSE)
## Response variable: numerical, Explanatory variable: categorical
## ANOVA
## Summary statistics:
## n_Average = 91, mean_Average = 23.68, sd_Average = 11.16
## n_Fast = 89, mean_Fast = 24.2, sd_Fast = 9.995
## n_Slow = 94, mean_Slow = 24.4, sd_Slow = 10.22
## n_Very Fast = 87, mean_Very Fast = 25.26, sd_Very Fast = 10.24
## n_Very Slow = 84, mean_Very Slow = 21.43, sd_Very Slow = 10.97
## H_0: All means are equal.
## H_A: At least one mean is different.
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 4 704 176 1.59 0.18
## Residuals 440 48731 111
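The F statistic in the table above can be approximately reconstructed from the group summary statistics alone. The sketch below uses only the sample sizes, means, and standard deviations printed in the output; because those are rounded, the result should land close to, but not exactly on, the reported F of 1.59 and p-value of 0.18. The variable names are my own.

```r
# Group sizes, means, and standard deviations as reported in the output above
n <- c(Average = 91, Fast = 89, Slow = 94, `Very Fast` = 87, `Very Slow` = 84)
m <- c(23.68, 24.2, 24.4, 25.26, 21.43)
s <- c(11.16, 9.995, 10.22, 10.24, 10.97)

k <- length(n)   # number of groups
N <- sum(n)      # total number of players
grand_mean <- sum(n * m) / N

ss_between <- sum(n * (m - grand_mean)^2)  # variation between group means
ss_within  <- sum((n - 1) * s^2)           # variation within groups

# F = mean square between / mean square within, with k - 1 and N - k df
f_stat  <- (ss_between / (k - 1)) / (ss_within / (N - k))
p_value <- pf(f_stat, df1 = k - 1, df2 = N - k, lower.tail = FALSE)

round(c(F = f_stat, p = p_value), 2)
```

The degrees of freedom (4 and 440) match the Df column, and the two sums of squares come out near the reported 704 and 48731.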
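# SPD levels sort alphabetically, so "at" places the boxes from Very Slow (-2) to Very Fast (2)
boxplot(baseball2014$OPI ~ baseball2014$SPD, main = "OPI by Speed Rating", col = "#33669925",
at = c(0, 1, -1, 2, -2))
# Dashed reference line at the overall mean OPI
abline(h = mean(baseball2014$OPI, na.rm = TRUE), col = "blue", lty = 2, lwd = 2)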
Conditions for ANOVA and hypothesis testing: From the OPI by Speed Rating plot above, I can tell the conditions for ANOVA were met. There is independence within each group and between each group. Each baseball player is independent of another and is only listed in one SPD category. Furthermore, each boxplot shows approximate normality and equal variance for each group.
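The equal-variance condition can also be checked numerically rather than by eye. A minimal sketch, using the group standard deviations printed in the inference output; the "ratio below 2" threshold in the comment is a common rule of thumb that I am assuming here, not something stated in the course materials.

```r
# Group standard deviations of OPI, copied from the inference output
s <- c(Average = 11.16, Fast = 9.995, Slow = 10.22,
       `Very Fast` = 10.24, `Very Slow` = 10.97)

# Ratio of the largest to the smallest group SD;
# a ratio well below 2 is often treated as "roughly equal variance"
sd_ratio <- max(s) / min(s)
round(sd_ratio, 2)  # 1.12

# With the full data frame, the same SDs could be obtained with:
# tapply(baseball2014$OPI, baseball2014$SPD, sd, na.rm = TRUE)
```

The group spreads are very close to one another, which is consistent with the boxplots.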
From the inference function and ANOVA analysis, the F value (1.59) is small and the p-value (0.18) is larger than the conventional 0.05 significance level. Therefore, I fail to reject the null hypothesis. In other words, the data does not provide convincing evidence that the mean OPI differs across the SPD (speed rating) categories.
In the end, it seems to be much ado about nothing. The p-value of 0.18 > 0.05 tells me that I cannot dismiss the hypothesis that the mean OPIs across the SPD categories are the same. A look at the side-by-side boxplots shows how similar the groups are. As far as my fantasy draft is concerned, I should be able to find speedy players who will not hurt my overall offensive production. That said, I should probably start to worry about my pitching now…