The goal of this assignment is to explore the relationship between the salaries (numerical, continuous) of professional MLB players and the position they play on the field (categorical, unordered). I have chosen Lahman’s Baseball Database as the data source and selected these two variables of interest in order to explore the relationship between them. The units of observation (cases) in this retrospective observational study are individual players; the explanatory variable, $pos, is an unordered categorical variable.
SELECT CONCAT(m.nameFirst, ' ', m.nameLast) AS playerName,
       m.weight, m.height, m.birthCountry, m.birthCity, m.birthState,
       m.college, s.salary, f.pos, f.G
FROM Master m
JOIN Salaries s ON (s.playerID = m.playerID)
JOIN Fielding f ON (f.playerID = m.playerID)
WHERE f.yearID > 1980 AND f.teamID = 'SFN'
GROUP BY m.playerID
ORDER BY m.college DESC;
The Lahman Baseball Database contains pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2013. It includes data from the two current leagues (American and National), the four other “major” leagues (American Association, Union Association, Players League, and Federal League), and the National Association of 1871-1875. The database was created by Sean Lahman, who pioneered the effort to make baseball statistics freely available to the general public, and it is the largest and most accurate source for baseball statistics available anywhere.
# Data Summary
dim(SFG_data)
## [1] 450 10
summary(SFG_data$salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60000 109000 170000 254000 350000 2030000
summary(SFG_data$pos)
## 1B 2B 3B C CF DH LF OF P RF SS
## 57 50 12 23 54 5 13 10 211 2 13
This is an observational study to determine whether differences in the salaries of major league baseball players are related to their position. Of the 30 major league teams, I have selected the San Francisco Giants of the National League (“NL”), using seasons after 1980.
plot(SFG_data$pos)
plot(SFG_data$salary ~ SFG_data$pos)
plot(SFG_data$G ~ SFG_data$pos)
plot(SFG_data$height ~ SFG_data$pos)
plot(SFG_data$weight ~ SFG_data$pos)
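# Position is on the x-axis and salary on the y-axis, so label them accordingly.
plot(SFG_data$salary ~ SFG_data$pos, xlab = "Position",
     ylab = "MLB Giants Salary", main = "San Francisco Giants: Salary by Position")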
Because this is not a randomized study, and each team may have its own philosophy and market, no general inference about MLB can be made yet. Although the players' and owners' associations have set limits on minimum salaries, each team pursues its own strategy for recruiting from colleges and amateur baseball leagues in the US and internationally. These relationships may confound any attempt to generalize beyond the team, region, or division. Further study could compare samples from the National League against the American League (which uses a designated hitter, “DH”), and distinguish large markets such as Los Angeles or New York from smaller markets such as Seattle or Oakland. It is also important to note that, in the San Francisco data, DH is recorded only for the limited number of games played against American League teams; the counts are too few to rely upon. The same is true of right field (“RF”).
With these limitations in mind, the question is whether the data meet the ANOVA conditions for the SF Giants:
1. Independence: within each position group, the sampled observations must be independent; between groups, the groups must be independent of each other (non-paired).
2. Approximate normality: distributions should be nearly normal within each group. A number of the groups are right skewed, which reduces the reliability of the analysis.
3. Equal variance: groups should have roughly equal variability. This condition is also a bit forced, but for the sake of the assignment we will allow “roughly” to be very rough.
Given these qualifications, we ask whether the data provide convincing evidence that at least one pair of position means differ from each other (HA), or whether the observed differences in sample means are attributable to sampling variability, or chance (H0: all position means are equal).
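The normality and equal-variance conditions can be checked numerically before fitting. The sketch below is only an illustration: it uses a small simulated data frame, `toy`, with made-up salary values standing in for `SFG_data`, and relies only on base R functions.

```r
# Simulated stand-in for SFG_data; the values are invented for illustration.
set.seed(1)
toy <- data.frame(
  pos    = factor(rep(c("1B", "C", "P"), times = c(20, 15, 40))),
  salary = c(rlnorm(20, 12.3, 0.6), rlnorm(15, 12.0, 0.6), rlnorm(40, 12.2, 0.6))
)

# Group sizes, means, and standard deviations by position.
group_n    <- tapply(toy$salary, toy$pos, length)
group_mean <- tapply(toy$salary, toy$pos, mean)
group_sd   <- tapply(toy$salary, toy$pos, sd)
print(cbind(n = group_n, mean = group_mean, sd = group_sd))

# Informal equal-variance check: ratio of largest to smallest group SD.
sd_ratio <- max(group_sd) / min(group_sd)

# Bartlett's test of equal variances (itself sensitive to non-normality).
bt <- bartlett.test(salary ~ pos, data = toy)

# Shapiro-Wilk normality test within each position group.
sw_p <- tapply(toy$salary, toy$pos, function(x) shapiro.test(x)$p.value)
```

On the real data the same calls would take `SFG_data` in place of `toy`. Groups as small as DH (5) or RF (2) give these checks very little power, which is consistent with the caution above about those positions.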
fit <- aov(SFG_data$salary ~ SFG_data$pos, data=SFG_data)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## SFG_data$pos 10 6.95e+11 6.95e+10 1.21 0.28
## Residuals 439 2.53e+13 5.76e+10
drop1(fit,~.,test="F")
## Single term deletions
##
## Model:
## SFG_data$salary ~ SFG_data$pos
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 2.53e+13 11160
## SFG_data$pos 10 6.95e+11 2.60e+13 11153 1.21 0.28
Based on this analysis, the p-value (0.28) is well above the usual 0.05 significance level, so we fail to reject H0. The data do not provide convincing evidence that any pair of population means differ from each other; the observed differences in sample means are plausibly attributable to sampling variability (or chance).
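As a check on the table, the F statistic is the ratio of the between-group mean square to the residual mean square, and the p-value is the upper tail of the F distribution with 10 and 439 degrees of freedom. The sketch below recomputes both from the rounded values printed in the ANOVA table, so small rounding differences from R's own output are to be expected.

```r
# Values copied from the ANOVA table above (rounded as printed).
ms_between <- 6.95e10  # Mean Sq for position
ms_within  <- 5.76e10  # Mean Sq for residuals
df_between <- 10       # 11 positions - 1
df_within  <- 439      # 450 rows - 11 positions

# F statistic: between-group variability relative to within-group variability.
f_stat <- ms_between / ms_within

# Upper-tail probability of an F at least this large if H0 is true.
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

print(round(f_stat, 2))
print(p_value)
```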
The updated version of the database contains complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. Much of the raw data contained in this database comes from the work of Pete Palmer, who has contributed to baseball encyclopedias published since 1974 and is largely responsible for computerizing the batting, pitching, and fielding data; see http://sabr.org/cmsfiles/PalmerDatabaseHistory.pdf. The database is copyright 1996-2014 by Sean Lahman under a Creative Commons Attribution-ShareAlike 3.0 Unported License and is used by SABR101x under the Creative Commons BY-SA 3.0 license; for details see http://creativecommons.org/licenses/by-sa/3.0/. BU has slightly modified the database, adding a few indexes.
head(SFG_data)
## playerName weight height birthCountry birthCity birthState
## 1 Brad Hennessey 195 74 USA Toledo OH
## 2 Dave Dravecky 195 73 USA Youngstown OH
## 3 Jack Taschner 205 75 USA Milwaukee WI
## 4 Alvin Morman 210 75 USA Rockingham NC
## 5 Chris Ray 210 75 USA Tampa FL
## 6 Doug Mirabelli 205 73 USA Kingman AZ
## college salary pos G
## 1 YSU 400000 P 7
## 2 Youngstown State 240000 P 18
## 3 Wisconsin-Oshkosh 330000 P 24
## 4 Wingate 109000 P 9
## 5 William & Mary 335000 P 28
## 6 Wichita State 109000 C 8
The alternative linear models below reinforce how limited any inferences drawn from these data are.
SFG_fit <- lm(salary ~ pos , SFG_data)
summary(SFG_fit)
##
## Call:
## lm(formula = salary ~ pos, data = SFG_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -398000 -148470 -82123 97354 1749796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 278775 31785 8.77 <2e-16 ***
## pos2B -56208 46498 -1.21 0.227
## pos3B -38609 76218 -0.51 0.613
## posC -87058 59280 -1.47 0.143
## posCF -62361 45571 -1.37 0.172
## posDH 209225 111927 1.87 0.062 .
## posLF 9396 73757 0.13 0.899
## posOF 96466 82274 1.17 0.242
## posP -21305 35822 -0.59 0.552
## posRF 19225 172638 0.11 0.911
## posSS -30737 73757 -0.42 0.677
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 240000 on 439 degrees of freedom
## Multiple R-squared: 0.0268, Adjusted R-squared: 0.00458
## F-statistic: 1.21 on 10 and 439 DF, p-value: 0.284
SFG_nfit <- lm(salary ~ pos - 1, SFG_data)
summary(SFG_nfit)
##
## Call:
## lm(formula = salary ~ pos - 1, data = SFG_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -398000 -148470 -82123 97354 1749796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## pos1B 278775 31785 8.77 < 2e-16 ***
## pos2B 222568 33937 6.56 1.5e-10 ***
## pos3B 240167 69274 3.47 0.00058 ***
## posC 191717 50038 3.83 0.00015 ***
## posCF 216414 32656 6.63 1.0e-10 ***
## posDH 488000 107319 4.55 7.0e-06 ***
## posLF 288172 66557 4.33 1.9e-05 ***
## posOF 375242 75886 4.94 1.1e-06 ***
## posP 257470 16520 15.58 < 2e-16 ***
## posRF 298000 169687 1.76 0.07976 .
## posSS 248038 66557 3.73 0.00022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 240000 on 439 degrees of freedom
## Multiple R-squared: 0.539, Adjusted R-squared: 0.528
## F-statistic: 46.8 on 11 and 439 DF, p-value: <2e-16
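Dropping the intercept changes the parameterisation, not the fit. Each coefficient becomes a group mean tested against zero, which is why nearly every row is flagged as significant. For example, the 1B intercept plus the 2B offset from the first model, 278775 - 56208 = 222567, matches the `pos2B` estimate here up to rounding. The jump in R-squared to 0.539 arises because, without an intercept, R measures explained variation relative to zero rather than to the overall mean. A minimal sketch on simulated data (the `toy` data frame and its values are hypothetical):

```r
# Simulated stand-in data; values are invented for illustration.
set.seed(2)
toy <- data.frame(
  pos    = factor(rep(c("1B", "2B", "C"), each = 10)),
  salary = rnorm(30, mean = rep(c(280, 220, 190), each = 10), sd = 50)
)

fit_int   <- lm(salary ~ pos, data = toy)      # baseline + offsets
fit_noint <- lm(salary ~ pos - 1, data = toy)  # one coefficient per group

# Without an intercept, the coefficients are the group means.
group_means <- tapply(toy$salary, toy$pos, mean)
means_match <- all.equal(unname(coef(fit_noint)), as.vector(group_means))

# Both parameterisations give the same fitted values, hence the same residuals.
same_fit <- all.equal(unname(resid(fit_int)), unname(resid(fit_noint)))

print(means_match)
print(same_fit)
```

The identical residual summaries in the two outputs above reflect the same point: the two models describe the same group means.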
SFG_baseC <- relevel(SFG_data$pos,"C")
SFG_baseC_fit <- lm(SFG_data$salary ~ SFG_data$pos, SFG_baseC)
summary(SFG_baseC_fit)
##
## Call:
## lm(formula = SFG_data$salary ~ SFG_data$pos, data = SFG_baseC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -398000 -148470 -82123 97354 1749796
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 278775 31785 8.77 <2e-16 ***
## SFG_data$pos2B -56208 46498 -1.21 0.227
## SFG_data$pos3B -38609 76218 -0.51 0.613
## SFG_data$posC -87058 59280 -1.47 0.143
## SFG_data$posCF -62361 45571 -1.37 0.172
## SFG_data$posDH 209225 111927 1.87 0.062 .
## SFG_data$posLF 9396 73757 0.13 0.899
## SFG_data$posOF 96466 82274 1.17 0.242
## SFG_data$posP -21305 35822 -0.59 0.552
## SFG_data$posRF 19225 172638 0.11 0.911
## SFG_data$posSS -30737 73757 -0.42 0.677
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 240000 on 439 degrees of freedom
## Multiple R-squared: 0.0268, Adjusted R-squared: 0.00458
## F-statistic: 1.21 on 10 and 439 DF, p-value: 0.284
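The coefficients in this last fit are identical to those of the first model, with 1B still the baseline and a `posC` row still present. The formula refers to `SFG_data$pos`, so the releveled factor stored in `SFG_baseC` never enters the model. One way to make catcher the reference level is to put the releveled factor in the formula itself. The sketch below shows this on simulated data; the `toy` data frame and the column name `pos_c` are hypothetical.

```r
# Simulated stand-in data; values are invented for illustration.
set.seed(3)
toy <- data.frame(
  pos    = factor(rep(c("1B", "C", "P"), each = 10)),
  salary = rnorm(30, mean = rep(c(280, 190, 260), each = 10), sd = 50)
)

# Store the releveled factor as a column and use that column in the formula.
toy$pos_c <- relevel(toy$pos, ref = "C")
fit_c <- lm(salary ~ pos_c, data = toy)

# The intercept is now the catcher mean; other rows are offsets from catcher.
catcher_mean <- mean(toy$salary[toy$pos == "C"])
print(names(coef(fit_c)))
print(all.equal(unname(coef(fit_c)[1]), catcher_mean))
```

Applied to the real data, the equivalent would be a call of the form `lm(salary ~ relevel(pos, ref = "C"), data = SFG_data)`. The overall F test and R-squared would not change; only the comparisons reported in the coefficient table would.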