Introduction:

The goal of this assignment is to identify the relationship between the salaries (numerical, continuous) of professional MLB players and the position they play on the field (categorical, unordered). I have chosen Lahman’s Baseball Database and selected these two variables of interest in order to explore the relationship between them. In this retrospective observational study the units of observation (cases) are individual players; the explanatory variable is $pos, an unordered categorical variable, and the response is $salary.

The SQL Sandbox query used for the selected team is as follows:

-- The joins match on playerID only, and GROUP BY keeps one row per player,
-- so which season's salary, position and games are returned is left to the database engine.
SELECT CONCAT(m.nameFirst, ' ', m.nameLast) AS playerName,
       m.weight, m.height, m.birthCountry, m.birthCity, m.birthState,
       m.college, s.salary, f.pos, f.G
FROM Master m
JOIN Salaries s ON s.playerID = m.playerID
JOIN Fielding f ON f.playerID = m.playerID
WHERE f.yearID > 1980
  AND f.teamID = 'SFN'
GROUP BY m.playerID
ORDER BY m.college DESC;

Data:

The Lahman Baseball Database contains pitching, hitting, and fielding statistics for Major League Baseball from 1871 through 2013. It includes data from the two current leagues (American and National), the four other “major” leagues (American Association, Union Association, Players League, and Federal League), and the National Association of 1871-1875. The database was created by Sean Lahman, who pioneered the effort to make baseball statistics freely available to the general public. It is described as the largest and most accurate source for baseball statistics available anywhere.

# Data Summary
dim(SFG_data)
## [1] 450  10
summary(SFG_data$salary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   60000  109000  170000  254000  350000 2030000
summary(SFG_data$pos)
##  1B  2B  3B   C  CF  DH  LF  OF   P  RF  SS 
##  57  50  12  23  54   5  13  10 211   2  13
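
The overall salary summary above does not show how salaries vary from one position to another. A minimal sketch of a per-position summary follows. It uses a small made-up data frame (toy and its values are hypothetical stand-ins for SFG_data), so the numbers say nothing about the Giants.

```r
# Toy stand-in for SFG_data; the values are invented for illustration only
toy <- data.frame(
  salary = c(100000, 200000, 300000, 150000, 250000, 400000),
  pos    = factor(c("P", "P", "P", "C", "C", "1B"))
)

# Number of players, mean salary and median salary for each position
group_summary <- data.frame(
  n      = as.vector(table(toy$pos)),
  mean   = as.vector(tapply(toy$salary, toy$pos, mean)),
  median = as.vector(tapply(toy$salary, toy$pos, median)),
  row.names = levels(toy$pos)
)
print(group_summary)
```

Replacing toy with SFG_data should give the same kind of table for the real data, assuming pos is stored as a factor.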

Exploratory data analysis:

This is an observational study to determine whether differences in the salaries of major league baseball players are related to their position. Of the 30 current major league teams, I have selected the San Francisco Giants of the National League (“NL”), restricting the data to seasons after 1980.

plot(SFG_data$pos)

[Figure: bar plot of the number of players at each position]

plot(SFG_data$salary ~ SFG_data$pos)

[Figure: box plots of salary by position]

plot(SFG_data$G ~ SFG_data$pos)

[Figure: box plots of games played (G) by position]

plot(SFG_data$height ~ SFG_data$pos)

[Figure: box plots of height by position]

plot(SFG_data$weight ~ SFG_data$pos)

[Figure: box plots of weight by position]

plot(SFG_data$salary~SFG_data$pos,xlab="Position",
  ylab="MLB Giants Salary", main="San Francisco Giants")

[Figure: labeled box plots of salary by position]

Inference:

Because this is not a randomised study, and each team may have its own philosophy and market, no general inference about MLB can be made yet. Although the players' and owners' associations have set minimum salaries, each team pursues its own strategy for recruiting from colleges and amateur baseball leagues in the US and internationally. These relationships may confound any attempt to generalize beyond the team, region or division. Further study could compare samples from the National League with samples from the American League (which uses a designated hitter, “DH”), and distinguish large markets such as Los Angeles or New York from smaller markets such as Seattle or Oakland.

It is also important to note that, for San Francisco, DH is recorded only for the limited number of games played against American League teams. The numbers are too few to rely upon. That is also the case for right field (“RF”).

With these limitations in mind, the question is whether the San Francisco Giants data meet the ANOVA conditions:

1. Independence: within position groups, the sampled observations must be independent; between groups, the groups must be independent of each other (non-paired).
2. Approximate normality: distributions should be nearly normal within each group. A number of them are right skewed, which reduces the reliability of the analysis.
3. Equal variance: groups should have roughly equal variability. This condition is also a bit forced, but for the sake of the assignment we will allow “rough” to be very rough.

Given these qualifications, we ask whether the data provide convincing evidence that at least one pair of position means differ (HA), or whether the observed differences in sample means are attributable to sampling variability, or chance (H0).
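
The normality and equal-variance conditions can be inspected numerically before fitting the model. The following is only a sketch on simulated, right-skewed data: the data frame toy, its group sizes and the lognormal parameters are all assumptions made for illustration and are not taken from SFG_data. Comparing the largest and smallest group standard deviations is one informal check of equal variance, and the Shapiro-Wilk test is one of several possible normality checks.

```r
set.seed(1)

# Simulated right-skewed salaries for three hypothetical position groups
toy <- data.frame(
  salary = c(rlnorm(40, meanlog = 12, sdlog = 0.6),
             rlnorm(25, meanlog = 12, sdlog = 0.6),
             rlnorm(10, meanlog = 12, sdlog = 0.6)),
  pos = factor(rep(c("P", "1B", "C"), times = c(40, 25, 10)))
)

# Equal variance: ratio of the largest to the smallest group standard deviation
group_sd <- tapply(toy$salary, toy$pos, sd)
sd_ratio <- max(group_sd) / min(group_sd)

# Approximate normality: Shapiro-Wilk p-value within each group
shapiro_p <- tapply(toy$salary, toy$pos,
                    function(x) shapiro.test(x)$p.value)

# A log transform often reduces right skew; compare the spread again
log_sd <- tapply(log(toy$salary), toy$pos, sd)

print(group_sd)
print(sd_ratio)
print(shapiro_p)
print(log_sd)
```

Small Shapiro-Wilk p-values or a large standard deviation ratio would suggest that the ANOVA result should be read with extra caution.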

fit <- aov(SFG_data$salary ~ SFG_data$pos, data=SFG_data)
summary(fit)
##               Df   Sum Sq  Mean Sq F value Pr(>F)
## SFG_data$pos  10 6.95e+11 6.95e+10    1.21   0.28
## Residuals    439 2.53e+13 5.76e+10
drop1(fit,~.,test="F") 
## Single term deletions
## 
## Model:
## SFG_data$salary ~ SFG_data$pos
##              Df Sum of Sq      RSS   AIC F value Pr(>F)
## <none>                    2.53e+13 11160               
## SFG_data$pos 10  6.95e+11 2.60e+13 11153    1.21   0.28
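
The F value in the tables above is the ratio of the between-position mean square to the residual mean square. As a rough check, it can be recomputed from the rounded sums of squares and degrees of freedom printed in the output. The variable names below are illustrative, and because the inputs are rounded the results only approximate the printed values.

```r
# Rounded values copied from the ANOVA table
ss_group <- 6.95e11  # sum of squares for position
df_group <- 10
ss_resid <- 2.53e13  # residual sum of squares
df_resid <- 439

# Mean squares are sums of squares divided by their degrees of freedom
ms_group <- ss_group / df_group
ms_resid <- ss_resid / df_resid

# F statistic and its upper-tail p-value under H0
f_stat  <- ms_group / ms_resid
p_value <- pf(f_stat, df_group, df_resid, lower.tail = FALSE)

print(f_stat)
print(p_value)
```

The recomputed F statistic should be close to 1.21 and the p-value close to 0.28, in line with the table.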

Conclusion:

Based on this analysis, the p-value (0.28) is large, so we fail to reject H0. The data do not provide convincing evidence that at least one pair of position means differ; the observed differences in sample means are plausibly attributable to sampling variability (chance).

References:

The updated version of the database contains complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more. Much of the raw data contained in this database comes from the work of Pete Palmer, who has contributed to baseball encyclopedias published since 1974 and is largely responsible for computerizing the batting, pitching, and fielding data. See http://sabr.org/cmsfiles/PalmerDatabaseHistory.pdf. The database is copyright 1996-2014 by Sean Lahman under a Creative Commons Attribution-ShareAlike 3.0 Unported License and is used by SABR101x under the Creative Commons BY-SA 3.0 license; for details see http://creativecommons.org/licenses/by-sa/3.0/. BU has slightly modified the database, adding a few indexes.

Appendix:

head(SFG_data)
##       playerName weight height birthCountry  birthCity birthState
## 1 Brad Hennessey    195     74          USA     Toledo         OH
## 2  Dave Dravecky    195     73          USA Youngstown         OH
## 3  Jack Taschner    205     75          USA  Milwaukee         WI
## 4   Alvin Morman    210     75          USA Rockingham         NC
## 5      Chris Ray    210     75          USA      Tampa         FL
## 6 Doug Mirabelli    205     73          USA    Kingman         AZ
##             college salary pos  G
## 1               YSU 400000   P  7
## 2  Youngstown State 240000   P 18
## 3 Wisconsin-Oshkosh 330000   P 24
## 4           Wingate 109000   P  9
## 5    William & Mary 335000   P 28
## 6     Wichita State 109000   C  8

The alternative linear models below reinforce the limits on the significance of any inferences drawn from this data.

SFG_fit <- lm(salary ~ pos , SFG_data)
summary(SFG_fit)
## 
## Call:
## lm(formula = salary ~ pos, data = SFG_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -398000 -148470  -82123   97354 1749796 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   278775      31785    8.77   <2e-16 ***
## pos2B         -56208      46498   -1.21    0.227    
## pos3B         -38609      76218   -0.51    0.613    
## posC          -87058      59280   -1.47    0.143    
## posCF         -62361      45571   -1.37    0.172    
## posDH         209225     111927    1.87    0.062 .  
## posLF           9396      73757    0.13    0.899    
## posOF          96466      82274    1.17    0.242    
## posP          -21305      35822   -0.59    0.552    
## posRF          19225     172638    0.11    0.911    
## posSS         -30737      73757   -0.42    0.677    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 240000 on 439 degrees of freedom
## Multiple R-squared:  0.0268, Adjusted R-squared:  0.00458 
## F-statistic: 1.21 on 10 and 439 DF,  p-value: 0.284
SFG_nfit <- lm(salary ~ pos - 1, SFG_data)
summary(SFG_nfit)
## 
## Call:
## lm(formula = salary ~ pos - 1, data = SFG_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -398000 -148470  -82123   97354 1749796 
## 
## Coefficients:
##       Estimate Std. Error t value Pr(>|t|)    
## pos1B   278775      31785    8.77  < 2e-16 ***
## pos2B   222568      33937    6.56  1.5e-10 ***
## pos3B   240167      69274    3.47  0.00058 ***
## posC    191717      50038    3.83  0.00015 ***
## posCF   216414      32656    6.63  1.0e-10 ***
## posDH   488000     107319    4.55  7.0e-06 ***
## posLF   288172      66557    4.33  1.9e-05 ***
## posOF   375242      75886    4.94  1.1e-06 ***
## posP    257470      16520   15.58  < 2e-16 ***
## posRF   298000     169687    1.76  0.07976 .  
## posSS   248038      66557    3.73  0.00022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 240000 on 439 degrees of freedom
## Multiple R-squared:  0.539,  Adjusted R-squared:  0.528 
## F-statistic: 46.8 on 11 and 439 DF,  p-value: <2e-16
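# Note: the formula below refers to SFG_data$pos directly, so the releveled
# factor SFG_baseC is never used and the fit repeats the 1B-baseline model.
SFG_baseC <- relevel(SFG_data$pos,"C")
SFG_baseC_fit <- lm(SFG_data$salary ~ SFG_data$pos, SFG_baseC)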
summary(SFG_baseC_fit)
## 
## Call:
## lm(formula = SFG_data$salary ~ SFG_data$pos, data = SFG_baseC)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -398000 -148470  -82123   97354 1749796 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      278775      31785    8.77   <2e-16 ***
## SFG_data$pos2B   -56208      46498   -1.21    0.227    
## SFG_data$pos3B   -38609      76218   -0.51    0.613    
## SFG_data$posC    -87058      59280   -1.47    0.143    
## SFG_data$posCF   -62361      45571   -1.37    0.172    
## SFG_data$posDH   209225     111927    1.87    0.062 .  
## SFG_data$posLF     9396      73757    0.13    0.899    
## SFG_data$posOF    96466      82274    1.17    0.242    
## SFG_data$posP    -21305      35822   -0.59    0.552    
## SFG_data$posRF    19225     172638    0.11    0.911    
## SFG_data$posSS   -30737      73757   -0.42    0.677    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 240000 on 439 degrees of freedom
## Multiple R-squared:  0.0268, Adjusted R-squared:  0.00458 
## F-statistic: 1.21 on 10 and 439 DF,  p-value: 0.284
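
The three fits above describe the same group means in different ways. With an intercept, each coefficient is a difference from the baseline position. With "- 1", each coefficient is the group mean itself, and R reports R-squared relative to zero rather than to the overall mean, which is why 0.539 and 0.0268 are not comparable. To change the baseline, the releveled factor has to be the variable named in the formula. A minimal sketch on made-up data follows; toy, pos_c and the salary values are hypothetical and are not part of the original analysis.

```r
# Toy stand-in for SFG_data; the values are invented for illustration only
toy <- data.frame(
  salary = c(100000, 200000, 300000, 150000, 250000, 400000, 500000),
  pos    = factor(c("P", "P", "P", "C", "C", "1B", "1B"))
)

# Default baseline is the first factor level ("1B")
fit_default <- lm(salary ~ pos, data = toy)

# Store the releveled factor in the data frame and use it in the formula,
# so that catcher ("C") becomes the baseline
toy$pos_c <- relevel(toy$pos, ref = "C")
fit_c <- lm(salary ~ pos_c, data = toy)

# Without an intercept the coefficients are the group means
fit_means <- lm(salary ~ pos - 1, data = toy)

print(coef(fit_default))
print(coef(fit_c))
print(coef(fit_means))
```

All three models give identical fitted values; only the meaning of the coefficients, and of the t-tests attached to them, changes.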