This analysis uses the Hitters dataset from the ISLR package, a collection of Major League Baseball (MLB) data from the 1986 and 1987 seasons.
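library(ISLR)    # provides the Hitters dataset
library(dplyr)   # %>%, select(), filter(), group_by(), summarize()
library(ggplot2) # ggplot() bar charts
hitters <- Hitters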
str(hitters)
## 'data.frame': 322 obs. of 20 variables:
## $ AtBat : int 293 315 479 496 321 594 185 298 323 401 ...
## $ Hits : int 66 81 130 141 87 169 37 73 81 92 ...
## $ HmRun : int 1 7 18 20 10 4 1 0 6 17 ...
## $ Runs : int 30 24 66 65 39 74 23 24 26 49 ...
## $ RBI : int 29 38 72 78 42 51 8 24 32 66 ...
## $ Walks : int 14 39 76 37 30 35 21 7 8 65 ...
## $ Years : int 1 14 3 11 2 11 2 3 2 13 ...
## $ CAtBat : int 293 3449 1624 5628 396 4408 214 509 341 5206 ...
## $ CHits : int 66 835 457 1575 101 1133 42 108 86 1332 ...
## $ CHmRun : int 1 69 63 225 12 19 1 0 6 253 ...
## $ CRuns : int 30 321 224 828 48 501 30 41 32 784 ...
## $ CRBI : int 29 414 266 838 46 336 9 37 34 890 ...
## $ CWalks : int 14 375 263 354 33 194 24 12 8 866 ...
## $ League : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
## $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
## $ PutOuts : int 446 632 880 200 805 282 76 121 143 0 ...
## $ Assists : int 33 43 82 11 40 421 127 283 290 0 ...
## $ Errors : int 20 10 14 3 4 25 7 9 19 0 ...
## $ Salary : num NA 475 480 500 91.5 750 70 100 75 1100 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
The data includes 20 variables, most of which fall into two groups: stats from the 1986 season, and career-wide stats. The career stats are marked by a ‘C’ at the start of their names but otherwise measure the same things as their season counterparts: times at bat (AtBat), hits (Hits), home runs (HmRun), runs (Runs), runs batted in (RBI, which includes runs scored by other players as a result of the player’s batting), and walks (Walks).
There are also three fielding statistics, which have no career-wide counterpart: put-outs (PutOuts), assists (Assists), and errors (Errors). A put-out is credited when a fielder’s action directly gets a runner out, such as by tagging him, catching a ball before it hits the ground, or stepping on base with the ball. An assist is an action that helps achieve a put-out, such as throwing the ball to the player who records it. An error is a fielding mistake that lets a batter or runner reach or advance a base he otherwise would not have.
Every variable discussed so far is an integer count. There are two more numeric variables: annual salary (Salary), which appears to be measured in thousands of dollars, and years played (Years) as of 1986.
There are three factor variables. League is either A (American League) or N (National League) and records the player’s league in 1986; NewLeague takes the same values and records his league at the start of 1987. Division is either E (East) or W (West).
It is best to check whether the dataset has any missing values.
sum(is.na(hitters)) # Counts missing (NA) cells across the whole data frame
## [1] 59
There are 59 missing values in the dataset. It is best to remove the rows that contain them so they don’t cause problems with the analysis. That leaves 263 of the 322 rows to work with, which means the 59 missing values fall in 59 different rows.
hitters <- na.omit(hitters)
sum(is.na(hitters))
## [1] 0
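Because sum(is.na(...)) counts cells rather than rows, a per-column and per-row breakdown is a useful cross-check. A minimal sketch on a made-up data frame (the name `toy` and its values are illustrative only, not taken from the dataset):

```r
# Toy data frame standing in for `hitters` (values are made up).
toy <- data.frame(
  Hits   = c(66, 81, NA, 141),
  Salary = c(NA, 475, NA, 500)
)

na_cells  <- sum(is.na(toy))            # total number of missing cells
na_by_col <- colSums(is.na(toy))        # missing cells in each column
na_rows   <- sum(!complete.cases(toy))  # rows with at least one missing value

print(na_cells)
print(na_by_col)
print(na_rows)
```

Running colSums(is.na(Hitters)) on the original data would show which columns hold the 59 missing values.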
Now that the data is cleared of missing values, it can be visualized. However, there are too many variables to visualize all at once, so we limit the visual analysis to a few key variables. We’ll focus on the yearly stats, as career stats don’t tell us much about a player’s yearly performance unless we adjust for years played. Fielding statistics are also probably less valuable in this context, because they are heavily influenced by the position someone plays in the field and require a different skillset than batting.
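If the career stats were wanted later, one way to adjust them for years played is to convert totals into rates. A minimal sketch using the first three rows shown in the str() output above (the new column names CHitsPerYear and CareerAvg are made up for illustration):

```r
# First three players from the str() output: Years, CHits, CAtBat.
toy <- data.frame(
  Years  = c(1, 14, 3),
  CHits  = c(66, 835, 457),
  CAtBat = c(293, 3449, 1624)
)

# Career hits per year played: comparable across short and long careers.
toy$CHitsPerYear <- toy$CHits / toy$Years

# Career batting average: hits per at-bat over the whole career.
toy$CareerAvg <- toy$CHits / toy$CAtBat

print(toy)
```

The same two lines would apply to the full `hitters` data frame, since Years, CHits, and CAtBat are all columns in it.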
hitters_limited <- hitters %>% select(c(HmRun, Hits, RBI, Salary, League, Years))
hitters_limited_numeric <- hitters_limited %>% select(-League) #For correlations/plots
plot(hitters_limited_numeric)
cor(hitters_limited_numeric)
## HmRun Hits RBI Salary Years
## HmRun 1.0000000 0.53062736 0.8491074 0.3430281 0.11348842
## Hits 0.5306274 1.00000000 0.7884782 0.4386747 0.01859809
## RBI 0.8491074 0.78847819 1.0000000 0.4494571 0.12966795
## Salary 0.3430281 0.43867474 0.4494571 1.0000000 0.40065699
## Years 0.1134884 0.01859809 0.1296679 0.4006570 1.00000000
The correlation matrix and plots show a strong association between hits and RBI (0.79) and between home runs and RBI (0.85), but, interestingly, a weaker one between home runs and hits (0.53). Salary correlates most strongly with RBI (0.45) and least with home runs (0.34). Players with more years under their belt tend to have slightly better batting stats than players with fewer years, but the pattern is weak (correlations of 0.13 or less). It could be due to experience, but it is also possible that players who are already good tend to stay in the league longer.
There are some distributions worth looking at in this data. The distribution of salary is concentrated in the lowest salary ranges, with a long tail extending into very high salaries. The distributions of RBI and Hits appear much closer to normal, with both left and right tails.
hist(hitters_limited_numeric$Salary, breaks=50, xlab="Salary (in thousands of $)",
main = "Salary Distribution of MLB Players (1986)")
hist(hitters_limited_numeric$RBI,breaks=25,xlab="RBI", main="1986 RBI distribution among MLB players")
hist(hitters_limited_numeric$Hits,breaks=25,xlab="Hits", main="1986 Hit distribution among MLB players")
Someone looking at baseball statistics might be interested in how the two leagues that make up MLB compare in their average batting stats. To analyze this, we can group the data by league and compare the means.
league_batting <- hitters_limited %>% group_by(League) %>% summarize(Avg_HmRn = mean(HmRun),
Avg_Hit = mean(Hits),
Avg_RBI = mean(RBI),
Avg_Sal = mean(Salary))
ggplot(data = league_batting, aes(x=League, y=Avg_RBI))+
geom_bar(stat="identity")+
labs(y="Average RBI", title="Average RBI by League in MLB")
ggplot(data = league_batting, aes(x=League, y=Avg_Sal))+
geom_bar(stat="identity")+
labs(y="Average Salary", title="Average Salary by League in MLB")
ggplot(data = league_batting, aes(x=League, y=Avg_Hit))+
geom_bar(stat="identity")+
labs(y="Average Hits", title="Average Hits by League in MLB")
ggplot(data = league_batting, aes(x=League, y=Avg_HmRn))+
geom_bar(stat="identity")+
labs(y="Average Home Runs", title="Average Home Runs by League in MLB")
The American League came out ahead of the National League in 1986 on
the selected batting stats. The American League also has a marginally
higher average salary, but judging by the bar plots, salary is the most
evenly matched of the variables across leagues.
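The comparison above uses per-player means of counting stats. The batting average in the baseball sense is hits divided by at-bats, and it can be computed per league by pooling both columns first. A hedged sketch on made-up rows that reuse the dataset's column names (the values, and the BattingAvg column name, are assumptions for illustration):

```r
# Made-up rows with the same column names as `hitters`.
toy <- data.frame(
  League = factor(c("A", "A", "N", "N")),
  AtBat  = c(400, 600, 500, 500),
  Hits   = c(100, 180, 125, 150)
)

# Pool hits and at-bats within each league, then divide the totals.
totals <- aggregate(cbind(Hits, AtBat) ~ League, data = toy, FUN = sum)
totals$BattingAvg <- totals$Hits / totals$AtBat

print(totals)
```

Dividing the pooled totals weights each player by his at-bats, which differs from averaging each player's individual batting average.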
Now that the data has been visualized, we can use the trends we’ve seen to design a model. The dependent variable should probably be salary, because it is generally assumed that a player’s salary is determined by their skill across many statistics. It is also worth knowing what the market desires in a player. We can start with a linear model that uses all of the other variables as predictors.
mod <- lm(Salary~.,data=hitters)
summary(mod)
##
## Call:
## lm(formula = Salary ~ ., data = hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -907.62 -178.35 -31.11 139.09 1877.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 163.10359 90.77854 1.797 0.073622 .
## AtBat -1.97987 0.63398 -3.123 0.002008 **
## Hits 7.50077 2.37753 3.155 0.001808 **
## HmRun 4.33088 6.20145 0.698 0.485616
## Runs -2.37621 2.98076 -0.797 0.426122
## RBI -1.04496 2.60088 -0.402 0.688204
## Walks 6.23129 1.82850 3.408 0.000766 ***
## Years -3.48905 12.41219 -0.281 0.778874
## CAtBat -0.17134 0.13524 -1.267 0.206380
## CHits 0.13399 0.67455 0.199 0.842713
## CHmRun -0.17286 1.61724 -0.107 0.914967
## CRuns 1.45430 0.75046 1.938 0.053795 .
## CRBI 0.80771 0.69262 1.166 0.244691
## CWalks -0.81157 0.32808 -2.474 0.014057 *
## LeagueN 62.59942 79.26140 0.790 0.430424
## DivisionW -116.84925 40.36695 -2.895 0.004141 **
## PutOuts 0.28189 0.07744 3.640 0.000333 ***
## Assists 0.37107 0.22120 1.678 0.094723 .
## Errors -3.36076 4.39163 -0.765 0.444857
## NewLeagueN -24.76233 79.00263 -0.313 0.754218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 315.6 on 243 degrees of freedom
## Multiple R-squared: 0.5461, Adjusted R-squared: 0.5106
## F-statistic: 15.39 on 19 and 243 DF, p-value: < 2.2e-16
The fit is moderate, but not particularly good: the multiple R-squared is ~0.55 (adjusted ~0.51). From here we can build a more specific model using only the statistically significant variables. Adding one of the near-significant (p<0.1) variables (CRuns) helped reduce the loss of fit in the smaller model, and CRuns became significant in the new model. However, including both CRuns and Assists only marginally improved the fit, and Assists was not statistically significant in the final model, so I decided to include only CRuns.
mod2 <- lm(Salary~PutOuts+Division+CWalks+Walks+Hits+AtBat+CRuns,data=hitters)
summary(mod2)
##
## Call:
## lm(formula = Salary ~ PutOuts + Division + CWalks + Walks + Hits +
## AtBat + CRuns, data = hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -814.05 -168.10 -30.04 128.00 2026.92
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 105.64875 65.80221 1.606 0.109610
## PutOuts 0.30288 0.07464 4.058 6.58e-05 ***
## DivisionW -116.16922 39.71346 -2.925 0.003753 **
## CWalks -0.71633 0.26545 -2.699 0.007429 **
## Walks 6.05587 1.53602 3.943 0.000104 ***
## Hits 6.75749 1.67372 4.037 7.15e-05 ***
## AtBat -1.97628 0.52935 -3.733 0.000233 ***
## CRuns 1.12931 0.19950 5.661 4.05e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 318.9 on 255 degrees of freedom
## Multiple R-squared: 0.5136, Adjusted R-squared: 0.5003
## F-statistic: 38.47 on 7 and 255 DF, p-value: < 2.2e-16
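A reduced model can also be compared with the full model it was cut down from using a partial F-test, which is what anova() performs on two nested lm fits. The sketch below uses simulated data rather than the Hitters data; the names `sim`, `full`, `reduced`, and the variable `noise` are all made up for illustration:

```r
set.seed(1)
n <- 200

# Simulated data: y depends on x1 and x2 but not on `noise`.
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- 2 + 3 * sim$x1 - 1.5 * sim$x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + noise, data = sim)
reduced <- lm(y ~ x1 + x2, data = sim)

# Partial F-test: does dropping `noise` significantly worsen the fit?
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared penalises extra predictors, so it is the fairer
# figure to compare between models of different sizes.
adj_r2 <- c(reduced = summary(reduced)$adj.r.squared,
            full    = summary(full)$adj.r.squared)
print(adj_r2)
```

For the models in this analysis the analogous call would be anova(mod2, mod), which is valid here because both are fitted to the same 263 rows.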
plot(mod2)
The residual plots show slight signs of nonlinearity and
heteroscedasticity. There are also some outliers. After some trial
and error, the best solution seemed to be transforming the response
variable with the square root function. Taking the square root of both
the response and the numeric predictors also reduced the nonlinearity
and gave a slightly higher R-squared, but did not address the
heteroscedasticity as well. Three outliers (Mike Schmidt, Pete Rose,
and Steve Sax) were also removed.
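One common way to flag candidate outliers is Cook's distance, which measures how much each observation shifts the fitted coefficients. This is a sketch on simulated data, not necessarily how the points were chosen in this analysis; the 4/n cutoff is a rule of thumb and an assumption here:

```r
set.seed(42)
n <- 100

# Simulated linear data.
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n)

# Plant one extreme observation so there is something to flag.
sim$x[1] <- 6
sim$y[1] <- -20

fit <- lm(y ~ x, data = sim)

# Cook's distance for every observation in the fit.
cd <- cooks.distance(fit)

# Indices of observations above the 4/n rule-of-thumb threshold.
flagged <- which(cd > 4 / n)
print(flagged)
```

Flagged points are candidates for inspection rather than automatic removal; the fourth diagnostic plot from plot() on an lm fit shows related leverage information.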
hitters <- hitters %>% filter(!(row.names(hitters) %in% c("-Mike Schmidt", "-Pete Rose", "-Steve Sax")))
mod3 <- lm(sqrt(Salary)~PutOuts+Division+CWalks+Walks+Hits+AtBat+CRuns,data=hitters)
#mod3 <- lm(sqrt(Salary)~sqrt(PutOuts)+Division+sqrt(CWalks)+sqrt(Walks)+sqrt(Hits)+sqrt(AtBat)+sqrt(CRuns),data=hitters)
plot(mod3)
hist(mod3$residuals)
summary(mod3)
##
## Call:
## lm(formula = sqrt(Salary) ~ PutOuts + Division + CWalks + Walks +
## Hits + AtBat + CRuns, data = hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.3019 -3.5287 -0.3366 3.4741 18.1739
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.881143 1.181919 8.360 4.28e-15 ***
## PutOuts 0.005150 0.001322 3.897 0.000125 ***
## DivisionW -1.319860 0.705478 -1.871 0.062522 .
## CWalks -0.011071 0.004682 -2.365 0.018800 *
## Walks 0.098050 0.027106 3.617 0.000360 ***
## Hits 0.139520 0.030089 4.637 5.68e-06 ***
## AtBat -0.034682 0.009441 -3.673 0.000292 ***
## CRuns 0.023259 0.003558 6.537 3.46e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.616 on 252 degrees of freedom
## Multiple R-squared: 0.6206, Adjusted R-squared: 0.6101
## F-statistic: 58.9 on 7 and 252 DF, p-value: < 2.2e-16
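The new model has an R-squared of ~0.62, up from ~0.51 for the previous model, while also showing promising normality, linearity, and homoscedasticity. The two figures are not directly comparable, though: the response is now sqrt(Salary) rather than Salary, and three outliers have been removed.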
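Since a model of the square root of the response predicts on the square-root scale, its predictions have to be squared to read them in the original units. A minimal sketch on simulated data (the names `sim`, `fit_sqrt`, and `pred_y` are made up for illustration):

```r
set.seed(7)
n <- 150

# Simulate a positive response whose square root is linear in x.
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- (5 + 1.5 * sim$x + rnorm(n))^2

fit_sqrt <- lm(sqrt(y) ~ x, data = sim)

# predict() returns values on the sqrt scale; square them to get
# back to the units of y.
pred_y <- predict(fit_sqrt, newdata = sim)^2

# Squared correlation between observed and back-transformed values:
# a rough analogue of R-squared on the original scale.
r2_original_scale <- cor(sim$y, pred_y)^2
print(r2_original_scale)
```

Squaring the prediction tends to understate the conditional mean somewhat, because the mean of a squared quantity exceeds the square of its mean, so the back-transformed values are best read as approximate.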
The visualization and modeling of the dataset revealed several interesting trends. I think the most interesting takeaway was the right-skewed distribution of salaries. It may be a sign that baseball follows the Pareto principle: a small minority of players account for a disproportionately large share of a team’s value, which is reflected in their salaries.
The second thing I found interesting was the relatively low significance of home runs. I expected them to be very predictive of salary. One possible reason walks were more significant in the initial model is that pitchers tend to be much more hesitant to throw in the strike zone when a good batter steps up to the plate, which results in more balls and, subsequently, more walks.
Because this dataset is rather old and includes only a small portion of the massive number of available baseball statistics, its results are worth taking with a grain of salt. Certain findings, especially trends between leagues and divisions, may no longer hold. The way players are valued and the way the game is played are very different from 1986, and the way salaries are decided in the post-Moneyball era is probably quite different from the way they were decided in the 1980s.