download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
Traditional Variables: at_bats, hits, homeruns, bat_avg, strikeouts, stolen_bases, wins
New Variables: new_onbase, new_slug, new_obs
In general, when analyzing a relationship we should follow the steps below:
# Question 1
Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?
plot(mlb11$bat_avg, mlb11$runs, xlab="bat_avg", ylab="runs", main="Assessing The Correlation Between Batting Averages And Runs")
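A numerical companion to the scatterplot is the correlation coefficient together with the slope of the fitted line. The following is a minimal sketch, not part of the lab itself: `linear_glance` is a hypothetical helper name, and it assumes the data frame passed in has numeric columns with the given names.

```r
# Sketch: summarise and optionally draw the linear trend for one predictor.
# `linear_glance` is a hypothetical helper introduced for illustration.
linear_glance <- function(data, predictor, response = "runs", draw = TRUE) {
  x <- data[[predictor]]
  y <- data[[response]]
  fit <- lm(y ~ x)                     # simple linear regression of y on x
  if (draw) {
    plot(x, y, xlab = predictor, ylab = response)
    abline(fit)                        # overlay the least-squares line
  }
  c(correlation = cor(x, y), slope = unname(coef(fit)[2]))
}

# Usage, assuming mlb11 has been loaded as above:
# linear_glance(mlb11, "bat_avg")
```

A correlation close to 1 or -1 backs up what the eye sees in the plot; a value near 0 suggests the apparent trend is weak.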
The scatterplot of runs against bat_avg shows what appears to be a positive linear relationship.
# Question 2
How does this relationship compare to the relationship between runs and at_bats? Use the \(R^2\) values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?
batavgruns <- lm(runs ~ bat_avg, data=mlb11)
summary(batavgruns)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
atbatruns <- lm(runs ~ at_bats, data=mlb11)
summary(atbatruns)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
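The Multiple R-squared line in each summary is the share of the variability in the response that the fitted line accounts for, that is, 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that calculation follows; `r_squared_by_hand` is a hypothetical helper name, and it assumes a model fitted with `lm` that includes an intercept.

```r
# Sketch: recompute R-squared from its definition, 1 - SS_res / SS_tot.
# `r_squared_by_hand` is a hypothetical helper introduced for illustration.
r_squared_by_hand <- function(fit) {
  y <- model.response(model.frame(fit))  # observed response values
  ss_res <- sum(residuals(fit)^2)        # variability left unexplained
  ss_tot <- sum((y - mean(y))^2)         # total variability in the response
  1 - ss_res / ss_tot
}

# Usage, assuming the two models above have been fitted:
# r_squared_by_hand(batavgruns)
# r_squared_by_hand(atbatruns)
```

The results should agree with `summary(batavgruns)$r.squared` and `summary(atbatruns)$r.squared`, which is a quick way to pull the two values out for a side-by-side comparison.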
The multiple R-squared for batavgruns (runs modeled on bat_avg) is 0.6561, much higher than the 0.3729 for atbatruns (runs modeled on at_bats). Batting average therefore explains about 65.61% of the variability in runs, whereas at-bats explains only 37.29%, so bat_avg predicts runs considerably better than at_bats.
# Question 3
Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods (See Steps 1 through 4 at the beginning of the document) we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five). Check the model diagnostics for the model you select.
Note: You should include all your code here, but comment out the code that you don’t need by placing a # in front of the code line so it is still visible but doesn’t run.
# atbatruns <- lm(runs~at_bats, data=mlb11)
# summary(atbatruns)
# hitruns <- lm(runs~hits, data=mlb11)
# summary(hitruns)
# HRruns <- lm(runs~homeruns, data=mlb11)
# summary(HRruns)
# SOruns <- lm(runs~strikeouts, data=mlb11)
# summary(SOruns)
# SBruns <- lm(runs~stolen_bases, data=mlb11)
# summary(SBruns)
# winruns <- lm(runs~wins, data=mlb11)
# summary(winruns)
plot(mlb11$bat_avg,mlb11$runs,xlab="bat_avg",ylab="runs",main="Assessing The Correlation Between Batting Averages and Runs")
abline(batavgruns)
summary(batavgruns)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
Of the traditional variables, bat_avg is the best predictor of runs, with a multiple R-squared of 0.6561. The scatterplot with the fitted line also shows a clear positive linear relationship.
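The one-at-a-time fits can also be compared in a single pass. The sketch below is one possible way to do it, not the lab's prescribed method: `compare_predictors` is a hypothetical helper name, and it assumes every name in `predictors` is a numeric column of the data frame.

```r
# Sketch: fit response ~ predictor for each candidate and rank by R-squared.
# `compare_predictors` is a hypothetical helper introduced for illustration.
compare_predictors <- function(data, predictors, response = "runs") {
  r2 <- sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$r.squared
  })
  sort(r2, decreasing = TRUE)  # best predictor first
}

# Usage, assuming mlb11 has been loaded as above:
# traditional <- c("at_bats", "hits", "homeruns", "bat_avg",
#                  "strikeouts", "stolen_bases", "wins")
# compare_predictors(mlb11, traditional)
```

Ranking by R-squared only covers the numerical side of the comparison; the scatterplot and residual checks are still needed for whichever variable comes out on top.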
Model Diagnostics (residual analysis) to check assumptions for the model we selected above:
hist(batavgruns$residuals)
The residuals shown in the histogram look approximately normal. The histogram alone cannot confirm the constant variability condition; a plot of the residuals against the fitted values is needed to check it.
# Question 4
Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team’s success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence (See steps 1 through 4 in the beginning of the document). Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?
onbaseruns <- lm(new_onbase~runs, data=mlb11)
summary(onbaseruns)
##
## Call:
## lm(formula = new_onbase ~ runs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0108850 -0.0034976 -0.0004294 0.0023316 0.0108692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.163e-01 8.354e-03 25.89 < 2e-16 ***
## runs 1.502e-04 1.196e-05 12.55 5.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.005314 on 28 degrees of freedom
## Multiple R-squared: 0.8491, Adjusted R-squared: 0.8437
## F-statistic: 157.6 on 1 and 28 DF, p-value: 5.116e-13
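plot(mlb11$new_onbase,mlb11$runs,xlab="new_onbase",ylab="runs",main="Assessing The Correlation Between On-Base Percentage and Runs")
# onbaseruns models new_onbase from runs, so its line does not fit these axes; draw the runs ~ new_onbase line instead
abline(lm(runs~new_onbase, data=mlb11))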
slugruns <- lm(new_slug~runs, data=mlb11)
summary(slugruns)
##
## Call:
## lm(formula = new_slug ~ runs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.016184 -0.005936 0.001474 0.006825 0.017002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.668e-01 1.497e-02 11.14 8.34e-12 ***
## runs 3.345e-04 2.144e-05 15.61 2.42e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.009521 on 28 degrees of freedom
## Multiple R-squared: 0.8969, Adjusted R-squared: 0.8932
## F-statistic: 243.5 on 1 and 28 DF, p-value: 2.42e-15
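plot(mlb11$new_slug,mlb11$runs,xlab="new_slug",ylab="runs",main="Assessing The Correlation Between Slugging Percentage and Runs")
# slugruns models new_slug from runs, so its line does not fit these axes; draw the runs ~ new_slug line instead
abline(lm(runs~new_slug, data=mlb11))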
obsruns <- lm(new_obs~runs, data=mlb11)
summary(obsruns)
##
## Call:
## lm(formula = new_obs ~ runs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.017098 -0.009121 0.001791 0.006924 0.020315
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.812e-01 1.696e-02 22.48 <2e-16 ***
## runs 4.871e-04 2.429e-05 20.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01079 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
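plot(mlb11$new_obs,mlb11$runs,xlab="new_obs",ylab="runs",main="Assessing The Correlation Between OBS and Runs")
# obsruns models new_obs from runs, so its line does not fit these axes; draw the runs ~ new_obs line instead
abline(lm(runs~new_obs, data=mlb11))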
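The three models in this section are written with runs on the right-hand side of the formula (for example `new_obs ~ runs`). In simple linear regression the R-squared is the squared correlation, so it is the same in either direction, but the coefficients are not. The sketch below illustrates this; `fit_both_ways` is a hypothetical helper name, and it assumes both columns are numeric.

```r
# Sketch: fit a simple regression in both directions and compare the results.
# `fit_both_ways` is a hypothetical helper introduced for illustration.
fit_both_ways <- function(data, predictor, response = "runs") {
  forward  <- lm(reformulate(predictor, response = response), data = data)
  reversed <- lm(reformulate(response, response = predictor), data = data)
  c(
    r2_forward     = summary(forward)$r.squared,
    r2_reversed    = summary(reversed)$r.squared,
    slope_forward  = unname(coef(forward)[2]),   # change in response per unit predictor
    slope_reversed = unname(coef(reversed)[2])   # change in predictor per unit response
  )
}

# Usage, assuming mlb11 has been loaded as above:
# fit_both_ways(mlb11, "new_obs")
```

The two R-squared entries should match, which is why the values reported above remain usable for comparing predictors, while a prediction equation for runs needs the forward slope and intercept.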
First, the new metrics are clearly better than the old ones at predicting runs. The best traditional variable, bat_avg, had a multiple R-squared of 0.6561, whereas the new variables have values of 0.8491 (new_onbase), 0.8969 (new_slug), and 0.9349 (new_obs). The models above were fit with runs on the right-hand side of the formula, but in simple linear regression R-squared equals the squared correlation and is the same in either direction, so the comparison still holds; only the reported coefficients are not those for predicting runs. The scatterplots also show strong linear relationships. Of all ten variables, new_obs seems to be the best predictor of runs. I know very little about baseball or the metrics in this dataset, so I can’t say whether this makes sense beyond the clear statistical link shown in the data.
# Question 5
Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.
hist(obsruns$residuals)
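The residuals shown in the histogram look approximately normal. A histogram cannot show whether the constant variability condition is met; that requires a plot of the residuals against the fitted values. Note also that obsruns models new_obs from runs, so these are residuals in new_obs rather than in runs.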
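A histogram speaks mainly to the normality condition. A fuller check might look like the sketch below, which draws residuals against fitted values (for linearity and constant variability), a histogram, and a normal Q-Q plot. `check_residuals` is a hypothetical helper name, and the usage line assumes the model of interest is runs modeled on new_obs.

```r
# Sketch: three standard residual plots for a fitted simple regression.
# `check_residuals` is a hypothetical helper introduced for illustration.
check_residuals <- function(fit) {
  res <- residuals(fit)
  fitted_vals <- fitted(fit)
  op <- par(mfrow = c(1, 3))           # three panels side by side
  on.exit(par(op))                     # restore the previous layout
  plot(fitted_vals, res, xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs fitted")
  abline(h = 0, lty = 2)               # residuals should scatter evenly around 0
  hist(res, main = "Histogram of residuals", xlab = "Residuals")
  qqnorm(res)
  qqline(res)                          # points near the line suggest normality
  invisible(data.frame(fitted = fitted_vals, residual = res))
}

# Usage, assuming mlb11 has been loaded as above:
# check_residuals(lm(runs ~ new_obs, data = mlb11))
```

In the first panel, a fan shape would point to non-constant variability and a curved pattern would point to non-linearity; neither can be read off the histogram.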