Part 1
The movie “Moneyball” focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.
In this exercise we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.
Let’s load up the data for the 2011 season (and load up mosaic while we’re at it!).
In addition to runs scored, there are seven traditionally used variables in the data set: at-bats, hits, home runs, batting average, strikeouts, stolen bases, and wins. There are also three newer variables: on-base percentage, slugging percentage, and on-base plus slugging. For the first portion of the analysis we’ll consider the seven traditional variables. At the end of the lab, you’ll work with the newer variables on your own.
What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?
SOLUTION:
If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.
SOLUTION:
## [1] 0.610627
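As a quick consistency check, in a simple linear regression the squared correlation equals the model's R\(^2\). The minimal sketch below uses only the correlation printed above and compares it with the R\(^2\) reported for the at_bats model later in this lab (0.3729); the variable names are illustrative.

```r
# Correlation between runs and at_bats, copied from the output above
r <- 0.610627

# In simple linear regression, R-squared is the square of the correlation
r_squared <- r^2
round(r_squared, 4)
```

The rounded value agrees with the "Multiple R-squared" line of the at_bats summary.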
Think back to the way that we described the distribution of a single
variable. Recall that we discussed characteristics such as center,
spread, and shape. It’s also useful to be able to describe the
relationship between two quantitative variables, such as
runs and at_bats above.
SOLUTION: There seems to be a moderate positive linear relationship between the two variables (the correlation is about 0.61).
Just as we used the mean and standard deviation to summarize a single variable, we can summarize the relationship between these two variables by finding the line that best represents their association. Use the following interactive function to select the line that you think does the best job of going through the cloud of points.
After running this command, you’ll be prompted to click two points on the plot to define a line. Once you’ve done that, the line you specified will be shown in black and the residuals in blue. What are residuals?
The most common way to do linear regression is to select the line that minimizes the sum of squared residuals.
Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?
YOU CAN SKIP THIS PROBLEM
It is rather cumbersome to try to get the correct least squares line,
i.e. the line that minimizes the sum of squared residuals, through trial
and error. Instead we can use the lm function in R to fit
the linear model (a.k.a. regression line).
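To make the idea concrete, here is a minimal sketch that compares a hand-picked line with the least squares line from lm. It runs on simulated data, not on mlb11: x, y, the helper sum_sq_resid, and the guessed coefficients are all made up for illustration.

```r
# Simulated data standing in for a predictor and a response (not mlb11)
set.seed(1)
x <- runif(30, min = 5400, max = 5700)
y <- -2800 + 0.63 * x + rnorm(30, sd = 60)

# Sum of squared residuals for any candidate line y = intercept + slope * x
sum_sq_resid <- function(intercept, slope) {
  sum((y - (intercept + slope * x))^2)
}

# A line chosen by eye (hypothetical guess)
ss_guess <- sum_sq_resid(intercept = -2500, slope = 0.58)

# The least squares line found by lm
fit <- lm(y ~ x)
ss_lm <- sum_sq_resid(coef(fit)[1], coef(fit)[2])

# The lm line never has a larger sum of squares than a guessed line
c(guess = ss_guess, least_squares = ss_lm)
```

Whatever intercept and slope are guessed, the sum of squares from lm is the smallest achievable for a straight line on these data.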
The output of lm is an object that contains all of the
information we need about the linear model that was just fit. We can
access this information using the summary() function.
With this table, what is the least squares regression line?
SOLUTION:
\[\widehat{runs} = -2789.24 + 0.63 \times at\_bats\]
Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?
SOLUTION:
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
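The coefficient table above gives everything needed to write the fitted line. The sketch below copies the estimates from that output; predict_runs is a hypothetical helper name, and the 150-home-run input is just an example value.

```r
# Estimates copied from the summary output above
intercept <- 415.2389
slope     <- 1.8345   # extra runs associated with one more home run
slope_se  <- 0.2677   # standard error of the slope

# Fitted line: predicted runs = 415.2389 + 1.8345 * homeruns
predict_runs <- function(homeruns) intercept + slope * homeruns

predict_runs(150)   # prediction for a hypothetical 150-home-run team
slope / slope_se    # reproduces the t value column, up to rounding
```

Read this way, the slope says that each additional home run is associated with about 1.83 more runs over the season, on average.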
Let’s create a scatterplot with the least squares line laid on top.
This line can be used to predict \(y\) at any value of \(x\). When predictions are made for values of \(x\) that are beyond the range of the observed data, it is referred to as extrapolation and is not usually recommended. However, predictions made within the range of the data are more reliable. They’re also used to compute the residuals.
SOLUTION:
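Calculate the residual: using the coefficients from the at_bats model, the line predicts about 728 runs for a team with 5,579 at-bats (-2789.24 + 0.6305 × 5579 ≈ 728.3). The Phillies had 5,579 at-bats and scored 713 runs, so the line overestimates by about 15 runs; the residual is roughly -15.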
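The prediction and residual can be reproduced with a few lines of arithmetic. This sketch uses the rounded coefficients printed in the at_bats summary in this lab, so the result may differ slightly from a prediction made with the full-precision fit; the variable names are illustrative.

```r
# Coefficients from the at_bats model summary shown in this lab
intercept <- -2789.2429
slope     <- 0.6305

at_bats  <- 5579   # Phillies' at-bats, from the text above
observed <- 713    # Phillies' actual runs, from the text above

predicted <- intercept + slope * at_bats
residual  <- observed - predicted   # observed minus predicted

round(c(predicted = predicted, residual = residual), 1)
```

A negative residual means the observed value lies below the line, that is, the model overestimated.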
To assess whether the linear model is reliable, we need to check for Linearity, Independence, Normal errors, and Equal Variance.
Based on this, does the equal variance condition appear to be met?
SOLUTION:
SOLUTION: No, the residuals do not appear to be normally distributed; their quantiles do not line up with those of a normal distribution.
SOLUTION: There seem to be some outliers away from the rest of the data.
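The diagnostic checks discussed above can be sketched with base R graphics. The example runs on simulated data (not mlb11), so the plots will not match the ones from the lab; it only illustrates which plot goes with which condition.

```r
# Simulated data (not mlb11), used only to illustrate the diagnostic plots
set.seed(2)
x <- runif(30, 5400, 5700)
y <- -2800 + 0.63 * x + rnorm(30, sd = 60)
fit <- lm(y ~ x)

# Linearity and equal variance: residuals vs. fitted values should show
# no pattern and roughly constant spread around zero
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal errors: histogram and normal Q-Q plot of the residuals
hist(resid(fit), main = "Residuals")
qqnorm(resid(fit))
qqline(resid(fit))
```

Points that fall close to the reference line in the Q-Q plot suggest approximately normal residuals; systematic curvature or isolated points far from the line suggest otherwise.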
YOU CAN SKIP THIS PROBLEM
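# Simulate 20 points with a linear trend, then force one outlier
set.seed(15)
a <- sort(runif(20))
b <- a * 5 + rnorm(20)
b[10] <- 10  # replace the 10th response with an outlying value
# Scatterplot with the least squares line ("r"); xyplot comes from lattice, which mosaic attaches
xyplot(b ~ a, pch = 16, type = c("p", "r"))
SOLUTION: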
Part 2
Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?
SOLUTION: At a glance there seems to be a strong linear relationship.
##
## Call:
## lm(formula = runs ~ hits, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.718 -27.179 -5.233 19.322 140.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.5600 151.1806 -2.484 0.0192 *
## hits 0.7589 0.1071 7.085 1.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared: 0.6419, Adjusted R-squared: 0.6292
## F-statistic: 50.2 on 1 and 28 DF, p-value: 1.043e-07
How does this relationship compare to the relationship between runs and at_bats? Use the R\(^2\) values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?
SOLUTION: The relationship between runs and at_bats appears weaker than the relationship between runs and hits.
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
R\(^2\) measures the proportion of the variance in the response variable that is explained by the predictor. It is higher for the hits model (0.6419) than for the at_bats model (0.3729), so hits is the better predictor of runs.
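For reference, R\(^2\) can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks this on simulated data (not mlb11); all variable names are illustrative.

```r
# Simulated data (not mlb11), for illustration only
set.seed(3)
x <- rnorm(30, mean = 1400, sd = 60)
y <- -375 + 0.76 * x + rnorm(30, sd = 50)
fit <- lm(y ~ x)

ss_resid <- sum(resid(fit)^2)       # variation left unexplained by the line
ss_total <- sum((y - mean(y))^2)    # total variation in the response
r2_by_hand <- 1 - ss_resid / ss_total

# All three agree: the hand calculation, summary(), and the squared correlation
c(by_hand      = r2_by_hand,
  from_summary = summary(fit)$r.squared,
  cor_squared  = cor(x, y)^2)
```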
Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).
SOLUTION: R\(^2\) is highest for the bat_avg model (0.6561).
#m_homeruns <- lm(runs ~ homeruns, data = mlb11)
#summary(m_homeruns)
m_batavg <- lm(runs ~ bat_avg, data = mlb11)
summary(m_batavg)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
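Rather than fitting and inspecting each model one at a time, the comparison can be automated. The sketch below is one possible approach, not the method used in this lab: r_squared_by_predictor is a hypothetical helper, and it is demonstrated on a simulated data frame (toy) because mlb11 is not loaded here.

```r
# Return R-squared for a one-predictor model of `response` on each predictor
r_squared_by_predictor <- function(data, response, predictors) {
  sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$r.squared
  })
}

# Toy data frame standing in for mlb11 (values are simulated)
set.seed(4)
toy <- data.frame(a = rnorm(30), b = rnorm(30))
toy$runs <- 700 + 40 * toy$a + rnorm(30, sd = 20)

# Rank the candidate predictors by R-squared
r2 <- r_squared_by_predictor(toy, "runs", c("a", "b"))
sort(r2, decreasing = TRUE)
```

With the lab's data loaded, the same helper could be called as r_squared_by_predictor(mlb11, "runs", c("homeruns", "bat_avg", "strikeouts", "stolen_bases", "wins")).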
#m_strikeouts <- lm(runs ~ strikeouts, data = mlb11)
#summary(m_strikeouts)
#m_stolenbases <- lm(runs ~ stolen_bases, data = mlb11)
#summary(m_stolenbases)
#m_wins <- lm(runs ~ wins, data = mlb11)
#summary(m_wins)
Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?
##
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.270 -18.335 3.249 19.520 69.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1118.4 144.5 -7.741 1.97e-08 ***
## new_onbase 5654.3 450.5 12.552 5.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared: 0.8491, Adjusted R-squared: 0.8437
## F-statistic: 157.6 on 1 and 28 DF, p-value: 5.116e-13
##
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.41 -18.66 -0.91 16.29 52.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.80 68.71 -5.47 7.70e-06 ***
## new_slug 2681.33 171.83 15.61 2.42e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared: 0.8969, Adjusted R-squared: 0.8932
## F-statistic: 243.5 on 1 and 28 DF, p-value: 2.42e-15
##
## Call:
## lm(formula = runs ~ stolen_bases, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -139.94 -62.87 10.01 38.54 182.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 677.3074 58.9751 11.485 4.17e-12 ***
## stolen_bases 0.1491 0.5211 0.286 0.777
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83.82 on 28 degrees of freedom
## Multiple R-squared: 0.002914, Adjusted R-squared: -0.0327
## F-statistic: 0.08183 on 1 and 28 DF, p-value: 0.7769
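Two of the newer variables, new_onbase and new_slug, are very strongly associated with runs: their R\(^2\) values (0.8491 and 0.8969) exceed those of every traditional variable shown above, including bat_avg (0.6561). Of the models shown here, new_slug is the best predictor. This makes sense, because a team cannot score without getting runners on base. Stolen bases, by contrast, show almost no linear relationship with runs (R\(^2\) = 0.0029, p = 0.78), so the output does not support the idea that stealing more bases leads to more runs.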
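To see the ranking at a glance, the R\(^2\) values reported in the summaries throughout this lab can be collected and sorted. The numbers below are copied from those outputs; the vector name r_squared is illustrative.

```r
# Multiple R-squared values copied from the summaries shown in this lab
r_squared <- c(
  at_bats      = 0.3729,
  homeruns     = 0.6266,
  hits         = 0.6419,
  bat_avg      = 0.6561,
  stolen_bases = 0.002914,
  new_onbase   = 0.8491,
  new_slug     = 0.8969
)

# Rank the predictors from strongest to weakest linear fit
ranked <- sort(r_squared, decreasing = TRUE)
ranked
```

Only the models whose output appears in this lab are included, so the ranking is limited to those seven variables.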