Moneyball
Load the Data
load(url("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/mlb11.RData"))
Question 1
Compare 2 numeric variables with a scatter plot
Question 2
plot(mlb11$at_bats, mlb11$runs)
We can see a weak linear relationship in the scatter plot.
Quantifying the strength of a linear relationship
correlation is measured by the function:
cor(var, var)
We use this to measure the degree of linear correlation between at bats and runs.
cor(mlb11$at_bats, mlb11$runs)
## [1] 0.6106
Question 3
This is a moderately strong relationship. The team with 5520 at bats is an outlier not because of the number of at bats but because of the number of runs it has scored compared to the other teams that have roughly similar numbers of at bats.
Estimating the Best Fit
Just as the mean is the best summary of a single numerical variable, the line of best fit is the best summary of the relationship between two numerical variables. The line is also known as the “regression line”.
We use the 'plot_ss' function to draw an estimated regression line. Two points that give a bad fit are given to get us started:
x1 = 5400
y1 = 750
x2 = 5700
y2 = 650
plot_ss(x = mlb11$at_bats, y = mlb11$runs, x1, y1, x2, y2, showSquares = TRUE)
##
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 2550.000 -0.333
##
## Sum of Squares: 302572
This gives a really poor fit, obviously. See how the deviations from the line by individual data points is measured by a blue line from the data point to the line. This line is called the 'error', i.e., how far off our predition is from the actual data. It is the two variable analogy to the deviations from the mean in the single variable case.
Notice how these blue lines form the side of a square made up of the blue line and three yellow lines. This is the basis for the 'squared error' term that is so important in measuring the goodness of fit, primarily through the R-squared term that will be introduced later.
I will take a crack at guessing at a good regression line myself.
x1 = 5450
y1 = 600
x2 = 5600
y2 = 700
plot_ss(x = mlb11$at_bats, y = mlb11$runs, x1, y1, x2, y2, showSquares = "TRUE")
##
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -3033.333 0.667
##
## Sum of Squares: 183638
A better fit but still one that undershoots the mark considerably. The squared error of the data points below my line are considerably less in volume than the squared errors of the points above my line.
The Best Fit
Now we let the computer do it by taking out the coordinates and adding the argument leastSquares = “TRUE”
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = "TRUE", leastSquares = "TRUE")
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## -2789.243 0.631
##
## Sum of Squares: 123722
Bacon!!!
The Linear Model
Now we use the linear model to estimate the parameters of the line of best fit and save them to a new variable called m1.
m1 = lm(runs ~ at_bats, mlb11)
m1
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Coefficients:
## (Intercept) at_bats
## -2789.243 0.631
Understanding the Output
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.6 -47.0 -16.6 54.4 176.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.243 853.696 -3.27 0.00287 **
## at_bats 0.631 0.155 4.08 0.00034 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared: 0.373, Adjusted R-squared: 0.35
## F-statistic: 16.6 on 1 and 28 DF, p-value: 0.000339
Breaking it Down
Now we will look at how much of the variations in runs scored can be explained by the number of home runs a team scores. We will do this with the lm() and summary() functions as before, but this time we will do it all in one line of code by passing the result of lm() to summary().
summary(lm(mlb11$runs ~ mlb11$homeruns))
##
## Call:
## lm(formula = mlb11$runs ~ mlb11$homeruns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.61 -33.41 3.23 24.29 104.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.239 41.678 9.96 1.0e-10 ***
## mlb11$homeruns 1.835 0.268 6.85 1.9e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared: 0.627, Adjusted R-squared: 0.613
## F-statistic: 47 on 1 and 28 DF, p-value: 1.9e-07
Question 4
Refiting the model as before:
summary(lm(mlb11$runs ~ mlb11$homeruns))
##
## Call:
## lm(formula = mlb11$runs ~ mlb11$homeruns)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.61 -33.41 3.23 24.29 104.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.239 41.678 9.96 1.0e-10 ***
## mlb11$homeruns 1.835 0.268 6.85 1.9e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared: 0.627, Adjusted R-squared: 0.613
## F-statistic: 47 on 1 and 28 DF, p-value: 1.9e-07
According to the model, for every home run a team is predicted to score an additional 1.83 runs on average.
Prediction and Prediction Errors
Now run a regression on a scatter plot with a prediction (regression or line of best fit) line superimpossed.
# plot(at_bats, runs, data = mlb11) # It is giving me the residuals as the
# dependent variable, why?
# scatterplot(at_bats, runs, data=mlb11) # Now I am just gettting all kinds
# of things. I must have attached something to these variables that I am not
# aware of.
# plot(m1) #This gives me something unexpected, I think it is a plot of the
# standardized residuals?
# abline(m1)
rm(m1) #get rid of any previous versions of m1 that might be lurking in the environment.
# head(mlb11) plot(at_bats, runs, mlb11) #Error, 'at_bats' not found
plot(mlb11$at_bats, mlb11$runs)
m1 <- lm(runs ~ at_bats, data = mlb11)
m1
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Coefficients:
## (Intercept) at_bats
## -2789.243 0.631
abline(m1)
Question 5
To make a prediction with the regression line just plug in the value of the independent variable (i.e, at bats) you are interested in making a prediction with into the model.
-2789.2429 + 5579 * 0.6305
## [1] 728.3
The model predicts a team with 5579 at bats will produce, on average, 728 runs.
In the actual data the team that had 5579 at had 15 fewer runs than predicted by the model so the residual is -15.32. Here is one way to find that number:
residuals <- residuals(m1)
mlb11$at_bats
## [1] 5659 5710 5563 5672 5532 5600 5518 5447 5544 5598 5585 5436 5549 5612
## [15] 5513 5579 5502 5509 5421 5559 5487 5508 5421 5452 5436 5528 5441 5486
## [29] 5417 5421
Question 6
Check that the assumptions of regression analysis are met.
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)
Because the residuals are roughly evenly spread on both sides and exhibit no curved pattern we can conclude that the data are linear.
Question 7
Nearly normal residuals
To check for this we create a normal probability plot and add a diagonal line.
qqnorm(m1$residuals)
qqline(m1$residuals)
# here is the histogram
hist(m1$residuals)
The residuals are nearly normal, by inspection.
Question 8
The same plot we used for question 6 shows that the constant variability condition is also me. Question 9
Now we want to see how the other five traditional variables do at predicting runs. We run five models, one each for
homeruns
hits
wins
batting average
m1 <- lm(runs ~ at_bats, data = mlb11)
m2 <- lm(runs ~ homeruns, data = mlb11)
m3 <- lm(runs ~ hits, data = mlb11)
m4 <- lm(runs ~ wins, data = mlb11)
m5 <- lm(runs ~ bat_avg, data = mlb11)
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.6 -47.0 -16.6 54.4 176.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.243 853.696 -3.27 0.00287 **
## at_bats 0.631 0.155 4.08 0.00034 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared: 0.373, Adjusted R-squared: 0.35
## F-statistic: 16.6 on 1 and 28 DF, p-value: 0.000339
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.61 -33.41 3.23 24.29 104.63
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.239 41.678 9.96 1.0e-10 ***
## homeruns 1.835 0.268 6.85 1.9e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared: 0.627, Adjusted R-squared: 0.613
## F-statistic: 47 on 1 and 28 DF, p-value: 1.9e-07
summary(m3)
##
## Call:
## lm(formula = runs ~ hits, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.72 -27.18 -5.23 19.32 140.69
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.560 151.181 -2.48 0.019 *
## hits 0.759 0.107 7.09 1e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.2 on 28 degrees of freedom
## Multiple R-squared: 0.642, Adjusted R-squared: 0.629
## F-statistic: 50.2 on 1 and 28 DF, p-value: 1.04e-07
summary(m4)
##
## Call:
## lm(formula = runs ~ wins, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.45 -47.51 -7.48 47.35 142.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 342.12 89.22 3.83 0.00065 ***
## wins 4.34 1.09 3.98 0.00045 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.1 on 28 degrees of freedom
## Multiple R-squared: 0.361, Adjusted R-squared: 0.338
## F-statistic: 15.8 on 1 and 28 DF, p-value: 0.000447
summary(m5)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.7 -26.3 -5.5 28.5 131.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -643 183 -3.51 0.0015 **
## bat_avg 5242 717 7.31 5.9e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.2 on 28 degrees of freedom
## Multiple R-squared: 0.656, Adjusted R-squared: 0.644
## F-statistic: 53.4 on 1 and 28 DF, p-value: 5.88e-08
The answer comes from looking at the R-squared term for each model. Batting average explains over 64% of the variation in runs.
Question 10
Now look at three new variables. new_obs new_onbase new_slug and compare them to the more traditional indicators.
m6 <- lm(runs ~ new_obs, data = mlb11)
m7 <- lm(runs ~ new_slug, data = mlb11)
m8 <- lm(runs ~ new_onbase, data = mlb11)
summary(m6)
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.46 -13.69 1.16 13.94 41.16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.6 68.9 -9.96 1e-10 ***
## new_obs 1919.4 95.7 20.06 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.4 on 28 degrees of freedom
## Multiple R-squared: 0.935, Adjusted R-squared: 0.933
## F-statistic: 402 on 1 and 28 DF, p-value: <2e-16
summary(m7)
##
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.41 -18.66 -0.91 16.29 52.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.8 68.7 -5.47 7.7e-06 ***
## new_slug 2681.3 171.8 15.60 2.4e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27 on 28 degrees of freedom
## Multiple R-squared: 0.897, Adjusted R-squared: 0.893
## F-statistic: 244 on 1 and 28 DF, p-value: 2.42e-15
summary(m8)
##
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.27 -18.33 3.25 19.52 69.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1118 144 -7.74 2.0e-08 ***
## new_onbase 5654 450 12.55 5.1e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.6 on 28 degrees of freedom
## Multiple R-squared: 0.849, Adjusted R-squared: 0.844
## F-statistic: 158 on 1 and 28 DF, p-value: 5.12e-13
Wow! The on base plus slugging measure explains just about everything.