Lab 6–Let's Play Moneyball!

Moneyball

Load the Data

load(url("http://s3.amazonaws.com/assets.datacamp.com/course/dasi/mlb11.RData"))

Question 1

Compare 2 numeric variables with a scatter plot

Question 2

plot(mlb11$at_bats, mlb11$runs)

plot of chunk unnamed-chunk-2

We can see a weak linear relationship in the scatter plot.

Quantifying the strength of a linear relationship

correlation is measured by the function:

cor(var, var)

We use this to measure the degree of linear correlation between at bats and runs.

cor(mlb11$at_bats, mlb11$runs)
## [1] 0.6106

Question 3

This is a moderately strong relationship. The team with 5520 at bats is an outlier not because of the number of at bats but because of the number of runs it has scored compared to the other teams that have roughly similar numbers of at bats.

Estimating the Best Fit

Just as the mean is the best summary of a single numerical variable, the line of best fit is the best summary of the relationship between two numerical variables. The line is also known as the “regression line”.

We use the 'plot_ss' function to draw an estimated regression line. Two points that give a bad fit are given to get us started:

x1 = 5400
y1 = 750

x2 = 5700
y2 = 650

plot_ss(x = mlb11$at_bats, y = mlb11$runs, x1, y1, x2, y2, showSquares = TRUE)

plot of chunk unnamed-chunk-4

## 

## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##    2550.000       -0.333  
## 
## Sum of Squares:  302572

This gives a really poor fit, obviously. See how the deviations from the line by individual data points is measured by a blue line from the data point to the line. This line is called the 'error', i.e., how far off our predition is from the actual data. It is the two variable analogy to the deviations from the mean in the single variable case.

Notice how these blue lines form the side of a square made up of the blue line and three yellow lines. This is the basis for the 'squared error' term that is so important in measuring the goodness of fit, primarily through the R-squared term that will be introduced later.

I will take a crack at guessing at a good regression line myself.

x1 = 5450
y1 = 600
x2 = 5600
y2 = 700
plot_ss(x = mlb11$at_bats, y = mlb11$runs, x1, y1, x2, y2, showSquares = "TRUE")

plot of chunk unnamed-chunk-5

## 

## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   -3033.333        0.667  
## 
## Sum of Squares:  183638

A better fit but still one that undershoots the mark considerably. The squared error of the data points below my line are considerably less in volume than the squared errors of the points above my line.

The Best Fit

Now we let the computer do it by taking out the coordinates and adding the argument leastSquares = “TRUE”

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = "TRUE", leastSquares = "TRUE")

plot of chunk unnamed-chunk-6

## 

## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##   -2789.243        0.631  
## 
## Sum of Squares:  123722

Bacon!!!

The Linear Model

Now we use the linear model to estimate the parameters of the line of best fit and save them to a new variable called m1.

m1 = lm(runs ~ at_bats, mlb11)
m1
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Coefficients:
## (Intercept)      at_bats  
##   -2789.243        0.631

Understanding the Output

summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -125.6  -47.0  -16.6   54.4  176.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.243    853.696   -3.27  0.00287 ** 
## at_bats         0.631      0.155    4.08  0.00034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared:  0.373,  Adjusted R-squared:  0.35 
## F-statistic: 16.6 on 1 and 28 DF,  p-value: 0.000339

Breaking it Down

Now we will look at how much of the variations in runs scored can be explained by the number of home runs a team scores. We will do this with the lm() and summary() functions as before, but this time we will do it all in one line of code by passing the result of lm() to summary().

summary(lm(mlb11$runs ~ mlb11$homeruns))
## 
## Call:
## lm(formula = mlb11$runs ~ mlb11$homeruns)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -91.61 -33.41   3.23  24.29 104.63 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     415.239     41.678    9.96  1.0e-10 ***
## mlb11$homeruns    1.835      0.268    6.85  1.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared:  0.627,  Adjusted R-squared:  0.613 
## F-statistic:   47 on 1 and 28 DF,  p-value: 1.9e-07

Question 4

Refiting the model as before:

summary(lm(mlb11$runs ~ mlb11$homeruns))
## 
## Call:
## lm(formula = mlb11$runs ~ mlb11$homeruns)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -91.61 -33.41   3.23  24.29 104.63 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     415.239     41.678    9.96  1.0e-10 ***
## mlb11$homeruns    1.835      0.268    6.85  1.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared:  0.627,  Adjusted R-squared:  0.613 
## F-statistic:   47 on 1 and 28 DF,  p-value: 1.9e-07

According to the model, for every home run a team is predicted to score an additional 1.83 runs on average.

Prediction and Prediction Errors

Now run a regression on a scatter plot with a prediction (regression or line of best fit) line superimpossed.

# plot(at_bats, runs, data = mlb11) # It is giving me the residuals as the
# dependent variable, why?

# scatterplot(at_bats, runs, data=mlb11) # Now I am just gettting all kinds
# of things. I must have attached something to these variables that I am not
# aware of.

# plot(m1) #This gives me something unexpected, I think it is a plot of the
# standardized residuals?

# abline(m1)
rm(m1)  #get rid of any previous versions of m1 that might be lurking in the environment. 
# head(mlb11) plot(at_bats, runs, mlb11) #Error, 'at_bats' not found
plot(mlb11$at_bats, mlb11$runs)
m1 <- lm(runs ~ at_bats, data = mlb11)
m1
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Coefficients:
## (Intercept)      at_bats  
##   -2789.243        0.631
abline(m1)

plot of chunk unnamed-chunk-11

Question 5

To make a prediction with the regression line just plug in the value of the independent variable (i.e, at bats) you are interested in making a prediction with into the model.

-2789.2429 + 5579 * 0.6305
## [1] 728.3

The model predicts a team with 5579 at bats will produce, on average, 728 runs.

In the actual data the team that had 5579 at had 15 fewer runs than predicted by the model so the residual is -15.32. Here is one way to find that number:

residuals <- residuals(m1)
mlb11$at_bats
##  [1] 5659 5710 5563 5672 5532 5600 5518 5447 5544 5598 5585 5436 5549 5612
## [15] 5513 5579 5502 5509 5421 5559 5487 5508 5421 5452 5436 5528 5441 5486
## [29] 5417 5421

Question 6

Check that the assumptions of regression analysis are met.

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)

plot of chunk unnamed-chunk-14

Because the residuals are roughly evenly spread on both sides and exhibit no curved pattern we can conclude that the data are linear.

Question 7

Nearly normal residuals
To check for this we create a normal probability plot and add a diagonal line.

qqnorm(m1$residuals)
qqline(m1$residuals)

plot of chunk unnamed-chunk-15

# here is the histogram
hist(m1$residuals)

plot of chunk unnamed-chunk-15

The residuals are nearly normal, by inspection.

Question 8

The same plot we used for question 6 shows that the constant variability condition is also me. Question 9

Now we want to see how the other five traditional variables do at predicting runs. We run five models, one each for
homeruns
hits
wins
batting average

m1 <- lm(runs ~ at_bats, data = mlb11)
m2 <- lm(runs ~ homeruns, data = mlb11)
m3 <- lm(runs ~ hits, data = mlb11)
m4 <- lm(runs ~ wins, data = mlb11)
m5 <- lm(runs ~ bat_avg, data = mlb11)
summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -125.6  -47.0  -16.6   54.4  176.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.243    853.696   -3.27  0.00287 ** 
## at_bats         0.631      0.155    4.08  0.00034 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.5 on 28 degrees of freedom
## Multiple R-squared:  0.373,  Adjusted R-squared:  0.35 
## F-statistic: 16.6 on 1 and 28 DF,  p-value: 0.000339
summary(m2)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -91.61 -33.41   3.23  24.29 104.63 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  415.239     41.678    9.96  1.0e-10 ***
## homeruns       1.835      0.268    6.85  1.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.3 on 28 degrees of freedom
## Multiple R-squared:  0.627,  Adjusted R-squared:  0.613 
## F-statistic:   47 on 1 and 28 DF,  p-value: 1.9e-07
summary(m3)
## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103.72  -27.18   -5.23   19.32  140.69 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.560    151.181   -2.48    0.019 *  
## hits           0.759      0.107    7.09    1e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.2 on 28 degrees of freedom
## Multiple R-squared:  0.642,  Adjusted R-squared:  0.629 
## F-statistic: 50.2 on 1 and 28 DF,  p-value: 1.04e-07
summary(m4)
## 
## Call:
## lm(formula = runs ~ wins, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -145.45  -47.51   -7.48   47.35  142.19 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   342.12      89.22    3.83  0.00065 ***
## wins            4.34       1.09    3.98  0.00045 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.1 on 28 degrees of freedom
## Multiple R-squared:  0.361,  Adjusted R-squared:  0.338 
## F-statistic: 15.8 on 1 and 28 DF,  p-value: 0.000447
summary(m5)
## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -94.7  -26.3   -5.5   28.5  131.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -643        183   -3.51   0.0015 ** 
## bat_avg         5242        717    7.31  5.9e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.2 on 28 degrees of freedom
## Multiple R-squared:  0.656,  Adjusted R-squared:  0.644 
## F-statistic: 53.4 on 1 and 28 DF,  p-value: 5.88e-08

The answer comes from looking at the R-squared term for each model. Batting average explains over 64% of the variation in runs.

Question 10

Now look at three new variables. new_obs new_onbase new_slug and compare them to the more traditional indicators.

m6 <- lm(runs ~ new_obs, data = mlb11)
m7 <- lm(runs ~ new_slug, data = mlb11)
m8 <- lm(runs ~ new_onbase, data = mlb11)
summary(m6)
## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43.46 -13.69   1.16  13.94  41.16 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -686.6       68.9   -9.96    1e-10 ***
## new_obs       1919.4       95.7   20.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.4 on 28 degrees of freedom
## Multiple R-squared:  0.935,  Adjusted R-squared:  0.933 
## F-statistic:  402 on 1 and 28 DF,  p-value: <2e-16
summary(m7)
## 
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.41 -18.66  -0.91  16.29  52.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -375.8       68.7   -5.47  7.7e-06 ***
## new_slug      2681.3      171.8   15.60  2.4e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27 on 28 degrees of freedom
## Multiple R-squared:  0.897,  Adjusted R-squared:  0.893 
## F-statistic:  244 on 1 and 28 DF,  p-value: 2.42e-15
summary(m8)
## 
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -58.27 -18.33   3.25  19.52  69.00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1118        144   -7.74  2.0e-08 ***
## new_onbase      5654        450   12.55  5.1e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.6 on 28 degrees of freedom
## Multiple R-squared:  0.849,  Adjusted R-squared:  0.844 
## F-statistic:  158 on 1 and 28 DF,  p-value: 5.12e-13

Wow! The on base plus slugging measure explains just about everything.