Batter up

The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average. Obtaining players who excelled in these underused statistics turned out to be much more affordable for the team.

In this lab we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.

The data

Let’s load up the data for the 2011 season.

load(url("http://www.openintro.org/stat/data/mlb11.RData"))
  1. What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?
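  • plot type: scatterplot
  • the relationship looks roughly linear, with a fair amount of scatter
  • a linear model is a reasonable way to predict runs from at_bats, although the scatter means the predictions will not be very precise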
library(ggplot2)  # provides ggplot(), aes() and geom_point()
ggplot(mlb11, aes(x = at_bats, y = runs)) + geom_point()

If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.

cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627
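
As a cross-check on what cor() reports, the correlation coefficient can be recomputed from its definition. The sketch below does not download anything: it assumes the runs and at_bats values are the ones printed in the data listing in the "Prediction and prediction errors" section of this report, and the names dx, dy and r_manual are only illustrative.

```r
# runs and at_bats for the 30 teams, copied from the data listing in this report
runs <- c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615,
          708, 644, 654, 735, 667, 713, 654, 704, 731, 743,
          619, 625, 610, 645, 707, 641, 624, 570, 593, 556)
at_bats <- c(5659, 5710, 5563, 5672, 5532, 5600, 5518, 5447, 5544, 5598,
             5585, 5436, 5549, 5612, 5513, 5579, 5502, 5509, 5421, 5559,
             5487, 5508, 5421, 5452, 5436, 5528, 5441, 5486, 5417, 5421)

# Pearson correlation from its definition:
# r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy the deviations from the means
dx <- at_bats - mean(at_bats)
dy <- runs - mean(runs)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(r_manual)                                  # hand-computed value
print(all.equal(r_manual, cor(runs, at_bats)))   # agreement with the built-in function
```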

Sum of squared residuals

  1. Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.
  • direction: positive
  • form: linear, no strong curvature
  • strength: moderate (r ≈ 0.61); the points are fairly widely scattered around the line
plot_ss(x = mlb11$at_bats, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9
  1. Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9
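  • smallest sum of squares: 123721.9
  • it’s the same on every run: no points were clicked in this knitted output, and the line drawn has the same coefficients (-2789.2429 and 0.6305) as the least squares fit from lm()
  • neighbors who knit the same code get the same value; a line chosen by clicking can only match or exceed it, because the least squares line minimizes the sum of squares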
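
To make the quantity reported by plot_ss concrete, here is a minimal sketch that computes the sum of squared residuals for any candidate line and compares the least squares line with a hand-picked one. It assumes the runs and at_bats values printed in the data listing of this report; sum_sq and the "guess" line through two observed teams are hypothetical helpers for illustration, not part of the lab code.

```r
# runs and at_bats for the 30 teams, copied from the data listing in this report
runs <- c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615,
          708, 644, 654, 735, 667, 713, 654, 704, 731, 743,
          619, 625, 610, 645, 707, 641, 624, 570, 593, 556)
at_bats <- c(5659, 5710, 5563, 5672, 5532, 5600, 5518, 5447, 5544, 5598,
             5585, 5436, 5549, 5612, 5513, 5579, 5502, 5509, 5421, 5559,
             5487, 5508, 5421, 5452, 5436, 5528, 5441, 5486, 5417, 5421)

# Sum of squared residuals for the line y = b0 + b1 * x
sum_sq <- function(b0, b1, x, y) sum((y - (b0 + b1 * x))^2)

# Least squares line
fit <- lm(runs ~ at_bats)
ss_ls <- sum_sq(coef(fit)[1], coef(fit)[2], at_bats, runs)

# A hand-picked line through two observed teams (team 30 and team 2),
# standing in for a line chosen by clicking two points
b1_guess <- (875 - 556) / (5710 - 5421)
b0_guess <- 556 - b1_guess * 5421
ss_guess <- sum_sq(b0_guess, b1_guess, at_bats, runs)

print(c(least_squares = ss_ls, hand_picked = ss_guess))
```

Any line other than the least squares line gives a larger sum of squares, which is why repeated runs cannot improve on the fitted value.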

The linear model

m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388
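
The coefficients in this output can also be obtained from summary statistics alone, which may help when reading the table. The sketch below assumes the runs and at_bats values printed in the data listing of this report and uses the standard simple-regression identities: slope = r * sd(y) / sd(x), intercept = mean(y) - slope * mean(x), and R-squared = r^2. The names b0, b1 and r2 are illustrative.

```r
# runs and at_bats for the 30 teams, copied from the data listing in this report
runs <- c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615,
          708, 644, 654, 735, 667, 713, 654, 704, 731, 743,
          619, 625, 610, 645, 707, 641, 624, 570, 593, 556)
at_bats <- c(5659, 5710, 5563, 5672, 5532, 5600, 5518, 5447, 5544, 5598,
             5585, 5436, 5549, 5612, 5513, 5579, 5502, 5509, 5421, 5559,
             5487, 5508, 5421, 5452, 5436, 5528, 5441, 5486, 5417, 5421)

r  <- cor(at_bats, runs)
b1 <- r * sd(runs) / sd(at_bats)        # slope from correlation and standard deviations
b0 <- mean(runs) - b1 * mean(at_bats)   # the line passes through (mean x, mean y)
r2 <- r^2                               # in simple regression, R-squared is r squared

fit <- lm(runs ~ at_bats)
print(c(b0 = b0, b1 = b1, r2 = r2))
print(coef(fit))                        # should agree with b0 and b1
print(summary(fit)$r.squared)           # should agree with r2
```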
  1. Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?
plot_ss(x = mlb11$homeruns, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     415.239        1.835  
## 
## Sum of Squares:  73671.99
cor(mlb11$runs, mlb11$homeruns)
## [1] 0.7915577
summary(lm(runs ~ homeruns, data = mlb11))
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07

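\(\hat{\beta}_0\) = 415.2389
\(\hat{\beta}_1\) = 1.8345
\(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\)
=> predicted runs = 415.2389 + 1.8345 * homeruns
The slope tells us that each additional home run is associated with about 1.83 more runs scored over the season, on average.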
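
The fitted equation can be wrapped in a small function to make predictions. This is only a sketch: the coefficients are copied from the summary() output above, predict_runs is a hypothetical helper, and the inputs of 150 and 151 home runs are made-up values chosen for illustration.

```r
# Coefficients copied from the summary() output of lm(runs ~ homeruns)
b0 <- 415.2389
b1 <- 1.8345

# Predicted runs for a given season total of home runs
predict_runs <- function(homeruns) b0 + b1 * homeruns

p150 <- predict_runs(150)   # hypothetical team with 150 home runs
p151 <- predict_runs(151)   # one more home run

print(p150)
print(p151 - p150)          # the difference between the two predictions is the slope
```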

Prediction and prediction errors

Let’s create a scatterplot with the least squares line laid on top.

plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)

  1. If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?
data.frame(mlb11$runs,mlb11$at_bats)
##    mlb11.runs mlb11.at_bats
## 1         855          5659
## 2         875          5710
## 3         787          5563
## 4         730          5672
## 5         762          5532
## 6         718          5600
## 7         867          5518
## 8         721          5447
## 9         735          5544
## 10        615          5598
## 11        708          5585
## 12        644          5436
## 13        654          5549
## 14        735          5612
## 15        667          5513
## 16        713          5579
## 17        654          5502
## 18        704          5509
## 19        731          5421
## 20        743          5559
## 21        619          5487
## 22        625          5508
## 23        610          5421
## 24        645          5452
## 25        707          5436
## 26        641          5528
## 27        624          5441
## 28        570          5486
## 29        593          5417
## 30        556          5421
  • given: at_bats = 5578
  • predicted_runs = -2789.2429 + 0.6305 ∗ 5578 = 727.6861
  • no team has exactly 5578 at-bats; the closest team has at_bats = 5579 and observed_runs = 713
  • residual: e = observed_runs - predicted_runs = 713 - 727.6861 = -14.6861
  • the residual is negative, so the line overestimates the runs, by about 15
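
The same calculation can be done with predict() rather than by hand. The sketch below assumes the runs and at_bats values printed in the listing above. Because lm() keeps the unrounded coefficients, and because the residual here is measured against the fitted value at 5579 at-bats, the numbers can differ slightly from the hand calculation. The names pred_5578, closest and res_closest are illustrative.

```r
# runs and at_bats for the 30 teams, copied from the data listing in this report
runs <- c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615,
          708, 644, 654, 735, 667, 713, 654, 704, 731, 743,
          619, 625, 610, 645, 707, 641, 624, 570, 593, 556)
at_bats <- c(5659, 5710, 5563, 5672, 5532, 5600, 5518, 5447, 5544, 5598,
             5585, 5436, 5549, 5612, 5513, 5579, 5502, 5509, 5421, 5559,
             5487, 5508, 5421, 5452, 5436, 5528, 5441, 5486, 5417, 5421)

m1 <- lm(runs ~ at_bats)

# Predicted runs for a team with 5578 at-bats
pred_5578 <- predict(m1, newdata = data.frame(at_bats = 5578))

# Index of the observed team whose at-bats are closest to 5578
closest <- which.min(abs(at_bats - 5578))

# Residual for that team: observed runs minus its fitted value
res_closest <- runs[closest] - fitted(m1)[closest]

print(pred_5578)
print(c(team = closest, at_bats = at_bats[closest], runs = runs[closest]))
print(res_closest)   # a negative value means the line overestimates
```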

Model diagnostics

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

  1. Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?
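  • no apparent pattern in the residuals plot: the residuals scatter randomly around zero, which suggests that a linear model is reasonable for runs and at-bats
  • the spread of the residuals looks about the same across the range of at_bats, so constant variability also seems satisfied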

Nearly normal residuals: To check this condition, we can look at a histogram

hist(m1$residuals)

or a normal probability plot of the residuals.

qqnorm(m1$residuals)
qqline(m1$residuals)  # adds diagonal line to the normal prob plot

  1. Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?
  • the nearly normal residuals condition appears to be met

Constant variability:

  1. Based on the plot in (1), does the constant variability condition appear to be met?
  • constant variability: satisfied
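
The visual checks above can be complemented by a few rough numbers. This is an informal sketch rather than a formal test. It assumes the runs and at_bats values printed in the data listing of this report, and the choice to split teams at the median number of at-bats is an arbitrary assumption made for illustration.

```r
# runs and at_bats for the 30 teams, copied from the data listing in this report
runs <- c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615,
          708, 644, 654, 735, 667, 713, 654, 704, 731, 743,
          619, 625, 610, 645, 707, 641, 624, 570, 593, 556)
at_bats <- c(5659, 5710, 5563, 5672, 5532, 5600, 5518, 5447, 5544, 5598,
             5585, 5436, 5549, 5612, 5513, 5579, 5502, 5509, 5421, 5559,
             5487, 5508, 5421, 5452, 5436, 5528, 5441, 5486, 5417, 5421)

m1 <- lm(runs ~ at_bats)
res <- resid(m1)

# Residuals of a least squares fit with an intercept average to zero
print(mean(res))

# Coordinates that qqnorm() would plot, computed without drawing anything;
# a correlation close to 1 means the points lie close to a straight line
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)

# Residual spread for teams below vs. at or above the median number of at-bats;
# broadly similar values are consistent with constant variability
low <- at_bats < median(at_bats)
spread <- c(low = sd(res[low]), high = sd(res[!low]))
print(spread)
```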

On Your Own

plot(x = mlb11$new_obs, y = mlb11$runs)

  • strong linear relationship
summary(lm(runs ~ new_obs, data = mlb11))
## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16
no <- summary(lm(runs ~ new_obs, data = mlb11))$r.squared  # R-squared of the new_obs model
r <- summary(m1)$r.squared                                 # R-squared of the at_bats model
  • this relationship looks much stronger than the relationship between runs and at_bats
  • the R^2 for new_obs is 0.9349271 and the R^2 for at_bats is 0.3728654
  • new_obs is a better predictor of runs than at_bats
plot(x = mlb11$bat_avg, y = mlb11$runs)

summary(lm(runs ~ bat_avg, data = mlb11))
## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08

Comparing the R-squared values, we find that bat_avg (batting average), with R-squared = 0.6561, is the best predictor of runs scored among the “traditional” variables. For comparison, homeruns gives 0.6266 and at_bats gives 0.3729.

summary(lm(runs ~ new_onbase, data = mlb11))$r.squared
## [1] 0.8491053
summary(lm(runs ~ new_slug, data = mlb11))$r.squared
## [1] 0.8968704
summary(lm(runs ~ hits, data = mlb11))$r.squared
## [1] 0.6419388
plot(mlb11$runs~mlb11$hits)
abline(lm(runs ~ hits, data = mlb11))

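  • the newer stats are better predictors than the traditional stats: new_obs (R^2 = 0.9349), new_slug (0.8969), and new_onbase (0.8491) all exceed bat_avg (0.6561), and new_obs is the best of them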
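
The model-by-model comparison above can be automated. The sketch below defines a hypothetical helper, rank_predictors, that fits one simple regression per predictor and sorts the R-squared values. To stay self-contained it is demonstrated only on the runs and at_bats values printed in the data listing of this report; with mlb11 loaded, the same call could be given the other column names.

```r
# runs and at_bats for the 30 teams, copied from the data listing in this report
runs <- c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615,
          708, 644, 654, 735, 667, 713, 654, 704, 731, 743,
          619, 625, 610, 645, 707, 641, 624, 570, 593, 556)
at_bats <- c(5659, 5710, 5563, 5672, 5532, 5600, 5518, 5447, 5544, 5598,
             5585, 5436, 5549, 5612, 5513, 5579, 5502, 5509, 5421, 5559,
             5487, 5508, 5421, 5452, 5436, 5528, 5441, 5486, 5417, 5421)

# Fit response ~ predictor for each predictor and return R-squared, largest first
rank_predictors <- function(data, response, predictors) {
  r2 <- sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$r.squared
  })
  sort(r2, decreasing = TRUE)
}

teams <- data.frame(runs = runs, at_bats = at_bats)
r2_table <- rank_predictors(teams, "runs", "at_bats")
print(r2_table)

# With the full data set loaded, the call might look like:
# rank_predictors(mlb11, "runs", c("at_bats", "hits", "homeruns", "bat_avg",
#                                  "new_onbase", "new_slug", "new_obs"))
```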
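m_obs <- lm(runs ~ new_obs, data = mlb11)  # diagnostics use the new_obs model, not m1 (at_bats)
plot(m_obs$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3)  # horizontal dashed line at y = 0

hist(m_obs$residuals)

qqnorm(m_obs$residuals)
qqline(m_obs$residuals)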

  • No overwhelming pattern
  • Histogram looks normal
  • Normal probability plot of residuals looks good