Introduction to Linear Regression

The Data

download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")

Exercise 1: What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?

I would use a scatterplot to display the relationship between runs and at-bats.

plot(mlb11$at_bats,mlb11$runs)

# Yes, this relationship looks moderately linear. However, even if I knew a team's at_bats, I would not be fully comfortable using a linear model to predict runs, because the scatterplot shows considerable scatter around any line. In other words, at_bats does not appear to be an entirely reliable predictor of runs.
cor(mlb11$runs, mlb11$at_bats)

## [1] 0.610627

# At 0.61, the correlation coefficient is above 0.50 but suggests only a moderate linear relationship.

Exercise 2: Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.

There seems to be a positive, roughly linear relationship: teams with more at_bats tend to score more runs. The relationship is only moderately strong, however, because the observations are widely spread. While most observations follow the trend, some teams with few at_bats scored many runs and some teams with many at_bats scored few.

plot_ss(x = mlb11$at_bats, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

Exercise 3: Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?

The smallest sum of squares I got was 123,721.9, which I believe is the exact least squares minimum (the coefficients shown match the lm fit reported in the next section). This minimum is much lower than the values from some of my other attempts at finding a line of best fit by clicking on two points. The closest I otherwise came to it was 132,059.9.
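
The idea behind plot_ss can be sketched without the interactive plot. The following is a minimal illustration on made-up data (the x and y values and the helper name sum_sq are assumptions for illustration, not taken from mlb11). It shows that a hand-picked line has a sum of squared residuals at least as large as that of the least squares line found by lm.

```r
# Illustrative data only; these are NOT the mlb11 values
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

# Sum of squared residuals for the line y = intercept + slope * x
sum_sq <- function(intercept, slope, x, y) {
  sum((y - (intercept + slope * x))^2)
}

# Least squares line
fit <- lm(y ~ x)
b <- unname(coef(fit))
ss_ls <- sum_sq(b[1], b[2], x, y)

# A hand-picked line, analogous to clicking two points in plot_ss
ss_guess <- sum_sq(0.5, 1.8, x, y)

ss_ls <= ss_guess
```

No choice of intercept and slope can give a smaller value than ss_ls, which is why repeated guesses in plot_ss can approach the minimum but never beat it.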

The Linear Model

m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)

## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388
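
In a simple linear regression with one predictor, R-squared equals the square of the correlation coefficient. As a quick check using only the correlation printed earlier in this report (the variable names r and r_squared are illustrative):

```r
# Correlation between runs and at_bats, copied from the cor() output above
r <- 0.610627

# For a one-predictor model, R-squared is the squared correlation
r_squared <- r^2

# Compare with "Multiple R-squared" in summary(m1)
round(r_squared, 4)
```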

Exercise 4: Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?

m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)

## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07

# ŷ = 415.2389 + 1.8345 * homeruns
# The slope of 1.8345 tells us that each additional home run is associated with about 1.83 more runs scored, on average. Thus, a team's success at scoring runs is strongly tied to its home runs, with each home run corresponding to nearly two additional runs.
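
The slope can be read as the change in predicted runs for one additional home run. A minimal sketch using the rounded coefficients from summary(m2) (the function name predict_runs and the value of 150 home runs are hypothetical, chosen only for illustration):

```r
# Rounded coefficients copied from summary(m2)
b0 <- 415.2389  # intercept
b1 <- 1.8345    # slope for homeruns

# Predicted runs for a given number of home runs
predict_runs <- function(homeruns) b0 + b1 * homeruns

# Difference in predicted runs between 151 and 150 home runs
slope_effect <- predict_runs(151) - predict_runs(150)
slope_effect
```

The difference equals the slope regardless of the starting value, because the model is linear in homeruns.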

Prediction and Prediction Errors

plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)

Exercise 5: If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

A team manager would predict about 728 runs for a team with 5,578 at-bats based on the least squares regression line (-2789.2429 + 0.6305 * 5578 ≈ 727.7). This is an overestimate, as the actual data show that a team with about 5,578 at-bats scored roughly 700 runs. The residual is approximately -28.
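
The arithmetic behind this answer can be sketched as follows. The coefficients are the rounded values from summary(m1), and the observed value of 700 runs is the approximate figure read from the scatterplot in the answer above, not an exact data value.

```r
# Rounded coefficients copied from summary(m1)
b0 <- -2789.2429  # intercept
b1 <- 0.6305      # slope for at_bats

at_bats <- 5578
predicted_runs <- b0 + b1 * at_bats

# Approximate observed value read from the plot (an assumption, not exact data)
observed_runs <- 700

# Residual = observed - predicted; negative means the line overestimates
residual <- observed_runs - predicted_runs

round(predicted_runs)
round(residual)
```

With the fitted model object available, predict(m1, newdata = data.frame(at_bats = 5578)) gives the prediction without rounding the coefficients first.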

Model Diagnostics

Exercise 6: Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)

# It appears that the residuals corresponding to the middle portion of at-bats have higher variability than the residuals corresponding to low or high amounts of at-bats.

Exercise 7: Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?

hist(m1$residuals)

qqnorm(m1$residuals)
qqline(m1$residuals)

# The histogram and the normal probability plot both indicate that the residuals are very nearly normally distributed.

Exercise 8: Based on the plot in (1), does the constant variability condition appear to be met?

It seems that the constant variability condition is not entirely met in the residuals plot, given that the spread of the residuals is larger for at-bat values in the middle of the plot. However, the difference in spread is small enough that one could probably claim that the condition is met.

On Your Own:

1. Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

plot(mlb11$runs~mlb11$bat_avg)
m3 <- lm(runs~bat_avg,data = mlb11)
abline(m3)

# Yes, there does seem to be a linear relationship between batting average and runs. Higher batting averages are associated with more runs and vice versa.

2. How does this relationship compare to the relationship between runs and at_bats? Use the R-squared values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?

summary(m3)

## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08

# Both relationships are positive and roughly linear, but the linear relationship between batting average and runs is stronger than the one between at-bats and runs. The R-squared values confirm this: the model using at-bats has an R-squared of only 0.37, whereas the model using batting average has an R-squared of 0.66.
plot_ss(x = mlb11$bat_avg, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      -642.8       5242.2  
## 
## Sum of Squares:  67849.52

cor(mlb11$bat_avg,mlb11$runs)

## [1] 0.8099859

# Yes, batting average predicts runs better than at-bats. This can be inferred from the sum of squared residuals, which is far lower for batting average (67,849.52) than for at-bats (123,721.9).

3. Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).

# It appears that batting average is the best predictor of runs out of the seven traditionally used variables. It had the highest correlation coefficient, at 0.810, slightly edging out homeruns and hits, as well as the lowest sum of squared residuals, at 67,849.52.
plot_ss(x = mlb11$hits,y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   -375.5600       0.7589  
## 
## Sum of Squares:  70638.75

plot_ss(x = mlb11$homeruns,y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     415.239        1.835  
## 
## Sum of Squares:  73671.99

4. Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team's success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?

plot(mlb11$runs ~ mlb11$new_onbase)

plot(mlb11$runs ~ mlb11$new_slug)

plot(mlb11$runs ~ mlb11$new_obs)

plot_ss(x = mlb11$new_onbase,y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##       -1118         5654  
## 
## Sum of Squares:  29768.7

plot_ss(x = mlb11$new_slug,y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      -375.8       2681.3  
## 
## Sum of Squares:  20345.54

plot_ss(x = mlb11$new_obs,y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      -686.6       1919.4  
## 
## Sum of Squares:  12837.66

cor(mlb11$new_onbase,mlb11$runs)

## [1] 0.9214691

cor(mlb11$new_slug,mlb11$runs)

## [1] 0.9470324

cor(mlb11$new_obs,mlb11$runs)

## [1] 0.9669163

# In general, they are far more effective at predicting runs than the traditional variables. All three correlation coefficients were above 0.92, and the sums of squares for all three were lower than the sum of squares for batting average. This makes sense: unlike batting average, slugging percentage gives more weight to doubles, triples, and homeruns than to singles. Likewise, on-base percentage measures how frequently a batter gets on base by any means (hits, walks, or being hit by a pitch) rather than by hits alone. The best predictor of runs is new_obs, which had the highest correlation coefficient and the lowest sum of squares. This also makes sense, since it combines on-base percentage and slugging percentage.
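
The comparison can be summarized in one place. The sketch below ranks the variables using only the sums of squares printed by plot_ss in this report (the vector names ss, rse, and ranking are illustrative). It also converts each sum of squares to the residual standard error, sqrt(SS / 28), which can be compared with the values reported by summary().

```r
# Sums of squared residuals copied from the plot_ss output in this report
ss <- c(at_bats = 123721.9, bat_avg = 67849.52, hits = 70638.75,
        homeruns = 73671.99, new_onbase = 29768.7,
        new_slug = 20345.54, new_obs = 12837.66)

# Residual degrees of freedom reported by summary() for each model
df_resid <- 28

# Residual standard error implied by each sum of squares
rse <- sqrt(ss / df_resid)

# Smallest sum of squares (best fit) first
ranking <- sort(ss)
names(ranking)
round(rse[names(ranking)], 2)
```

Because every model uses the same response and the same number of observations, ranking by sum of squares gives the same order as ranking by R-squared.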

5. Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.

plot(mlb11$runs~mlb11$new_obs)
m4 <- lm(runs~new_obs,data = mlb11)
abline(m4)

summary(m4)

## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16

# The R-squared value of 0.9349 indicates that 93.5% of the variability in runs is explained by new_obs, the highest of any model fitted here. This supports new_obs as the best predictor of runs among the 10 variables.
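
The diagnostic checks themselves (residuals against the predictor, a histogram of residuals, and a normal probability plot) follow the same pattern used for m1 in the Model Diagnostics section. A minimal sketch of a reusable helper is given below. The function name check_diagnostics is hypothetical, and the demonstration uses R's built-in cars data as a stand-in so the block runs without mlb11. With the lab data loaded, the equivalent call would be check_diagnostics(m4, mlb11$new_obs).

```r
# Hypothetical helper: draws the three diagnostic plots used in this lab
check_diagnostics <- function(model, predictor) {
  res <- resid(model)

  # Show the three plots side by side, restoring settings afterwards
  old_par <- par(mfrow = c(1, 3))
  on.exit(par(old_par))

  # 1. Residuals vs. predictor: checks linearity and constant variability
  plot(predictor, res, xlab = "Predictor", ylab = "Residuals")
  abline(h = 0, lty = 3)

  # 2. Histogram of residuals: checks for rough symmetry
  hist(res, main = "Histogram of residuals", xlab = "Residuals")

  # 3. Normal probability plot: checks the nearly normal condition
  qqnorm(res)
  qqline(res)

  invisible(res)
}

# Stand-in example on built-in data (not the baseball data)
demo_fit <- lm(dist ~ speed, data = cars)
demo_res <- check_diagnostics(demo_fit, cars$speed)
```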