Introduction to Linear Regression

download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")

Exercise 1

What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?

I would use a scatterplot. The relationship does appear to have a linear trend. As such, if I knew a team’s at_bats, I would be comfortable using a linear model to predict the number of runs.

plot(mlb11$at_bats, mlb11$runs, xlab = "At-Bats", ylab = "Runs")

Exercise 2

Describe the relationship between at_bats and runs. Discuss the form, direction, and strength of the relationship as well as any unusual observations.

round(cor(mlb11$runs, mlb11$at_bats),3)

## [1] 0.611

The scatterplot above shows no obvious curvature, so the form of the relationship is roughly linear. The direction is positive, and with a correlation of 0.611 the strength is moderate. The data do include a few outliers, but none extreme enough to change these conclusions.
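As a quick check on strength, squaring the correlation gives the share of the variability in runs that a straight-line fit on at_bats can explain. A minimal sketch using the rounded value printed above (the variable names are illustrative only):

```r
r <- 0.611            # correlation printed above, rounded to 3 decimals
r_squared <- r^2      # proportion of variance explained in simple regression
round(r_squared, 3)
```

Up to rounding, this agrees with the Multiple R-squared of 0.3729 that summary(m1) reports later in the lab.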

Exercise 3

Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

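The lowest sum of squares I got by choosing a line by hand was 136,450.8. For comparison, the run shown above reports 123,721.9, and its coefficients (-2789.2429 and 0.6305) match the least squares fit reported by summary(m1) later in this lab, so that value is the smallest sum of squares any line can achieve for these data.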
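To make the quantity being minimized concrete, here is a small sketch that computes the sum of squared residuals for any candidate line. It assumes only the nine teams shown in the head() output of Exercise 5 (a subset, not the full 30-team data), and sum_sq is a hypothetical helper, not part of plot_ss.

```r
# Nine teams copied from the head() output in Exercise 5 (subset only)
at_bats <- c(5710, 5672, 5659, 5612, 5600, 5598, 5585, 5579, 5563)
runs    <- c(875, 730, 855, 735, 718, 615, 708, 713, 787)

# Sum of squared residuals for the line y = intercept + slope * x
sum_sq <- function(intercept, slope, x, y) {
  sum((y - (intercept + slope * x))^2)
}

# Least squares line for this subset
fit <- lm(runs ~ at_bats)
ss_ls <- sum_sq(unname(coef(fit)[1]), unname(coef(fit)[2]), at_bats, runs)

# The full-data line from the output above, applied to the same nine teams
ss_other <- sum_sq(-2789.2429, 0.6305, at_bats, runs)

c(least_squares = ss_ls, other_line = ss_other)
```

Whatever line is tried by hand, its sum of squares cannot fall below the value for the line that lm() returns on the same data.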

Exercise 4

Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?

home.runs = lm(mlb11$homeruns ~ mlb11$runs)  # note: as written, homeruns is the response and runs is the predictor
home.runs

## 
## Call:
## lm(formula = mlb11$homeruns ~ mlb11$runs)
## 
## Coefficients:
## (Intercept)   mlb11$runs  
##    -85.1566       0.3415

summary(home.runs)

## 
## Call:
## lm(formula = mlb11$homeruns ~ mlb11$runs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.067 -15.794   3.702  15.766  39.232 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -85.15663   34.79698  -2.447   0.0209 *  
## mlb11$runs    0.34154    0.04983   6.854  1.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.13 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07

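As fitted, the model treats homeruns as the response, so the output gives homeruns-hat = -85.16 + (0.342)(runs): each additional run is associated with about 0.342 more home runs. The exercise asks for the reverse direction, lm(runs ~ homeruns, data = mlb11). In simple regression the R-squared (0.6266) is the same in both directions and equals the product of the two slopes, so the slope for predicting runs from homeruns is about 0.6266 / 0.3415, or roughly 1.83. In context, each additional home run is associated with an estimated increase of about 1.83 runs, so teams that hit more home runs tend to score more.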
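Because an R formula reads response ~ predictor, the direction of the fit matters. The sketch below fits both directions on the nine teams listed in the Exercise 5 table (a subset, so the coefficients will differ from the full-data output) and checks that the product of the two slopes equals R-squared. The object names are illustrative.

```r
# Nine teams copied from the head() output in Exercise 5 (subset only)
runs     <- c(875, 730, 855, 735, 718, 615, 708, 713, 787)
homeruns <- c(203, 129, 210, 183, 108, 95, 191, 153, 169)

runs_on_hr <- lm(runs ~ homeruns)   # homeruns predicts runs (what the exercise asks)
hr_on_runs <- lm(homeruns ~ runs)   # runs predicts homeruns (the reversed fit)

# In simple regression the two slopes multiply to R-squared
slope_product <- unname(coef(runs_on_hr)["homeruns"] * coef(hr_on_runs)["runs"])
r_squared <- summary(runs_on_hr)$r.squared

c(slope_product = slope_product, r_squared = r_squared)
```

This identity is what lets one slope be recovered from the other when only one direction has been fitted.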

Exercise 5

If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

pred = -2789.2429 + 0.63058 * 5578  # slope entered as 0.63058; the model summary reports 0.6305
pred

## [1] 728.1323

head(mlb11[order(mlb11$at_bats, decreasing = TRUE), ], 9)

##                     team runs at_bats hits homeruns bat_avg strikeouts
## 2         Boston Red Sox  875    5710 1600      203   0.280       1108
## 4     Kansas City Royals  730    5672 1560      129   0.275       1006
## 1          Texas Rangers  855    5659 1599      210   0.283        930
## 14       Cincinnati Reds  735    5612 1438      183   0.256       1250
## 6          New York Mets  718    5600 1477      108   0.264       1085
## 10        Houston Astros  615    5598 1442       95   0.258       1164
## 11     Baltimore Orioles  708    5585 1434      191   0.257       1120
## 16 Philadelphia Phillies  713    5579 1409      153   0.253       1024
## 3         Detroit Tigers  787    5563 1540      169   0.277       1143
##    stolen_bases wins new_onbase new_slug new_obs
## 2           102   90      0.349    0.461   0.810
## 4           153   71      0.329    0.415   0.744
## 1           143   96      0.340    0.460   0.800
## 14           97   79      0.326    0.408   0.734
## 6           130   77      0.335    0.391   0.725
## 10          118   56      0.311    0.374   0.684
## 11           81   69      0.316    0.413   0.729
## 16           96  102      0.323    0.395   0.717
## 3            49   95      0.340    0.434   0.773

mlb11[16,2]-pred

## [1] -15.13234

Using the least squares regression line, a team manager would predict approximately 728 runs for a team with 5,578 at-bats. No team in the data has exactly 5,578 at-bats; the closest is the Philadelphia Phillies (row 16 of the dataset), with 5,579 at-bats and 713 runs. Taking that team as the comparison, the residual is 713 - 728.13 = -15.13, so the line overestimates by about 15 runs.
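The prediction and residual can also be sketched as a small function. This version assumes the rounded coefficients printed in the at_bats output (-2789.2429 and 0.6305) and evaluates the Phillies at their own 5,579 at-bats, so its numbers differ slightly from those above. predict_runs is a hypothetical helper name.

```r
# Coefficients as printed in the at_bats model output (rounded)
intercept <- -2789.2429
slope     <- 0.6305

predict_runs <- function(at_bats) intercept + slope * at_bats

# The manager's hypothetical team
pred_5578 <- predict_runs(5578)

# Philadelphia Phillies, from the table above: 5579 at-bats, 713 runs
observed   <- 713
fitted_phi <- predict_runs(5579)
resid_phi  <- observed - fitted_phi   # negative means the line overestimates

c(prediction = pred_5578, residual = resid_phi)
```

Once the model object m1 from Exercise 6 exists, predict(m1, newdata = data.frame(at_bats = 5578)) would do the same calculation with the unrounded coefficients.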

Exercise 6

Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?

m1 = lm(runs ~ at_bats, data = mlb11)
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)

There is no apparent pattern: the residuals scatter above and below zero across the range of at-bats, which is consistent with a linear relationship between runs and at-bats.

Exercise 7

Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?

hist(m1$residuals)

qqnorm(m1$residuals)
qqline(m1$residuals)

Yes. The histogram appears roughly normal with an almost unnoticeable right skew, and the points in the normal probability plot fall mostly along the line, so the nearly normal residuals condition appears to be met.

Exercise 8

Based on the plot in (1), does the constant variability condition appear to be met?

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)

Yes, the constant variability condition appears to be met. The points do not fan out or narrow like a funnel, and the spread of the residuals looks roughly constant throughout.

ON YOUR OWN

1. Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

mhits = lm(runs ~ hits, data = mlb11)
plot(mlb11$runs ~ mlb11$hits)
abline(mhits)

I would assume that the more hits a team makes, the more chances it has to score a run. After creating a scatterplot and fitting a linear model, there appears, at a glance, to be a linear relationship between hits and runs.

2. How does this relationship compare to the relationship between runs and at_bats (compare R^2 values from the two model summaries)? Does your variable seem to predict runs better than at_bats? How can you tell?

summary(m1)

## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388

summary(mhits)

## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## hits           0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07

Whether we compare the multiple R-squared (0.6419 vs. 0.3729) or the adjusted R-squared (0.6292 vs. 0.3505) of the two models, hits appears to predict runs better than at_bats: the higher R-squared means the hits model explains a larger share of the variability in runs.

3. Investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (only include output for the best variable, not all five).

# batting average
mbatav = lm(runs ~ bat_avg, data = mlb11)
plot(mlb11$runs ~ mlb11$bat_avg)
abline(mbatav)

summary(mbatav)

## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08

Batting average appeared to be the traditional variable that best predicted runs, although its R-squared (approximately 65.61%) was only slightly higher than that of the hits variable (approximately 64.19%).

4. Now examine the three newer variables. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?

# on base
summary(lm(runs ~ new_onbase, data = mlb11))

## 
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.270 -18.335   3.249  19.520  69.002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1118.4      144.5  -7.741 1.97e-08 ***
## new_onbase    5654.3      450.5  12.552 5.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared:  0.8491, Adjusted R-squared:  0.8437 
## F-statistic: 157.6 on 1 and 28 DF,  p-value: 5.116e-13

# slug
summary(lm(runs ~ new_slug, data = mlb11))

## 
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.41 -18.66  -0.91  16.29  52.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -375.80      68.71   -5.47 7.70e-06 ***
## new_slug     2681.33     171.83   15.61 2.42e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared:  0.8969, Adjusted R-squared:  0.8932 
## F-statistic: 243.5 on 1 and 28 DF,  p-value: 2.42e-15

# obs
summary(lm(runs ~ new_obs, data = mlb11))

## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16

The three new variables are more effective at predicting runs, with R-squared values ranging from 84.91% to 93.49%, a substantial increase over the best traditional variable (bat_avg, with an R-squared of 65.61%). Overall, new_obs was the best predictor of runs.
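Rather than calling summary() once per variable, the R-squared comparison can be collected in one pass. The sketch below assumes only the nine teams printed in Exercise 5 and four of the predictors, so the values and even the ranking may differ from the full-data results above. The teams data frame and r_squared vector are illustrative names.

```r
# Nine teams copied from the head() output in Exercise 5 (subset only)
teams <- data.frame(
  runs    = c(875, 730, 855, 735, 718, 615, 708, 713, 787),
  at_bats = c(5710, 5672, 5659, 5612, 5600, 5598, 5585, 5579, 5563),
  hits    = c(1600, 1560, 1599, 1438, 1477, 1442, 1434, 1409, 1540),
  bat_avg = c(0.280, 0.275, 0.283, 0.256, 0.264, 0.258, 0.257, 0.253, 0.277),
  new_obs = c(0.810, 0.744, 0.800, 0.734, 0.725, 0.684, 0.729, 0.717, 0.773)
)

predictors <- setdiff(names(teams), "runs")

# Fit runs ~ <predictor> for each column and keep the R-squared
r_squared <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "runs"), data = teams)
  summary(fit)$r.squared
})

sort(r_squared, decreasing = TRUE)
```

With the full data loaded, replacing teams with mlb11 and listing all ten predictor columns would give the same comparison in a single table.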

5. Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.

# scatterplot and regression line
obs = lm(runs ~ new_obs, data = mlb11)
plot(mlb11$runs ~ mlb11$new_obs, main = "New_Obs Scatterplot and Regression Line")
abline(obs)

# histogram
hist(obs$residuals, main = "New_Obs Residuals")

# normal probability plot
qqnorm(obs$residuals)
qqline(obs$residuals)

# constant variability
plot(obs$residuals ~ mlb11$new_obs, main = "Variability")
abline(h = 0, lty = 3)

There appears to be a linear relationship between the two variables, and the residuals appear nearly normal, as seen in the symmetric, unimodal histogram. Normality is further supported by most points falling along or close to the line on the normal probability plot. Finally, there appears to be constant variability, as seen in the final plot, where the points do not fan out or follow any particular pattern that would suggest otherwise.
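Since the same three checks are run for every model in this lab, they can be bundled into one helper. This is a sketch only: check_diagnostics is a hypothetical function, and the example fits the nine teams printed in Exercise 5 rather than the full data.

```r
# Draw the three diagnostic plots used in this lab and return the residuals
check_diagnostics <- function(model, x, label = "predictor") {
  res <- resid(model)
  old_par <- par(mfrow = c(1, 3))   # three panels side by side
  on.exit(par(old_par))             # restore the plotting layout afterwards
  plot(x, res, xlab = label, ylab = "Residuals", main = "Residuals vs predictor")
  abline(h = 0, lty = 3)
  hist(res, main = "Residual histogram", xlab = "Residuals")
  qqnorm(res)
  qqline(res)
  invisible(res)
}

# Nine teams copied from the head() output in Exercise 5 (subset only)
runs    <- c(875, 730, 855, 735, 718, 615, 708, 713, 787)
new_obs <- c(0.810, 0.744, 0.800, 0.734, 0.725, 0.684, 0.729, 0.717, 0.773)

fit <- lm(runs ~ new_obs)
res <- check_diagnostics(fit, new_obs, label = "new_obs")
```

With the full data, the equivalent call would be check_diagnostics(obs, mlb11$new_obs, label = "new_obs") using the obs model fitted above.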

Georgia Galanopoulos