MC MATH217 R Lab 8: Intro to Linear Regression

Loading data

download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")

Exercise 1

Plotting scatterplot of runs vs. at-bats

plot(x = mlb11$at_bats, y = mlb11$runs, xlab = "At-Bats", ylab = "Runs")

The relationship between “at-bats” and “runs” appears loosely linear, so I would feel reasonably comfortable using a linear model to predict runs from at-bats.

Finding correlation coefficient between “at-bats” and “runs”

cor(mlb11$at_bats, mlb11$runs)

## [1] 0.610627

Exercise 2

The relationship between “at-bats” and “runs” is positive and moderately linear, with a correlation coefficient of about 0.61. Most teams in our scatterplot had between 5400 and 5600 at-bats, with one team having over 5700 at-bats and over 850 runs. One team seems extremely efficient, scoring over 850 runs in under 5525 at-bats. A few seem particularly inefficient, with one scoring around 725 runs in about 5675 at-bats and another scoring just over 500 runs in just under 5600 at-bats, among others.

Creating interactive plots

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

Exercise 3

I wasn’t able to select my own line by clicking (probably because this runs in a non-interactive markdown document rather than the console), but running the chunk above appears to have fitted the least squares line automatically, giving a sum of squares of 123721.9.

Using lm function to create a linear model that uses at-bats to predict runs

m1 <- lm(runs ~ at_bats, data = mlb11)

summary(m1)

## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388
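
As a quick sanity check on this summary, two of its numbers can be recovered from values printed earlier in the lab. This is only a sketch that uses the rounded figures as printed (the names `r`, `sse`, and `df` are local variables chosen for illustration), so the results agree only up to rounding.

```r
# Values copied from the output printed earlier in this lab (rounded)
r   <- 0.610627   # correlation between at-bats and runs
sse <- 123721.9   # sum of squares reported by plot_ss
df  <- 28         # residual degrees of freedom from summary(m1)

# In simple linear regression, R-squared equals the squared correlation
r_squared <- r^2
round(r_squared, 4)

# Residual standard error is sqrt(sum of squared residuals / df)
rse <- sqrt(sse / df)
round(rse, 2)
```

These come out to about 0.3729 and 66.47, which matches the Multiple R-squared and Residual standard error lines in the summary.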

Exercise 4

Fitting new linear model that uses home runs to predict runs

m2 <- lm(runs ~ homeruns, data = mlb11)

summary(m2)

## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07

Our equation for the regression line is: predicted runs = 415.2389 + 1.8345 * homeruns. The positive slope means each additional home run is associated with about 1.83 more runs on average. Our multiple R-squared value shows 62.66% of the variability in runs is explained by home runs, a noticeably better fit than at-bats (37.29%). This suggests that teams that hit more home runs tend to score more runs.
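
To make the home run equation concrete, here is a minimal sketch that plugs values into it. The coefficients are the rounded ones printed by summary(m2); the function name `predict_runs_from_hr` and the 150 home run team are hypothetical, chosen only for illustration.

```r
# Coefficients copied from summary(m2) above (rounded as printed)
intercept <- 415.2389
slope     <- 1.8345

# Predicted runs for a given number of home runs
predict_runs_from_hr <- function(homeruns) {
  intercept + slope * homeruns
}

# Hypothetical team with 150 home runs (illustrative value, not from the data)
predict_runs_from_hr(150)

# Difference in predicted runs for 10 additional home runs (10 * slope)
predict_runs_from_hr(160) - predict_runs_from_hr(150)
```

The second result is simply ten times the slope, which is one way to read the slope: it is the change in predicted runs per extra home run.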

Creating scatterplot with least squares line laid on top

plot(mlb11$runs ~ mlb11$at_bats, xlab = "At-Bats", ylab = "Runs")
abline(m1)

Exercise 5

-2789.2429 + 0.6305 * 5578

## [1] 727.6861

We would predict 727.6861 runs, or about 728. No team in the data had exactly 5578 at-bats, so there is no observed value to compute a residual against. The closest is the Phillies, with 5579 at-bats and 713 runs. Judging by that team, the prediction is an overestimate, and we can calculate the residual for them.

713 - (-2789.2429 + 0.6305 * 5579) # residual = observed - predicted

## [1] -15.3166

Our residual for the Phillies is -15.3166 (observed minus predicted), so our least squares line overestimates this data point by just over 15 runs.
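
The residual calculation can also be written out step by step. This is a sketch using the rounded coefficients printed by summary(m1); the function name `predict_runs_from_ab` is hypothetical, and the convention used is residual = observed - predicted.

```r
# Coefficients copied from summary(m1) (rounded as printed)
predict_runs_from_ab <- function(at_bats, intercept = -2789.2429, slope = 0.6305) {
  intercept + slope * at_bats
}

observed       <- 713                         # Phillies' runs, as used in this exercise
predicted      <- predict_runs_from_ab(5579)  # Phillies' at-bats
resid_phillies <- observed - predicted        # negative => the line overestimates

predicted
resid_phillies

# With the fitted model in the workspace, a similar prediction is available via
# predict(m1, newdata = data.frame(at_bats = 5579))
```

A negative residual means the observed value sits below the line. The predict() call uses the unrounded coefficients, so its result may differ slightly in the last digits.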

Plotting residuals

plot(m1$residuals ~ mlb11$at_bats, xlab = "At-Bats", ylab = "M1 Residuals")
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0

Exercise 6

There is no apparent pattern in our residual plot, which suggests a linear relationship between runs and at-bats is reasonable.

Creating histogram of residuals

hist(m1$residuals)

Creating normal probability plot of residuals

qqnorm(m1$residuals)
qqline(m1$residuals) # adds diagonal line to normal probability plot

Exercise 7

Our histogram doesn’t follow a normal distribution closely, but it isn’t drastically off. The points in the normal probability plot follow the line much more reasonably. It isn’t clear-cut, but I would say the nearly normal residuals condition appears to be met.

Exercise 8

Our residual plot from Exercise 6 shows no pattern (such as a fan shape), so the constant variability condition appears to be met.

On Your Own 1 and 2

Exploring runs vs. wins

plot(x = mlb11$wins, y = mlb11$runs, xlab = "Wins", ylab = "Runs")

m3 <- lm(runs ~ wins, data = mlb11)

summary(m3)

## 
## Call:
## lm(formula = runs ~ wins, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -145.450  -47.506   -7.482   47.346  142.186 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  342.121     89.223   3.834 0.000654 ***
## wins           4.341      1.092   3.977 0.000447 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.1 on 28 degrees of freedom
## Multiple R-squared:  0.361,  Adjusted R-squared:  0.3381 
## F-statistic: 15.82 on 1 and 28 DF,  p-value: 0.0004469

Our scatterplot for runs vs. wins appears loosely linear, like our scatterplot for runs vs. at-bats. Our multiple R-squared value here (0.361) is slightly lower than the one for runs vs. at-bats (0.3729), so wins seems to predict runs slightly worse than at-bats.

On Your Own 3

Exploring runs vs. hits

plot(x = mlb11$hits, y = mlb11$runs, xlab = "Hits", ylab = "Runs")

m4 <- lm(runs ~ hits, data = mlb11)

summary(m4)

## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## hits           0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07

Exploring runs vs. batting average

plot(x = mlb11$bat_avg, y = mlb11$runs, xlab = "Batting Average", ylab = "Runs")

m5 <- lm(runs ~ bat_avg, data = mlb11)

summary(m5)

## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08

Exploring runs vs. strikeouts

plot(x = mlb11$strikeouts, y = mlb11$runs, xlab = "Strikeouts", ylab = "Runs")

m6 <- lm(runs ~ strikeouts, data = mlb11)

summary(m6)

## 
## Call:
## lm(formula = runs ~ strikeouts, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -132.27  -46.95  -11.92   55.14  169.76 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1054.7342   151.7890   6.949 1.49e-07 ***
## strikeouts    -0.3141     0.1315  -2.389   0.0239 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.5 on 28 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1397 
## F-statistic: 5.709 on 1 and 28 DF,  p-value: 0.02386

Exploring runs vs. stolen bases

plot(x = mlb11$stolen_bases, y = mlb11$runs, xlab = "Stolen Bases", ylab = "Runs")

m7 <- lm(runs ~ stolen_bases, data = mlb11)

summary(m7)

## 
## Call:
## lm(formula = runs ~ stolen_bases, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -139.94  -62.87   10.01   38.54  182.49 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  677.3074    58.9751  11.485 4.17e-12 ***
## stolen_bases   0.1491     0.5211   0.286    0.777    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.82 on 28 degrees of freedom
## Multiple R-squared:  0.002914,   Adjusted R-squared:  -0.0327 
## F-statistic: 0.08183 on 1 and 28 DF,  p-value: 0.7769

Further exploring runs vs. batting average with a scatterplot with a least squares line laid on top and also plotting the residuals

plot(mlb11$runs ~ mlb11$bat_avg, xlab = "Batting Average", ylab = "Runs")
abline(m5)

plot(m5$residuals ~ mlb11$bat_avg, xlab = "Batting Average", ylab = "M5 Residuals")
abline(h = 0, lty = 3)

Batting average appears to be the best predictor of runs out of our remaining five variables. It has the highest multiple R-squared value of the five, 0.6561, narrowly ahead of hits (0.6419).
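
Rather than fitting each model by hand, the same comparison can be sketched as a loop. The helper name `r_squared_by_predictor` is hypothetical and not part of the lab; the column names are the ones used earlier in this document, and the block only does something if `mlb11` has already been loaded.

```r
# Fit response ~ predictor for each predictor and collect the R-squared values.
# `r_squared_by_predictor` is a hypothetical helper name, not part of the lab.
r_squared_by_predictor <- function(predictors, response, data) {
  sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$r.squared
  })
}

# Only runs if the lab data has been loaded into the workspace
if (exists("mlb11")) {
  traditional <- c("at_bats", "hits", "homeruns", "bat_avg",
                   "strikeouts", "stolen_bases", "wins")
  print(sort(r_squared_by_predictor(traditional, "runs", mlb11), decreasing = TRUE))
}
```

Sorting in decreasing order puts the strongest single predictor first, which makes the ranking easier to read than ten separate summary() printouts.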

On Your Own 4

Exploring runs vs. new on-base percentage

plot(x = mlb11$new_onbase, y = mlb11$runs, xlab = "New On-Base Percentage", ylab = "Runs")

m8 <- lm(runs ~ new_onbase, data = mlb11)

summary(m8)

## 
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.270 -18.335   3.249  19.520  69.002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1118.4      144.5  -7.741 1.97e-08 ***
## new_onbase    5654.3      450.5  12.552 5.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared:  0.8491, Adjusted R-squared:  0.8437 
## F-statistic: 157.6 on 1 and 28 DF,  p-value: 5.116e-13

Exploring runs vs. new slugging percentage

plot(x = mlb11$new_slug, y = mlb11$runs, xlab = "New Slugging Percentage", ylab = "Runs")

m9 <- lm(runs ~ new_slug, data = mlb11)

summary(m9)

## 
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.41 -18.66  -0.91  16.29  52.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -375.80      68.71   -5.47 7.70e-06 ***
## new_slug     2681.33     171.83   15.61 2.42e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared:  0.8969, Adjusted R-squared:  0.8932 
## F-statistic: 243.5 on 1 and 28 DF,  p-value: 2.42e-15

Exploring runs vs. new on-base plus slugging percentage

plot(x = mlb11$new_obs, y = mlb11$runs, xlab = "New On-Base Plus Slugging Percentage", ylab = "Runs")

m10 <- lm(runs ~ new_obs, data = mlb11)

summary(m10)

## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16

New on-base plus slugging percentage (OBS) seems to be the best predictor of runs, with a multiple R-squared of 0.9349, ahead of new slugging percentage (0.8969) and new on-base percentage (0.8491). This makes sense, as OBS is a more complete batting statistic. It accounts for on-base percentage, which measures how often a batter gets on base, as well as slugging percentage, which measures how productive a batter is while accounting for power (meaning it weighs extra-base hits and home runs more than singles). The team with better batters should logically score more runs.
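
To see all ten models side by side, the multiple R-squared values printed in this lab can be collected and ranked. This sketch simply copies the rounded values from the summaries above; the variable names `r_squared` and `ranked` are chosen for illustration.

```r
# Multiple R-squared values copied from the summary() outputs in this lab
r_squared <- c(
  at_bats = 0.3729, homeruns = 0.6266, wins = 0.361, hits = 0.6419,
  bat_avg = 0.6561, strikeouts = 0.1694, stolen_bases = 0.002914,
  new_onbase = 0.8491, new_slug = 0.8969, new_obs = 0.9349
)

# Rank predictors from strongest to weakest fit
ranked <- sort(r_squared, decreasing = TRUE)
ranked

# The three strongest predictors of runs
names(ranked)[1:3]
```

The three new statistics take the top three spots, with new_obs first, and stolen bases explain almost none of the variability in runs.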

On Your Own 5

Plotting residuals of runs vs. new OBS

plot(m10$residuals ~ mlb11$new_obs, xlab = "New OBS", ylab = "M10 Residuals")
abline(h = 0, lty = 3)

Plotting histogram of residuals

hist(m10$residuals)