Linear regression

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.0     v purrr   0.3.3
## v tibble  3.0.0     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

The data

Let’s load up the data for the 2011 season.

load("more/mlb11.RData")

To display the relationship between runs and at_bats, we would use a scatterplot. The plot
does not reveal an obvious relationship, although a linear trend is plausible. Before using
a linear model to predict runs from at_bats, we need to confirm that the conditions for the
linear model are satisfied.

ggplot(mlb11, aes(at_bats, runs)) + geom_point()

If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.

cor(mlb11$runs, mlb11$at_bats)

## [1] 0.610627
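
For reference, the value returned by cor is the Pearson correlation coefficient. Writing \(x_i\) for at_bats and \(y_i\) for runs (notation assumed here for illustration), it is defined as

\[ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]

It always lies between -1 and 1, so a value of about 0.61 indicates a moderate positive linear association.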

Sum of squared residuals

plot_ss(x = mlb11$at_bats, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

The residuals are the difference between the observed values and the values predicted by the line:

\[ e_i = y_i - \hat{y}_i \]

The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, you can rerun the plot command and add the argument showSquares = TRUE.

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

Note that the output from the plot_ss function provides you with the slope and intercept of your line as well as the sum of squares.
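
The sum of squares that plot_ss reports can be reproduced by hand. The following is a minimal sketch, not the lab's own code: it uses R's built-in cars data as a stand-in for mlb11 so that it runs on its own, and ssr is a hypothetical helper name. It compares the sum of squared residuals for the least squares line with that of a deliberately shifted line.

```r
# Built-in 'cars' data stands in for mlb11 so the sketch runs on its own
# ssr() is a hypothetical helper: sum of squared residuals for a given line
ssr <- function(intercept, slope, x, y) {
  sum((y - (intercept + slope * x))^2)
}

# Least squares line fitted by lm()
fit <- lm(dist ~ speed, data = cars)
b <- coef(fit)

# Sum of squares for the least squares line and for a line shifted up by 5
ssr_fit   <- ssr(b[1], b[2], cars$speed, cars$dist)
ssr_other <- ssr(b[1] + 5, b[2], cars$speed, cars$dist)

ssr_fit < ssr_other
```

Any line other than the least squares line, such as the shifted one here, gives a larger sum of squares.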

The linear model

It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. Instead we can use the lm function in R to fit the linear model (a.k.a. regression line).

m1 <- lm(runs ~ at_bats, data = mlb11)

The first argument in the function lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of at_bats. The second argument specifies that R should look in the mlb11 data frame to find the runs and at_bats variables.

The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this information using the summary function.

summary(m1)

## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388

With this table, we can write down the least squares regression line for the linear model:

\[ \hat{y} = -2789.2429 + 0.6305 \times \mathit{at\_bats} \]
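
The estimates do not have to be copied from the printed table by hand. As a sketch, assuming the built-in cars data in place of mlb11 so the code is self-contained, coef returns the fitted intercept and slope, and coef(summary(...)) returns the full coefficient table, in which each t value is the estimate divided by its standard error.

```r
# Built-in 'cars' data stands in for mlb11 so the sketch runs on its own
fit <- lm(dist ~ speed, data = cars)

# Named vector with the intercept and slope
b <- coef(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
tab <- coef(summary(fit))

# Each t value is the estimate divided by its standard error
t_by_hand <- tab[, "Estimate"] / tab[, "Std. Error"]
all.equal(t_by_hand, tab[, "t value"])
```

The same relationship holds in the table above: 0.6305 / 0.1545 is about 4.08, the t value reported for at_bats.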

One last piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, \(R^2\). The \(R^2\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 37.3% of the variability in runs is explained by at-bats.
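
For a regression with a single explanatory variable, \(R^2\) equals the square of the correlation coefficient, and the adjusted value follows from \(R^2\), the number of observations, and the number of predictors. The sketch below checks both using only numbers printed above; n = 30 is inferred from the 28 residual degrees of freedom plus the two estimated coefficients.

```r
r  <- 0.610627   # correlation between runs and at_bats, from cor() above
r2 <- r^2        # R-squared for a one-predictor model

n <- 30          # observations: 28 residual df + 2 estimated coefficients
p <- 1           # number of explanatory variables

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(c(r2 = r2, adj_r2 = adj_r2), 4)
```

Both results round to the values in the summary output, 0.3729 and 0.3505.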

m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)

## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07

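Using the estimates from the output above, the least squares line for predicting runs from homeruns is:

\[ \hat{y} = 415.2389 + 1.8345 \times \mathit{homeruns} \]

The slope is positive: each additional home run is associated with an increase of about 1.83 in a team's predicted runs, so teams that hit more home runs tend to score more runs.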
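
To make the slope concrete, the fitted line can be wrapped in a small function. This is a sketch using the rounded coefficients printed in the summary output; predict_runs_hr is a hypothetical helper name, and the home run totals 150 and 160 are made-up inputs chosen only for illustration.

```r
# Hypothetical helper built from the rounded coefficients of the homeruns model
predict_runs_hr <- function(homeruns) 415.2389 + 1.8345 * homeruns

# Difference in predicted runs between two made-up home run totals, 10 apart
diff_10 <- predict_runs_hr(160) - predict_runs_hr(150)
diff_10
```

Ten additional home runs correspond to about 18 more predicted runs whatever the starting total, because the model is linear.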

Prediction and prediction errors

Let’s create a scatterplot with the least squares line laid on top.

plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)

The function abline plots a line based on its slope and intercept. Here, we used a shortcut by providing the model m1, which contains both parameter estimates. This line can be used to predict \(y\) at any value of \(x\). When predictions are made for values of \(x\) that are beyond the range of the observed data, it is referred to as extrapolation and is not usually recommended. However, predictions made within the range of the data are more reliable. They’re also used to compute the residuals.

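For a team with 5578 at-bats, the team manager would predict 728 runs from the regression
line, as calculated below. The closest observation in the data is the Philadelphia Phillies,
with 5579 at-bats and 713 runs, so the prediction is an overestimate of about 15 runs
(a residual of -15).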

y_hat <- -2789.2429 + 0.6305 * 5578
y_hat <- round(y_hat,0)
y_hat

## [1] 728

res <- 713-y_hat
res

## [1] -15
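
The hand calculation above uses rounded coefficients; the predict function does the same job with the full-precision estimates stored in the model. A minimal sketch, assuming the built-in cars data as a stand-in for mlb11 and two made-up speeds, also flags which predictions would be extrapolations:

```r
# Built-in 'cars' data stands in for mlb11 so the sketch runs on its own
fit <- lm(dist ~ speed, data = cars)

# Made-up speeds: one inside the observed range, one far outside it
new_x <- data.frame(speed = c(10, 40))
pred <- predict(fit, newdata = new_x)

# A prediction is an extrapolation when x falls outside the observed range
x_range <- range(cars$speed)
extrapolated <- new_x$speed < x_range[1] | new_x$speed > x_range[2]

data.frame(new_x, predicted_dist = pred, extrapolated)
```

With the model from this lab, the equivalent call would be predict(m1, newdata = data.frame(at_bats = 5578)). The column name in newdata has to match the explanatory variable in the model formula.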

Model diagnostics

To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

Linearity: There does not appear to be any pattern in the residual plot, which indicates that the relationship between runs and at-bats is approximately linear.

Nearly normal residuals: To check this condition, we can look at a histogram

hist(m1$residuals)

or a normal probability plot of the residuals.

qqnorm(m1$residuals)
qqline(m1$residuals)  # adds diagonal line to the normal prob plot

The points mostly follow the line, so the nearly normal condition appears to be met, although there is some deviation on both sides of the line.



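Constant variability: This condition is hard to judge from the scatterplot, but the residual plot supports it, since the spread of the residuals looks roughly constant across at-bats.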
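
The three diagnostic plots can also be drawn side by side for easier comparison. This is a sketch on the built-in cars data (a stand-in, since mlb11 is not bundled with R); for the model in this lab, fit would be m1 and the x variable would be mlb11$at_bats.

```r
# Built-in 'cars' data stands in for mlb11 so the sketch runs on its own
fit <- lm(dist ~ speed, data = cars)
res <- residuals(fit)   # same values as fit$residuals

# Arrange three plots in one row, remembering the previous settings
op <- par(mfrow = c(1, 3))

# (1) Linearity and (3) constant variability: residuals against x
plot(res ~ cars$speed, main = "Residuals vs. x")
abline(h = 0, lty = 3)

# (2) Nearly normal residuals: histogram and normal probability plot
hist(res, main = "Histogram of residuals")
qqnorm(res)
qqline(res)

# Restore the previous plotting layout
par(op)
```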

From the scatterplot below, there does appear to be a linear relationship between hits and runs.

m3 <- lm(runs ~ hits, data = mlb11)
summary(m3)

## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## hits           0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07

plot_ss(x = mlb11$hits, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   -375.5600       0.7589  
## 
## Sum of Squares:  70638.75

The \(R^2\) for the relationship between runs and at_bats is 37.3%.
The \(R^2\) for the relationship between runs and hits is 64.2%.

The variable hits does seem to predict runs better than at_bats. We can tell because the points lie closer to the line, and the total area of the squares (the sum of squares) drops from 123721.9 for at_bats to 70638.75 for hits.
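
The link between the sum of squares and \(R^2\) can be made explicit: \(R^2 = 1 - SS_{res} / SS_{tot}\), where \(SS_{tot}\) is the total variability in runs and is therefore the same for both models. The sketch below uses only numbers printed in the output above to back out \(SS_{tot}\) from each model; the variable names are chosen here for illustration.

```r
# Sum of squared residuals and R-squared, copied from the output above
ss_res_at_bats <- 123721.9
r2_at_bats     <- 0.3729
ss_res_hits    <- 70638.75
r2_hits        <- 0.6419

# Rearranging R^2 = 1 - SS_res / SS_tot gives SS_tot = SS_res / (1 - R^2)
ss_tot_at_bats <- ss_res_at_bats / (1 - r2_at_bats)
ss_tot_hits    <- ss_res_hits / (1 - r2_hits)

round(c(at_bats = ss_tot_at_bats, hits = ss_tot_hits))
```

The two totals agree up to rounding of the printed \(R^2\) values, so with the same response variable a smaller sum of squared residuals always means a larger \(R^2\).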

# Fit a simple regression of runs on each traditional variable
mm_1 <- lm(runs ~ homeruns, data = mlb11)
mm_2 <- lm(runs ~ bat_avg, data = mlb11)
mm_3 <- lm(runs ~ strikeouts, data = mlb11)
mm_4 <- lm(runs ~ stolen_bases, data = mlb11)
mm_5 <- lm(runs ~ wins, data = mlb11)
vars <- c("homeruns", "bat_avg", "strikeouts", "stolen_bases", "wins")
# Extract R-squared by name rather than by position in the summary list
r2 <- c(summary(mm_1)$r.squared, summary(mm_2)$r.squared, summary(mm_3)$r.squared,
        summary(mm_4)$r.squared, summary(mm_5)$r.squared)
# Rank the variables from highest to lowest R-squared
df <- tibble(var = vars, r2 = r2)
df <- df %>% arrange(desc(r2))
df

## # A tibble: 5 x 2
##   var               r2
##   <chr>          <dbl>
## 1 bat_avg      0.656  
## 2 homeruns     0.627  
## 3 wins         0.361  
## 4 strikeouts   0.169  
## 5 stolen_bases 0.00291

Of these five variables, bat_avg is the best predictor of runs, with the highest \(R^2\) of 65.6%.
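
Fitting one model per variable by hand becomes repetitive as the list grows. As an alternative sketch (not the approach used above), the models can be fitted in a loop with sapply and reformulate. The example uses the built-in mtcars data with mpg as the response so that it is self-contained; the variable list is chosen only for illustration.

```r
# Built-in 'mtcars' data stands in for mlb11 so the sketch runs on its own
vars <- c("wt", "hp", "qsec")

# For each variable, build the formula mpg ~ <variable>, fit it, and keep R-squared
r2 <- sapply(vars, function(v) {
  fit <- lm(reformulate(v, response = "mpg"), data = mtcars)
  summary(fit)$r.squared
})

# Named vector of R-squared values, best predictor first
sort(r2, decreasing = TRUE)
```

In this lab, the same pattern would apply with data = mlb11, response = "runs", and vars set to the column names used above.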

We can confirm this visually:

plot(mlb11$runs ~ mlb11$bat_avg)
abline(mm_2)

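# Fit a simple regression of runs on each of the newer statistics
mn_1 <- lm(runs ~ new_onbase, data = mlb11)
mn_2 <- lm(runs ~ new_slug, data = mlb11)
mn_3 <- lm(runs ~ new_obs, data = mlb11)
vars <- c("new_onbase", "new_slug", "new_obs")
# Extract R-squared by name rather than by position in the summary list
r2 <- c(summary(mn_1)$r.squared, summary(mn_2)$r.squared, summary(mn_3)$r.squared)
# Rank the variables from highest to lowest R-squared
df <- tibble(var = vars, r2 = r2)
df <- df %>% arrange(desc(r2))
df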

## # A tibble: 3 x 2
##   var           r2
##   <chr>      <dbl>
## 1 new_obs    0.935
## 2 new_slug   0.897
## 3 new_onbase 0.849

The new variables are more effective at predicting runs than the traditional variables, as the \(R^2\) values above show. The best predictor is on-base plus slugging (new_obs), with an \(R^2\) of 93.5%; its regression line is plotted below. This result makes sense because on-base plus slugging incorporates many more of the ways a player might get runs. It is calculated as the sum of on-base percentage and slugging percentage, which are themselves composed of the traditional variables.

As compared to the below articles on each new variable:

plot(mlb11$runs ~ mlb11$new_obs)
abline(mn_3)

As seen below, there is no apparent pattern in the residual plot, and the residuals are nearly normally distributed, as shown by the histogram and QQ plot.

plot(mn_3$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3)

hist(mn_3$residuals)