## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.3
## v tibble 3.0.0 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Let’s load up the data for the 2011 season.
A scatterplot is the natural choice here. The plot reveals no obvious relationship, although it could plausibly be linear. Before using a linear model to make predictions, we should confirm that its conditions are satisfied.
If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.
## [1] 0.610627
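The coefficient above is the kind of value returned by R's cor function. As a minimal sketch, the snippet below computes a correlation on a small made-up data frame (the toy values are hypothetical and are not the mlb11 data) and checks it against the definition in terms of covariance and standard deviations. With the lab data, the analogous call would presumably be cor(mlb11$runs, mlb11$at_bats).

```r
# Hypothetical toy data standing in for mlb11 (values are made up for illustration)
toy <- data.frame(
  at_bats = c(5400, 5450, 5500, 5550, 5600, 5650),
  runs    = c(620, 700, 650, 730, 690, 760)
)

# Pearson correlation between predictor and response
r <- cor(toy$at_bats, toy$runs)
r

# The same value from the definition: covariance scaled by both standard deviations
r_manual <- cov(toy$at_bats, toy$runs) / (sd(toy$at_bats) * sd(toy$runs))
all.equal(r, r_manual)
```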
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
Each residual is the difference between an observed value and the value predicted by the line:
\[ e_i = y_i - \hat{y}_i \]
The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, you can rerun the plot command and add the argument showSquares = TRUE.
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
Note that the output from the plot_ss function provides you with the slope and intercept of your line as well as the sum of squares.
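The sum of squares can also be reproduced by hand, which makes it clearer what is being minimized. The sketch below uses made-up toy values (not the mlb11 data) and a hypothetical helper named sum_sq; it compares an arbitrary guessed line with the least squares line from lm.

```r
# Hypothetical toy data (made-up values, not the mlb11 data)
x <- c(5400, 5450, 5500, 5550, 5600, 5650)
y <- c(620, 700, 650, 730, 690, 760)

# Sum of squared residuals for a line with the given intercept and slope
sum_sq <- function(intercept, slope, x, y) {
  y_hat <- intercept + slope * x   # predicted values
  e <- y - y_hat                   # residuals e_i = y_i - y_hat_i
  sum(e^2)
}

# An arbitrary guess at a line
guess_ss <- sum_sq(intercept = -1500, slope = 0.4, x, y)

# The least squares line from lm()
fit <- lm(y ~ x)
b <- unname(coef(fit))
ls_ss <- sum_sq(b[1], b[2], x, y)

c(guess = guess_ss, least_squares = ls_ss)
```

No other choice of intercept and slope can give a smaller sum of squares than the least squares line, so the second value is never larger than the first.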
It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. Instead we can use the lm function in R to fit the linear model (a.k.a. regression line).
The first argument in the function lm is a formula that takes the form y ~ x. Here it can be read that we want to make a linear model of runs as a function of at_bats. The second argument specifies that R should look in the mlb11 data frame to find the runs and at_bats variables.
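As a rough illustration of the two arguments, the sketch below fits the same kind of model on simulated data. The data frame toy and the model name m_toy are hypothetical stand-ins; the simulated values only mimic the scale of the lab data and are not the mlb11 observations.

```r
# Simulated stand-in for mlb11 (hypothetical values, for illustration only)
set.seed(1)
toy <- data.frame(at_bats = round(runif(30, 5400, 5700)))
toy$runs <- round(-2789 + 0.63 * toy$at_bats + rnorm(30, sd = 66))

# First argument: formula of the form response ~ predictor
# Second argument: the data frame that holds both columns
m_toy <- lm(runs ~ at_bats, data = toy)

# Estimated intercept and slope
coef(m_toy)
```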
The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this information using the summary function.
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
With this table, we can write down the least squares regression line for the linear model:
\[ \hat{y} = -2789.2429 + 0.6305 \times \mathit{at\_bats} \]
One last piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, \(R^2\). The \(R^2\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 37.3% of the variability in runs is explained by at-bats.
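For a regression with a single predictor, \(R^2\) equals the square of the correlation coefficient, so the 0.3729 in the summary can be checked against the correlation of 0.610627 reported earlier. The sketch below does that arithmetic, then verifies the identity, along with the "proportion of variability explained" definition, on simulated data (hypothetical values, not mlb11).

```r
# Correlation reported earlier in the lab output
r <- 0.610627
round(r^2, 4)   # compare with Multiple R-squared: 0.3729

# The same identity on simulated data (hypothetical values, not mlb11)
set.seed(1)
x <- runif(30, 5400, 5700)
y <- -2789 + 0.63 * x + rnorm(30, sd = 66)
fit <- lm(y ~ x)

sse <- sum(resid(fit)^2)        # variability left unexplained by the line
sst <- sum((y - mean(y))^2)     # total variability in the response
r2_by_hand <- 1 - sse / sst

all.equal(r2_by_hand, summary(fit)$r.squared)
all.equal(cor(x, y)^2, summary(fit)$r.squared)
```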
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
\[ \hat{y} = 415.2389 + 1.8345 \times \mathit{homeruns} \]
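The slope tells us that each additional home run is associated with an increase of about 1.83 runs, on average. The slope is positive, so teams that hit more home runs tend to score more runs.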
Let’s create a scatterplot with the least squares line laid on top.
The function abline plots a line based on its slope and intercept. Here, we used a shortcut by providing the model m1, which contains both parameter estimates. This line can be used to predict \(y\) at any value of \(x\). When predictions are made for values of \(x\) that are beyond the range of the observed data, it is referred to as extrapolation and is not usually recommended. However, predictions made within the range of the data are more reliable. They’re also used to compute the residuals.
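A minimal sketch of the overlay, together with a simple guard against extrapolation, is given here. It runs on simulated data rather than mlb11; the names toy, m_toy, and is_extrapolation are hypothetical and chosen only for this illustration.

```r
# Simulated stand-in for mlb11 (hypothetical values, for illustration only)
set.seed(1)
toy <- data.frame(at_bats = round(runif(30, 5400, 5700)))
toy$runs <- round(-2789 + 0.63 * toy$at_bats + rnorm(30, sd = 66))
m_toy <- lm(runs ~ at_bats, data = toy)

# Scatterplot with the least squares line drawn on top
plot(runs ~ at_bats, data = toy)
abline(m_toy)

# Hypothetical helper: TRUE when a new x lies outside the observed range
is_extrapolation <- function(x_new, x_obs) {
  x_new < min(x_obs) | x_new > max(x_obs)
}

# 5500 lies inside the simulated range; 6500 lies well outside it
is_extrapolation(c(5500, 6500), toy$at_bats)
```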
From the calculations below, the team manager would predict about 728 runs. For comparison, the Philadelphia Phillies had 5,579 at-bats and scored 713 runs, so the prediction is an overestimate of 15 runs (a residual of -15).
## [1] 728
## [1] -15
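The two numbers above can be reproduced from the fitted coefficients. The sketch below assumes the prediction was made at the Phillies' 5,579 at-bats, which the source does not state explicitly; with the fitted model in hand, predict(m1, newdata = data.frame(at_bats = 5579)) would be the more usual route.

```r
# Coefficients copied from the summary output for runs ~ at_bats
b0 <- -2789.2429
b1 <- 0.6305

# Assumption: the prediction is made at the Phillies' 5579 at-bats
at_bats_new <- 5579
predicted <- round(b0 + b1 * at_bats_new)
predicted

# Residual = observed - predicted; the Phillies scored 713 runs
observed <- 713
observed - predicted
```

A negative residual means the observed value falls below the line, that is, the model overestimates.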
To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
Linearity: There doesn't appear to be a pattern in the plot of the residuals, which suggests that the linearity condition is met.
Nearly normal residuals: To check this condition, we can look at a histogram or a normal probability plot of the residuals.
The points largely follow the line, so the condition appears to be met, although there is some deviation on both sides of the line.
Constant variability: This is hard to tell from the scatterplot alone, but the residual plot suggests the condition is met.
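The three checks can be sketched with base R graphics as follows. The example runs on simulated data (hypothetical values, not mlb11), and the object names toy, m_toy, and res are placeholders for this illustration only.

```r
# Simulated stand-in for mlb11 (hypothetical values, for illustration only)
set.seed(1)
toy <- data.frame(at_bats = round(runif(30, 5400, 5700)))
toy$runs <- round(-2789 + 0.63 * toy$at_bats + rnorm(30, sd = 66))
m_toy <- lm(runs ~ at_bats, data = toy)
res <- resid(m_toy)

# (1) Linearity and (3) constant variability: residuals against the predictor,
# with a dashed reference line at zero
plot(toy$at_bats, res, xlab = "at_bats", ylab = "Residuals")
abline(h = 0, lty = 3)

# (2) Nearly normal residuals: histogram and normal probability plot
hist(res)
qqnorm(res)
qqline(res)
```

A curved band in the first plot would argue against linearity, and a fan shape would argue against constant variability.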
From the scatterplot below, there does appear to be a linear relationship between hits and runs.
##
## Call:
## lm(formula = runs ~ hits, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.718 -27.179 -5.233 19.322 140.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.5600 151.1806 -2.484 0.0192 *
## hits 0.7589 0.1071 7.085 1.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared: 0.6419, Adjusted R-squared: 0.6292
## F-statistic: 50.2 on 1 and 28 DF, p-value: 1.043e-07
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -375.5600 0.7589
##
## Sum of Squares: 70638.75
The \(R^2\) for the relationship between runs and at_bats is 37.3%.
The \(R^2\) for the relationship between runs and hits is 64.2%.
The variable hits does seem to predict runs better than at_bats. We can tell because the points lie closer to the line and the sum of squared residuals shown above is smaller than before (70,638.75 versus 123,721.9).
# Fit a simple regression of runs on each traditional predictor
mm_1 <- lm(runs ~ homeruns, data = mlb11)
mm_2 <- lm(runs ~ bat_avg, data = mlb11)
mm_3 <- lm(runs ~ strikeouts, data = mlb11)
mm_4 <- lm(runs ~ stolen_bases, data = mlb11)
mm_5 <- lm(runs ~ wins, data = mlb11)
vars <- c("homeruns", "bat_avg", "strikeouts", "stolen_bases", "wins")
# Extract R-squared by name rather than by list position ([[8]])
r2 <- sapply(list(mm_1, mm_2, mm_3, mm_4, mm_5), function(m) summary(m)$r.squared)
# Rank the predictors from highest to lowest R-squared
df <- tibble(var = vars, r2 = r2) %>% arrange(desc(r2))
df
## # A tibble: 5 x 2
## var r2
## <chr> <dbl>
## 1 bat_avg 0.656
## 2 homeruns 0.627
## 3 wins 0.361
## 4 strikeouts 0.169
## 5 stolen_bases 0.00291
Based on the above calculations, bat_avg is the best predictor of runs, with the highest \(R^2\) at 65.6%.
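The same ranking can be produced without writing out one lm call per predictor. The sketch below is one possible approach, not the method used in this report: it uses R's built-in mtcars data as a stand-in (mpg plays the role of the response), builds each formula with reformulate, and collects \(R^2\) by name.

```r
# Built-in mtcars data used as a stand-in; mpg plays the role of the response
predictors <- c("wt", "hp", "qsec")

# Fit response ~ predictor for each candidate and pull out R-squared by name
r2 <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "mpg"), data = mtcars)
  summary(fit)$r.squared
})

# Rank predictors from highest to lowest R-squared
r2_sorted <- sort(r2, decreasing = TRUE)
r2_sorted
```

Swapping in mlb11, "runs", and the vector of traditional variable names should give the same table as above, assuming those column names match the data frame.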
We can confirm this visually:
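# Fit a simple regression of runs on each of the newer statistics
mn_1 <- lm(runs ~ new_onbase, data = mlb11)
mn_2 <- lm(runs ~ new_slug, data = mlb11)
mn_3 <- lm(runs ~ new_obs, data = mlb11)
vars <- c("new_onbase", "new_slug", "new_obs")
# Extract R-squared by name rather than by list position ([[8]])
r2 <- sapply(list(mn_1, mn_2, mn_3), function(m) summary(m)$r.squared)
# Rank the predictors from highest to lowest R-squared
df <- tibble(var = vars, r2 = r2) %>% arrange(desc(r2))
df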
## # A tibble: 3 x 2
## var r2
## <chr> <dbl>
## 1 new_obs 0.935
## 2 new_slug 0.897
## 3 new_onbase 0.849
The new variables are more effective at predicting runs than the old variables, as seen from the \(R^2\) values above. The best predictor is on-base plus slugging, with an \(R^2\) of 93.5%; its regression line is plotted below. This result makes sense because on-base plus slugging incorporates many more of the ways a player might generate runs. It is calculated as the sum of on-base percentage and slugging percentage, which are themselves composed of the traditional variables.
As compared to the below articles on each new variable:
As seen below, there is no pattern in the residuals plot, and the histogram and Q-Q plot show that the residuals are approximately normally distributed.