Introduction to Linear Regression

2011 Baseball Season data
download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
Exercise 1 - Plotting the relationship between at_bats and runs. This relationship does look linear.
plot(mlb11$at_bats, mlb11$runs, 
     pch = 19,        # solid circle
     cex = .9,       # 90% of default size
     col = "#cc0000", # red
     main = "Relationship between at_bats & runs", 
     xlab = "at_bats", 
     ylab = "home runs")

Correlation coefficient for the relationship between at_bats & runs: 0.61

A correlation of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and a value of 0 indicates no linear relationship at all.

cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627

Sum of squared residuals

Exercise 2: Describe the relationship between the two variables in terms of form, direction, and strength.
The data appear to be right-skewed, and the relationship looks positive and moderate in strength, with a somewhat exponential (curved) form.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
mlb11 %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = at_bats, y = runs))
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

mlb11 %>%
  ggplot() +
  geom_histogram(mapping = aes(x = at_bats))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Note that there are 30 residuals, one for each of the 30 observations.
Recall that the residuals are the difference between the observed values and the values predicted by the line.
plot_ss(x = mlb11$at_bats, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9
The most common way to do linear regression is to select the line that minimizes the sum of squared residuals.
To visualize the squared residuals, you can rerun the plot command and add the argument showSquares = TRUE.
Note that the output from the plot_ss function provides you with the slope and intercept of your line as well as the sum of squares.
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9
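To make the sum-of-squares idea concrete, here is a minimal sketch (assuming mlb11 is already loaded) that computes the residuals and their sum of squares by hand, using the intercept and slope reported above:
line_intercept <- -2789.2429              # intercept reported by plot_ss above
line_slope     <- 0.6305                  # slope reported by plot_ss above
predicted      <- line_intercept + line_slope * mlb11$at_bats
resids         <- mlb11$runs - predicted  # residual = observed - predicted
sum(resids^2)                             # should be close to the 123721.9 reported above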
Exercise 3: Using plot_ss, choose a line that does a good job of minimizing the sum of squares.
Run the function several times. What was the smallest sum of squares that you got? The smallest sum of squares I obtained was 139224.4.
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

The Linear Model

It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error.
Instead, we can use the lm function in R to fit the linear model (a.k.a. the regression line).
m1 <- lm(runs ~ at_bats, data = mlb11)
The first argument in the function lm is a formula that takes the form y ~ x.
Here it can be read that we want to make a linear model of runs as a function of at_bats.
The second argument specifies that R should look in the mlb11 data frame to find the runs and at_bats variables.
The output of lm is an object that contains all of the information we need about the linear model that was just fit.
We can access this information using the summary function.
summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388
Let’s consider this output piece by piece. First, the formula used to describe the model is shown at the top.
After the formula you find the five-number summary of the residuals. The “Coefficients” table shown next is key;
its first column displays the linear model’s y-intercept and the coefficient of at_bats.
With this table, we can write down the least squares regression line for the linear model:
ŷ = −2789.2429 + 0.6305 × at_bats
One last piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, R^2.
The R^2 value represents the proportion of variability in the response variable that is explained by the explanatory variable.
For this model, 37.3% of the variability in runs is explained by at-bats.
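As a quick sketch of how these numbers can be pulled out of the fitted object directly, rather than read off the printed summary:
coef(m1)                 # intercept and slope used in the regression equation above
summary(m1)$r.squared    # proportion of variability in runs explained by at_bats (~0.373)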
Exercise 4: Fit a new model that uses homeruns to predict runs.
Using the estimates from the R output, write the equation of the regression line.
What does the slope tell us in the context of the relationship between success of a team and its home runs?
plot_ss(x = mlb11$homeruns, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     415.239        1.835  
## 
## Sum of Squares:  73671.99
plot_ss(x = mlb11$homeruns, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     415.239        1.835  
## 
## Sum of Squares:  73671.99
m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07
ŷ = 415.2389 + 1.8345 × homeruns. For this model, 62.7% of the variability in runs is explained by homeruns. The slope tells us that each additional home run is associated with roughly 1.83 additional runs for the team, on average.
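As a quick check of that slope interpretation, a minimal sketch; the home-run totals 100 and 101 are arbitrary values chosen only for illustration:
# Increasing homeruns by 1 changes the predicted runs by the slope (~1.83)
diff(predict(m2, newdata = data.frame(homeruns = c(100, 101))))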

Prediction and Prediction errors

Let’s create a scatterplot with the least squares line laid on top.
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)

The function abline plots a line based on its slope and intercept. Here, we used a shortcut by providing the model m1, which contains both parameter estimates.
This line can be used to predict y at any value of x. When predictions are made for values of x that are beyond the range of the observed data, it is referred to
as extrapolation and is not usually recommended. However, predictions made within the range of the data are more reliable. They’re also used to compute the residuals.
Exercise 5: If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats?
Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?
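One way to work out this prediction, sketched with the m1 fit from above (the comparison team is simply whichever team in mlb11 has an at-bat total closest to 5,578):
predict(m1, newdata = data.frame(at_bats = 5578))   # -2789.24 + 0.6305 * 5578, i.e. roughly 728 runs
closest <- which.min(abs(mlb11$at_bats - 5578))     # team with the nearest observed at-bat total
mlb11$runs[closest] - predict(m1, newdata = mlb11[closest, ])   # residual: observed minus predicted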

Model diagnostics: To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.

Linearity: You already checked if the relationship between runs and at-bats is linear using a scatterplot.
We should also verify this condition with a plot of the residuals vs. at-bats.
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

Exercise 6: Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?
Nearly normal residuals: To check this condition, we can look at a histogram
hist(m1$residuals)

or a normal probability plot of the residuals.
qqnorm(m1$residuals)
qqline(m1$residuals)  # adds diagonal line to the normal prob plot

Exercise 7: Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met? Yes, both plots suggest the residuals are close to normally distributed, though with somewhat thinner tails than a true normal distribution.
Constant variability:
Exercise 8: Based on the plot in (1), does the constant variability condition appear to be met? Yes, the residuals appear to show roughly constant variability across the range of at_bats.

On your OWN:

#1 - Plotting the relationship between stolen_bases and runs. This doesn't show a very strong linear relationship; the correlation coefficient is only 0.054.
plot(mlb11$stolen_bases, mlb11$runs, 
     pch = 19,        # solid circle
     cex = .9,       # 90% of default size
     col = "#cc0000", # red
     main = "Relationship between stolen_bases & runs", 
     xlab = "stolen_bases", 
     ylab = "runs")

cor(mlb11$runs, mlb11$stolen_bases)
## [1] 0.05398141
mlb11 %>%
  ggplot() +
  geom_point(mapping = aes(x = stolen_bases, y = runs), color = "red") +
  geom_smooth(mapping = aes(x = stolen_bases, y = runs), method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

#2
m3 <- lm(runs ~ stolen_bases, data = mlb11)
summary(m3)
## 
## Call:
## lm(formula = runs ~ stolen_bases, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -139.94  -62.87   10.01   38.54  182.49 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  677.3074    58.9751  11.485 4.17e-12 ***
## stolen_bases   0.1491     0.5211   0.286    0.777    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.82 on 28 degrees of freedom
## Multiple R-squared:  0.002914,   Adjusted R-squared:  -0.0327 
## F-statistic: 0.08183 on 1 and 28 DF,  p-value: 0.7769
For this model, only 0.29% of the variability in runs is explained by stolen_bases, compared with 37.3% explained by at_bats. This model also has a much lower correlation coefficient. Therefore, at_bats is a better predictor of runs than stolen_bases.
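A quick numeric check of that comparison, using the m1 and m3 fits from above:
c(at_bats = summary(m1)$r.squared, stolen_bases = summary(m3)$r.squared)   # ~0.373 vs ~0.003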
#3: Based on the graphical and numerical methods applied to the variables below, the best predictor of runs is batting average (bat_avg).
Variable 1 - hits
scatterplot: shows a fairly strong positive, roughly linear relationship.
plot(mlb11$hits, mlb11$runs, 
     pch = 19,        # solid circle
     cex = .9,       # 90% of default size
     col = "#cc0000", # red
     main = "Relationship between hits & runs", 
     xlab = "hits", 
     ylab = "runs")

correlation coefficient: 0.801, stronger than that of at_bats
cor(mlb11$runs, mlb11$hits)
## [1] 0.8012108
scatterplot with fitted line
mlb11 %>%
  ggplot() +
  geom_point(mapping = aes(x = hits, y = runs), color = "red") +
  geom_smooth(mapping = aes(x = hits, y = runs), method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Linear model: R^2 = 0.64, i.e., 64% of the variability in runs is explained by the variable hits.
m4 <- lm(runs ~ hits, data = mlb11)
summary(m4)
## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## hits           0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07
Variable 2 - batting avg
scatterplot: Also seems to show a good linear relationship between bat_avg and runs.
plot(mlb11$bat_avg, mlb11$runs, 
     pch = 19,        # solid circle
     cex = .9,       # 90% of default size
     col = "#cc0000", # red
     main = "Relationship between bat_avg & runs", 
     xlab = "bat_avg", 
     ylab = "runs")

correlation coefficient: 0.81, stronger than that of at_bats and slightly stronger than that of hits.
cor(mlb11$runs, mlb11$bat_avg)
## [1] 0.8099859
scatterplot with fitted line
mlb11 %>%
  ggplot() +
  geom_point(mapping = aes(x = bat_avg, y = runs), color = "red") +
  geom_smooth(mapping = aes(x = bat_avg, y = runs), method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Linear model: R^2 = 0.65, i.e., 65% of the variability in runs is explained by the variable bat_avg.
m5 <- lm(runs ~ bat_avg, data = mlb11)
summary(m5)
## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08
Variable 3 - strikeouts
scatterplot: There appears to be a slight negative (inverse) relationship between strikeouts and runs.
plot(mlb11$strikeouts, mlb11$runs, 
     pch = 19,        # solid circle
     cex = .9,       # 90% of default size
     col = "#cc0000", # red
     main = "Relationship between strikeouts & runs", 
     xlab = "strikeouts", 
     ylab = "runs")

correlation coefficient: The value of -0.41 supports the impression from the scatterplot of a modest negative correlation between strikeouts and runs.
cor(mlb11$runs, mlb11$strikeouts)
## [1] -0.4115312
scatterplot with fitted line
mlb11 %>%
  ggplot() +
  geom_point(mapping = aes(x = strikeouts, y = runs), color = "red") +
  geom_smooth(mapping = aes(x = strikeouts, y = runs), method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Linear model: R^2 = 0.169, i.e., 16.9% of the variability in runs is explained by the variable strikeouts.
m6 <- lm(runs ~ strikeouts, data = mlb11)
summary(m6)
## 
## Call:
## lm(formula = runs ~ strikeouts, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -132.27  -46.95  -11.92   55.14  169.76 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1054.7342   151.7890   6.949 1.49e-07 ***
## strikeouts    -0.3141     0.1315  -2.389   0.0239 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76.5 on 28 degrees of freedom
## Multiple R-squared:  0.1694, Adjusted R-squared:  0.1397 
## F-statistic: 5.709 on 1 and 28 DF,  p-value: 0.02386
Variable 4 - wins
scatterplot: shows a positive relationship between wins and runs.
plot(mlb11$wins, mlb11$runs, 
     pch = 19,        # solid circle
     cex = .9,       # 90% of default size
     col = "#cc0000", # red
     main = "Relationship between wins & runs", 
     xlab = "wins", 
     ylab = "runs")

correlation coefficient: moderately strong, at 0.60
cor(mlb11$runs, mlb11$wins)
## [1] 0.6008088
scatterplot with fitted line
mlb11 %>%
  ggplot() +
  geom_point(mapping = aes(x = wins, y = runs), color = "red") +
  geom_smooth(mapping = aes(x = wins, y = runs), method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Linear model: 36.1% of the variability in runs can be explained by the variable wins.
m7 <- lm(runs ~ wins, data = mlb11)
summary(m7)
## 
## Call:
## lm(formula = runs ~ wins, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -145.450  -47.506   -7.482   47.346  142.186 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  342.121     89.223   3.834 0.000654 ***
## wins           4.341      1.092   3.977 0.000447 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 67.1 on 28 degrees of freedom
## Multiple R-squared:  0.361,  Adjusted R-squared:  0.3381 
## F-statistic: 15.82 on 1 and 28 DF,  p-value: 0.0004469
Variable 5 - homeruns
scatterplot: positive linear relationship between homeruns and runs.
plot(mlb11$homeruns, mlb11$runs, 
     pch = 19,        # solid circle
     cex = .9,       # 90% of default size
     col = "#cc0000", # red
     main = "Relationship between homeruns & runs", 
     xlab = "homeruns", 
     ylab = "runs")

correlation coefficient: a strong 0.79
cor(mlb11$runs, mlb11$homeruns)
## [1] 0.7915577
scatterplot with fitted line
mlb11 %>%
  ggplot() +
  geom_point(mapping = aes(x = homeruns, y = runs), color = "red") +
  geom_smooth(mapping = aes(x = homeruns, y = runs), method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Linear model: R^2 = 0.626, i.e., about 63% of the variability in runs is explained by the variable homeruns.
m8 <- lm(runs ~ homeruns, data = mlb11)
summary(m8)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07
#4: The statistics used in Moneyball (on-base percentage, slugging percentage, and on-base plus slugging) are more effective at predicting runs than the old variables.
Based on all the predictors analyzed, on-base plus slugging ("new_obs") is the best predictor of runs; see the comparison sketch after the model summaries below.
Variable 1 - on base
scatterplot: seems to have a strong relationship between new_onbase & runs.
plot(mlb11$new_onbase, mlb11$runs, 
     pch = 19,        # solid circle
     cex = .9,       # 90% of default size
     col = "#cc0000", # red
     main = "Relationship between new_onbase & runs", 
     xlab = "new_onbase", 
     ylab = "runs")

correlation coefficient: very strong correlation coefficient of 0.92
cor(mlb11$runs, mlb11$new_onbase)
## [1] 0.9214691
scatterplot with fitted line: the points fall very close to the regression line (small residuals)
mlb11 %>%
  ggplot() +
  geom_point(mapping = aes(x = new_onbase, y = runs), color = "red") +
  geom_smooth(mapping = aes(x = new_onbase, y = runs), method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Linear model: R^2 = 0.849, i.e., about 85% of the variability in runs is explained by the variable new_onbase.
m9 <- lm(runs ~ new_onbase, data = mlb11)
summary(m9)
## 
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.270 -18.335   3.249  19.520  69.002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1118.4      144.5  -7.741 1.97e-08 ***
## new_onbase    5654.3      450.5  12.552 5.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared:  0.8491, Adjusted R-squared:  0.8437 
## F-statistic: 157.6 on 1 and 28 DF,  p-value: 5.116e-13
Variable 2 - slugging %
scatterplot: strong relationship between new_slug and runs
plot(mlb11$new_slug, mlb11$runs, 
     pch = 19,        # solid circle
     cex = .9,       # 90% of default size
     col = "#cc0000", # red
     main = "Relationship between new_slug & runs", 
     xlab = "new_slug", 
     ylab = "runs")

correlation coefficient: 0.95, very strong!
cor(mlb11$runs, mlb11$new_slug)
## [1] 0.9470324
scatterplot with fitted line:
mlb11 %>%
  ggplot() +
  geom_point(mapping = aes(x = new_slug, y = runs), color = "red") +
  geom_smooth(mapping = aes(x = new_slug, y = runs), method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Linear model: R^2 = 0.896, i.e., about 90% of the variability in runs is explained by the variable new_slug.
m10 <- lm(runs ~ new_slug, data = mlb11)
summary(m10)
## 
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.41 -18.66  -0.91  16.29  52.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -375.80      68.71   -5.47 7.70e-06 ***
## new_slug     2681.33     171.83   15.61 2.42e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared:  0.8969, Adjusted R-squared:  0.8932 
## F-statistic: 243.5 on 1 and 28 DF,  p-value: 2.42e-15
Variable 3 - on base slugging
scatterplot: strong correlation between new_obs and runs.
plot(mlb11$new_obs, mlb11$runs, 
     pch = 19,        # solid circle
     cex = .9,       # 90% of default size
     col = "#cc0000", # red
     main = "Relationship between new_obs & runs", 
     xlab = "new_obs", 
     ylab = "runs")

correlation coefficient: strong, at 0.96
cor(mlb11$runs, mlb11$new_obs)
## [1] 0.9669163
scatterplot with fitted line
mlb11 %>%
  ggplot() +
  geom_point(mapping = aes(x = new_obs, y = runs), color = "red") +
  geom_smooth(mapping = aes(x = new_obs, y = runs), method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Linear model: R^2 = 0.934, i.e., 93.4% of the variability in runs is explained by the variable new_obs.
m11 <- lm(runs ~ new_obs, data = mlb11)
summary(m11)
## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16
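To back up this comparison in one place, here is a sketch that refits a simple regression of runs on each candidate predictor used in this lab and collects the R^2 values (the column names are assumed to match those used above):
predictors <- c("at_bats", "hits", "bat_avg", "strikeouts", "stolen_bases",
                "wins", "homeruns", "new_onbase", "new_slug", "new_obs")
r2 <- sapply(predictors, function(v)
  summary(lm(reformulate(v, response = "runs"), data = mlb11))$r.squared)
sort(round(r2, 3), decreasing = TRUE)   # new_obs should come out on top (~0.935)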
#5: Prediction & prediction errors + diagnostics
checking for linearity
plot(mlb11$runs ~ mlb11$new_obs)
abline(m11)

plot(m11$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

checking for normality of residuals: approximately normal
hist(m11$residuals)

qqnorm(m11$residuals)
qqline(m11$residuals)  # adds diagonal line to the normal prob plot
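As an optional cross-check of these conditions, base R also provides built-in diagnostic plots for any lm object; a minimal sketch for m11:
par(mfrow = c(2, 2))   # arrange the diagnostic plots in a 2x2 grid
plot(m11)              # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))   # reset the plotting layout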