Exercise 1

What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?

I would use a scatterplot to see whether there is a clear relationship between at_bats and runs. The data look fairly linear, but the pattern is not conclusive. A deeper investigation or more data is needed to determine whether at_bats is a good predictor of runs. I would not be comfortable predicting the exact number of runs from at_bats, though I would be comfortable giving a ballpark estimate (pun intended).

p <-  ggplot(mlb11, aes(x = at_bats, y = runs)) +
      geom_point()
p 

cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627

Exercise 2

Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.

The relationship is roughly linear in form, with the points loosely scattered around the trend. The direction is positive: as the number of at_bats increases, the number of runs increases. The strength is moderate (the correlation is about 0.61), so the relationship is apparent but not strong enough to determine a clear dependency. There may be a couple of outliers, but nothing unusual enough to cause concern.
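To put a number on the strength, here is a small sketch that squares the correlation reported by cor() in Exercise 1. In simple linear regression with one predictor, the squared correlation equals the \(R^2\) of the fit, so this should agree with the Multiple R-squared that summary() reports for runs ~ at_bats. The variable names are my own.

```r
# Correlation between runs and at_bats, copied from the cor() output in Exercise 1
r <- 0.610627

# With a single predictor, R-squared is the squared correlation
r_squared <- r^2
round(r_squared, 4)
```

Roughly 37% of the variation in runs is accounted for by at_bats, which fits the description of a moderate relationship.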

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = T)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

Exercise 3

Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?

plot_ss would not work for me in the R Markdown document, so I had to run it directly from the console. I ran it multiple times, and the smallest sum of squares I got was 125342.5.
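As a rough check on how close the hand-picked line came, the sketch below compares my best sum of squares with the value in the plot_ss output shown under Exercise 2. I am assuming that output corresponds to the least squares line, since its coefficients match the lm fit for runs ~ at_bats. The variable names are my own.

```r
# Smallest sum of squares from the hand-picked lines (from my answer above)
ss_by_hand <- 125342.5

# Sum of squares from the plot_ss output under Exercise 2
# (assumed to be the least squares minimum)
ss_least_squares <- 123721.9

# Absolute and relative gap between the two
gap <- ss_by_hand - ss_least_squares
pct_above <- 100 * gap / ss_least_squares

round(c(gap = gap, pct_above = pct_above), 2)
```

Under that assumption, the best hand-picked line was only a little over one percent worse than the least squares line.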

m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388
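A couple of the numbers in this summary can be reproduced by hand, which helps show where they come from. The sketch below recomputes the residual standard error from the sum of squares in the earlier plot_ss output (assuming that value is the residual sum of squares of this same fit) and the slope's t value from its estimate and standard error. The variable names are my own.

```r
# Residual sum of squares (from the plot_ss output) and residual degrees of freedom
ss_resid <- 123721.9
df_resid <- 28          # 30 teams minus 2 estimated coefficients

# Residual standard error = sqrt(SS_resid / df)
rse <- sqrt(ss_resid / df_resid)

# t value for the slope = estimate / standard error
t_slope <- 0.6305 / 0.1545

round(c(rse = rse, t_slope = t_slope), 2)
```

Both values should line up with the 66.47 and 4.080 printed in the summary, up to rounding of the inputs.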

Exercise 4

Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?

\(\hat{y} = 415.239 + 1.835 \times \text{homeruns}\)

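The slope tells us that, on average, each additional home run is associated with about 1.8 more runs. In this context, a team is more successful than the model expects if it scores more runs than the line predicts for its home run total, that is, if its residual is positive.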
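To make the slope interpretation concrete, here is a minimal sketch that wraps the fitted equation in a function and compares two predictions one home run apart. The function name and the home run totals (150 and 151) are hypothetical values chosen only for illustration.

```r
# Fitted line for runs ~ homeruns, using the coefficients from the equation above
predict_runs_hr <- function(homeruns) {
  415.239 + 1.835 * homeruns
}

# Predicted runs for two hypothetical teams that differ by one home run
runs_150 <- predict_runs_hr(150)
runs_151 <- predict_runs_hr(151)

# The difference between the two predictions is the slope
runs_151 - runs_150
```

Whatever starting total is used, one extra home run always moves the prediction by the slope, 1.835 runs.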

plot_ss(x = mlb11$homeruns, y = mlb11$runs, showSquares = T)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     415.239        1.835  
## 
## Sum of Squares:  73671.99
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)

Exercise 5

If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

Not sure if I’m understanding this correctly. The prediction, or expected value, would be about 728 runs. The Phillies had 5579 at-bats (one more than in the question) and scored 713 runs. Compared to this, the prediction is an overestimate of about 15 runs, so the residual (observed minus predicted) is about -15.

run <- function(x){
  y <- -2789.2429 + (x)*0.6305
  return(y)
}
run(5578)
## [1] 727.6861
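The residual itself can be computed the same way. This is a small sketch using the rounded coefficients from the model summary and the Phillies' 713 runs as the observed value; the function and variable names are my own, and treating the Phillies as the comparison team is the same assumption made in the answer above.

```r
# Fitted line for runs ~ at_bats (rounded coefficients from the summary output)
predict_runs <- function(at_bats) {
  -2789.2429 + 0.6305 * at_bats
}

predicted <- predict_runs(5578)  # prediction for 5,578 at-bats
observed  <- 713                 # Phillies' runs (they had 5579 at-bats)

# Residual = observed - predicted; a negative value means the line overestimates
residual <- observed - predicted
round(residual, 2)
```

A negative residual confirms that the regression line predicts more runs than the team actually scored.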
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

Exercise 6

Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?

I see only a broad scatter. There isn’t a clear skew above or below the zero line, so there is no strong sign of non-linearity. The more tightly the residuals cluster around the zero line, the better the linear model fits the data, and here they are spread quite widely. It’s another example of why at_bats may not be the best predictor, or should not be the only predictor, of runs.

hist(m1$residuals)

qqnorm(m1$residuals)
qqline(m1$residuals) 

Exercise 7

Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?

Yes, the distribution of the residuals appears nearly normal, with the majority centered around the mean of zero. There does not seem to be any visible skew.

Exercise 8

Based on the plot in (1), does the constant variability condition appear to be met?

The first quartile of the residuals is -47.05 and the third quartile is 54.40. These are roughly the same distance from zero, and the variability around the zero line is roughly constant as well, so the condition appears to be met.

On Your Own

1

It seems the best predictor would be ‘hits’. There seems to be a decisive linear relationship between these two variables.

plot(mlb11$hits, mlb11$runs)
l <- lm(runs~hits, mlb11)
abline(l)

2

The \(R^2\) value is 0.37 for runs ~ at_bats and 0.64 for runs ~ hits. This means that the linear model runs ~ at_bats explains only 37% of the variance in runs, whereas runs ~ hits explains 64%.

summary(l)
## 
## Call:
## lm(formula = runs ~ hits, data = mlb11)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## hits           0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07
summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388
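As a consistency check on the two summaries, \(R^2\) can be recovered from the F-statistic. For a model with a single predictor, \(R^2 = F / (F + df)\), where df is the residual degrees of freedom. The sketch below applies this to the values printed above; the helper name r2_from_f is my own.

```r
# R-squared from the F-statistic of a one-predictor regression
r2_from_f <- function(f_stat, df_resid) {
  f_stat / (f_stat + df_resid)
}

r2_at_bats <- r2_from_f(16.65, 28)  # F and df from summary(m1)
r2_hits    <- r2_from_f(50.2, 28)   # F and df from summary(l)

round(c(at_bats = r2_at_bats, hits = r2_hits), 4)
```

The results should match the Multiple R-squared lines in the two summaries, so the larger F-statistic for hits is another way of seeing that hits explains more of the variance.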

3

bat_avg is the best predictor of runs. The data look linear, the first and third quartiles of the residuals are a similar distance from zero, and the \(R^2\) is a relatively high 0.66, which means the linear model explains much of the variance in runs.

plot(mlb11$bat_avg, mlb11$runs)
l <- lm(runs~bat_avg, mlb11)
abline(l)

summary(l)
## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08

4

These new variables are much better predictors. They are almost like multiple regression models because they take more factors into account. For example, instead of just recording the number of hits a player has, they record the number of bases he gets on the hit. If a player gets a hit and reaches third base, there is a higher probability that he will score than if he only reached first base.

I used geom_smooth to show the least squares regression line. These variables have a more linear relationship with runs than the other variables. The \(R^2\) values are also much higher; the highest is 0.93.

p1 <- ggplot(mlb11, aes(x = new_onbase, y = runs)) +
      geom_point() +
      xlab("On-Base Percentage") +
      ylab("Runs") +
      geom_smooth(method = "lm", se = F)

p2 <- ggplot(mlb11, aes(x = new_slug, y = runs)) +
      geom_point() +
      xlab("Slug Percentage") +
      ylab("Runs") +
      geom_smooth(method = "lm", se = F)

p3 <- ggplot(mlb11, aes(x = new_obs, y = runs)) +
      geom_point() +
      xlab("On-Base + Slugging") +
      ylab("Runs") +
      geom_smooth(method = "lm", se = F)

library(gridExtra)  # provides grid.arrange
grid.arrange(p1, p2, p3, ncol = 2)

Summary for New_OnBase

plot(mlb11$new_onbase, mlb11$runs)
l <- lm(runs~new_onbase, mlb11)
abline(l)

summary(l)
## 
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.270 -18.335   3.249  19.520  69.002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1118.4      144.5  -7.741 1.97e-08 ***
## new_onbase    5654.3      450.5  12.552 5.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared:  0.8491, Adjusted R-squared:  0.8437 
## F-statistic: 157.6 on 1 and 28 DF,  p-value: 5.116e-13

Summary for New_Slug

plot(mlb11$new_slug, mlb11$runs)
l <- lm(runs~new_slug, mlb11)
abline(l)

summary(l)
## 
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.41 -18.66  -0.91  16.29  52.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -375.80      68.71   -5.47 7.70e-06 ***
## new_slug     2681.33     171.83   15.61 2.42e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared:  0.8969, Adjusted R-squared:  0.8932 
## F-statistic: 243.5 on 1 and 28 DF,  p-value: 2.42e-15

Summary for New_OBS

plot(mlb11$new_obs, mlb11$runs)
l <- lm(runs~new_obs, mlb11)
abline(l)

summary(l)
## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16
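To compare the fits side by side, the sketch below collects the Multiple R-squared and residual standard error values printed in the summaries throughout this report and sorts them. The values are typed in by hand from the output, and the data frame and column names are my own.

```r
# R-squared and residual standard error copied from the summary() outputs
fits <- data.frame(
  predictor = c("at_bats", "hits", "bat_avg",
                "new_onbase", "new_slug", "new_obs"),
  r_squared = c(0.3729, 0.6419, 0.6561, 0.8491, 0.8969, 0.9349),
  resid_se  = c(66.47, 50.23, 49.23, 32.61, 26.96, 21.41)
)

# Sort from best to worst fit by R-squared
ranked <- fits[order(fits$r_squared, decreasing = TRUE), ]
ranked
```

Sorting by residual standard error instead gives the same order, since all six models are fit to the same 30 teams with one predictor each.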

5

From the lm summary, the ‘new_obs’ variable is the best predictor. The \(R^2\) value is the highest at 0.93, and the p-value is extremely small, meaning it plays an important role as a predictor of runs. The mean residual is essentially zero, although that holds for any least squares fit with an intercept.

l <- lm(mlb11$runs ~ mlb11$new_obs)
hist(l$residuals, breaks = 25)

plot(l$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

qqnorm(l$residuals)
qqline(l$residuals)

summary(l)
## 
## Call:
## lm(formula = mlb11$runs ~ mlb11$new_obs)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -686.61      68.93  -9.962 1.05e-10 ***
## mlb11$new_obs  1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16
mean(l$residuals)
## [1] 2.181126e-16
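The tiny mean residual above (about 2.2e-16) is floating-point zero. A least squares fit with an intercept always has residuals that average to zero, so this value does not by itself separate good predictors from poor ones. The sketch below illustrates this with simulated data; the numbers are hypothetical and are not drawn from mlb11.

```r
set.seed(1)

# Hypothetical predictor and two responses: one related to x, one pure noise
x         <- runif(30, min = 5400, max = 5700)
y_related <- -2800 + 0.63 * x + rnorm(30, sd = 65)
y_noise   <- rnorm(30, mean = 700, sd = 80)

fit_related <- lm(y_related ~ x)
fit_noise   <- lm(y_noise ~ x)

# Both mean residuals are zero up to floating-point error
c(related = mean(residuals(fit_related)),
  noise   = mean(residuals(fit_noise)))
```

This is why the comparison between predictors rests on \(R^2\), the residual standard error, and the residual plots rather than on the mean residual.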