Excercises

Excercise 1

ggplot(mlb11) + geom_point(aes(x = at_bats, y = runs)) + geom_smooth(aes(x = at_bats, y = runs ), method = "lm")

We use a scatter plot for the data. Runs does appear to increase at a constant level as at bats increases. The relationship appears to be linear enough for a linear regression to be appropriate.

Excercise 2

There is a medium strong positive correlation between at bats and runs. There is not enough curvature to need a quadratic term. There is a major outlier at about 5520 at bats and 870 runs. This will exhibit some leverage because of it’s high residual, though the predictor is near the mean, so it won’t be an influential point. The teams with over 5700 at bats and 800 runs will be somewhat influential because they are relatively far from the fitted line and far from the mean of at bats. The same can be said for the teams around 5400 at bats and 550 runs.

Excercise 3

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

Excercise 4

model <- lm(runs ~ homeruns, data = mlb11)
summary(model)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07

\[runs\quad =\quad 1.8345\quad *\quad homeruns\quad +\quad 415.23\]

For every additional homerun a team hits in a season, we can expect them to score an additional 1.8 runs

Excercise 5

runs_predicted <- 0.6305 * 5578 - 2789.2429 
runs_predicted
## [1] 727.6861
residual <- runs_predicted - 713 #closest match
residual
## [1] 14.6861

It appears to be an over estimate given that the closest match for at bats scored 713 runs. However teams near that amount had run totals of 708, 713, and 787, which would average out to about the amount predicted by the model. I believe saying that a model is an overestime based on one point estimate is a logical error.

Excercise 6

m1 <- lm(runs ~ at_bats, data = mlb11)
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)

When examening the residual plot, there does not seem to be any quadratic or higher order pattern. This indicates a linear relationship.

Excercise 7

hist(m1$residuals)

qqnorm(m1$residuals)
qqline(m1$residuals)

It’s a little difficult to judge normality with only 30 samples, but there doesn’t appear to be too much deviation from normality. There are no major outliers that would indicate heavy tails.

Excercise 8

Constant variability is also met, as there isn’t much of a difference in the spread of residuals between various values of the predictor variable.

On Your Own

Problem 1

ggplot(mlb11) + geom_point(aes(x = hits, y = runs)) + geom_smooth(aes(x = hits, y = runs ), method = "lm")

There appears to be a strong linear relationship between hits and runs. Many of the points lie close to the line, and there doesn’t appear to be any curved trend.

Problem 2

model2 <- lm(runs ~ hits, data = mlb11)
summary(model)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07

The R squared is twice as large as at bats’ R squared. The plot looked better, and the t value is higher. This appears to be a better fit.

Problem 3

model <- lm(runs ~ bat_avg, data = mlb11)

summary(model)
## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08
plot(mlb11$runs, mlb11$bat_avg)
abline(model)

Problem 4

The newer statistics were all much better predictors of runs. The best was OPS

ggplot(mlb11) + geom_point(aes(x = new_obs, y = runs)) + geom_smooth(aes(x = new_obs, y = runs ), method = "lm")

model <- lm(runs ~ new_obs, data = mlb11)

summary(model)
## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16

This makes sense because OPS combines the power measuring ability of slugging percentage with on base percentage. This is the best model fit I have ever seen, even with contrived exercises

Probem 5

ggplot(mlb11, aes(new_obs, y = model$residuals)) + geom_point() +
  geom_hline(yintercept = 0)

ggplot(mlb11) + geom_histogram(aes(x = model$residuals), fill = "darkblue", color = "black", binwidth = 10)

qqnorm(model$residuals)
qqline(model$residuals)

The model diagnostics confirm the assumptions are satisfied.