Grando 7 Lab

if (Sys.info()["sysname"] == "Windows") {
    setwd("~/Masters/DATA606/Week7/Lab/Lab7")
} else {
    setwd("~/Documents/Masters/DATA606/Week7/Lab/Lab7")
}
load("more/mlb11.RData")
require(ggplot2)
## Loading required package: ggplot2

Exercise 1 - What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?

Answer:

Since runs are also a numerical variable, I would use a scatter plot to display each data point.

ggplot(mlb11, aes(x = at_bats, y = runs)) + geom_point() + geom_smooth(method = lm, 
    fullrange = TRUE)

To determine whether a linear regression line is a good fit for runs vs. at_bats, we should look at the correlation coefficient and its square (r-squared):

cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627
cor(mlb11$runs, mlb11$at_bats)^2
## [1] 0.3728654

The r-squared value for the linear regression line is 0.37, which means that the model accounts for approximately 37% of the variance in runs; this does not indicate a good fit. The residual plot below does not seem to show a pattern but, given the r-squared value, the model still does not appear to be a good fit.
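The link between the squared correlation coefficient and the R-squared reported by summary() can be checked directly. The following is a minimal sketch on a small made-up data frame; the `toy` values and the names `toy_fit`, `r_squared_from_cor`, and `r_squared_from_lm` are assumptions for illustration and are not taken from mlb11.

```r
# Hypothetical toy data; the values are illustrative only, not from mlb11.
toy <- data.frame(
    at_bats = c(5400, 5450, 5500, 5550, 5600, 5650),
    runs    = c(640, 700, 655, 720, 690, 760)
)

# Squared correlation between the response and the predictor
r_squared_from_cor <- cor(toy$runs, toy$at_bats)^2

# R-squared reported by the fitted simple linear model
toy_fit <- lm(runs ~ at_bats, data = toy)
r_squared_from_lm <- summary(toy_fit)$r.squared

# For a simple linear regression with an intercept, the two values agree
all.equal(r_squared_from_cor, r_squared_from_lm)
```

This is why the 0.3728654 computed with cor() matches the "Multiple R-squared" of 0.3729 in the model summary.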

e1.lm <- lm(runs ~ at_bats, data = mlb11)
summary(e1.lm)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388
residuals <- resid(e1.lm)
residuals <- data.frame(cbind(mlb11$at_bats, residuals))
names(residuals) <- c("at_bats", "residuals")
ggplot(residuals, aes(y = residuals, x = at_bats)) + geom_point() + 
    geom_hline(yintercept = 0)

Exercise 2 - Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.

Answer:

There appears to be a weak positive linear relationship between runs and at bats. The relationship is positive because the regression line has a positive slope. I consider it weak given the r-squared value of 0.37, and a linear form appears appropriate since there were no obvious patterns in the residuals plot.

Exercise 3 - Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?

Answer:

plot_ss(x = mlb11$at_bats, y = mlb11$runs)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##  -2789.2429       0.6305  
## 
## Sum of Squares:  123721.9

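The best run I was able to produce by hand is summarized below. Its sum of squares (139329.3) is larger than the 123721.9 reported above for the least squares line.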

Call: lm(formula = y ~ x, data = pts)

Coefficients:
(Intercept)            x
 -4284.0472       0.9026

Sum of Squares: 139329.3
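The quantity that plot_ss reports can be reproduced for any candidate line. The sketch below is not the plot_ss implementation; it uses made-up `x` and `y` values and a hypothetical helper named `sum_of_squares` to compare a hand-picked line with the least squares line from lm().

```r
# Hypothetical toy data; illustrative only, not from mlb11.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared residuals for a line with the given intercept and slope
sum_of_squares <- function(intercept, slope, x, y) {
    sum((y - (intercept + slope * x))^2)
}

# A hand-picked line, such as one chosen by clicking two points
ss_guess <- sum_of_squares(intercept = 0.5, slope = 1.8, x = x, y = y)

# The least squares line from lm()
fit <- lm(y ~ x)
ss_lm <- sum_of_squares(coef(fit)[["(Intercept)"]], coef(fit)[["x"]], x, y)

# The least squares fit never has a larger sum of squares than another line
ss_lm <= ss_guess
```

No hand-drawn line can beat the value that lm() produces, so repeated attempts with plot_ss can only approach that minimum.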

Exercise 4 - Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?

Answer:

m1 <- lm(runs ~ homeruns, data = mlb11)
summary(m1)
## 
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.615 -33.410   3.231  24.292 104.631 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 415.2389    41.6779   9.963 1.04e-10 ***
## homeruns      1.8345     0.2677   6.854 1.90e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared:  0.6266, Adjusted R-squared:  0.6132 
## F-statistic: 46.98 on 1 and 28 DF,  p-value: 1.9e-07

The equation for the least squares regression line is:

\[\widehat{runs} = 415.24 + 1.83 \times homeruns\]

The slope of the line is 1.83, which means that for each additional home run, the model predicts about 1.83 additional runs. Teams that hit more home runs therefore tend to score more runs.
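As a worked illustration of the slope, take the estimate 1.8345 from the output above, rounded to 1.83, and compare the predictions for two teams whose home run totals differ by 10. The difference of 10 is an arbitrary value chosen for illustration; the intercept cancels out:

\[\Delta \widehat{runs} = 1.83 \times \Delta homeruns = 1.83 \times 10 = 18.3\]

So the model predicts roughly 18 more runs for the team with 10 more home runs.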

Exercise 5 - If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

Answer:

m1 <- lm(runs ~ at_bats, data = mlb11)
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)

summary(m1)
## 
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -125.58  -47.05  -16.59   54.40  176.87 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2789.2429   853.6957  -3.267 0.002871 ** 
## at_bats         0.6305     0.1545   4.080 0.000339 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared:  0.3729, Adjusted R-squared:  0.3505 
## F-statistic: 16.65 on 1 and 28 DF,  p-value: 0.0003388
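# Extract the fitted intercept and slope by name
m1_int <- coef(m1)[["(Intercept)"]]
m1_slope <- coef(m1)[["at_bats"]]
# Predicted runs for a team with 5,578 at-bats
m1_predict <- m1_int + 5578 * m1_slope
# No team has exactly 5,578 at-bats; the closest observation has 5,579
m1_data <- mlb11[mlb11$at_bats == 5579, "runs"]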

Since the intercept is -2789.24 and the slope is 0.63, the linear regression line would predict 727.96 runs from 5,578 at bats. The closest data point to 5,578 at bats has 5,579 at bats and 713 runs. Therefore the prediction is an overestimate by 14.96 runs.
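The same calculation can be done with predict() instead of assembling the coefficients by hand. The sketch below runs on a small made-up data frame: only the 5,579 at-bats / 713 runs row comes from the text above, and the other `toy` values and all variable names are assumptions for illustration. The residual is taken as observed minus predicted.

```r
# Hypothetical toy data; only the (5579, 713) row comes from the lab text.
toy <- data.frame(
    at_bats = c(5400, 5500, 5579, 5600, 5700),
    runs    = c(650, 700, 713, 690, 760)
)
toy_fit <- lm(runs ~ at_bats, data = toy)

# Predicted runs for a team with 5,578 at-bats
new_team <- data.frame(at_bats = 5578)
predicted <- unname(predict(toy_fit, newdata = new_team))

# Residual = observed - predicted, using the nearest observed team
observed <- toy$runs[which.min(abs(toy$at_bats - 5578))]
residual <- observed - predicted

# A negative residual means the line overestimates; a positive one, underestimates
c(predicted = predicted, observed = observed, residual = residual)
```

With the lab's numbers the residual is 713 - 727.96 = -14.96, and the negative sign corresponds to the overestimate described above.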

Exercise 6 - Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?

Answer:

There does not appear to be a pattern in the residuals plot. This indicates that a linear model is a reasonable choice for these data points since there does not appear to be an obvious non-linear relationship.

Exercise 7 - Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?

Answer:

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)
m1_alt <- data.frame(cbind(mlb11$at_bats, m1$residuals))
names(m1_alt) <- c("at_bats", "residuals")

hist(m1$residuals)

qqnorm(m1$residuals)
qqline(m1$residuals)

Yes, the histogram appears to have a roughly normal shape. However, the normal probability plot suggests that the residuals may have short tails, which means that their distribution could be narrower than a normal distribution.
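A rough numerical companion to reading the normal probability plot by eye is the correlation between the sample quantiles and the theoretical normal quantiles. This is a supplementary sketch, not part of the lab: the helper `qq_correlation` and both input vectors are hypothetical, and the inputs are constructed values rather than the residuals of m1.

```r
# Hypothetical inputs, for illustration only (not residuals from m1)
n <- 30
normal_like  <- qnorm(ppoints(n))           # exactly normal-shaped sample
short_tailed <- seq(-1, 1, length.out = n)  # evenly spread, tails too short

# Correlation between the sample and theoretical quantiles of a normal Q-Q plot;
# values close to 1 mean the points fall near a straight line
qq_correlation <- function(values) {
    qq <- qqnorm(values, plot.it = FALSE)
    cor(qq$x, qq$y)
}

qq_correlation(normal_like)
qq_correlation(short_tailed)
```

The short-tailed sample gives a lower correlation than the normal-shaped one, which mirrors the bend away from the qqline at both ends of the plot.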

Exercise 8 - Based on the plot in (1), does the constant variability condition appear to be met?

Answer:

Yes, the variability appears to be the same regardless of the number of at bats.

Question 1 - Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

Answer:

I will pick bat_avg to compare to runs, since a team that registers more hits (a higher batting average) would seem likely to score more runs.

ggplot(mlb11, aes(y = runs, x = bat_avg)) + geom_point() + geom_smooth(method = lm, 
    fullrange = TRUE)

bat.cor <- cor(mlb11$bat_avg, mlb11$runs)^2
lm.bat <- lm(runs ~ bat_avg, data = mlb11)
summary(lm.bat)
## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08
residuals <- resid(lm.bat)
residuals <- data.frame(cbind(mlb11$bat_avg, residuals))
names(residuals) <- c("bat_avg", "residuals")
ggplot(residuals, aes(y = residuals, x = bat_avg)) + geom_point() + 
    geom_hline(yintercept = 0)

Yes, given the scatter plot and the r-squared value of 0.66 (a correlation coefficient of about 0.81), it appears there is a linear relationship between batting average and runs.

Question 2 - How does this relationship compare to the relationship between runs and at_bats? Use the \(R^2\) values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?

Answer:

This relationship is stronger than the relationship between runs and at_bats. The r-squared value for at_bats vs. runs was 0.373, whereas the r-squared value for bat_avg vs. runs is 0.656, so bat_avg explains about 66% of the variance in runs compared with about 37% for at_bats.

Question 3 - Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).

Answer:

# # at_bats
# m1 <- lm(runs ~ at_bats, data = mlb11)
# cor(mlb11$at_bats, mlb11$runs)^2
# summary(m1)
# ggplot(mlb11, aes(y = runs, x = at_bats)) + geom_point() +
#     geom_smooth(method = lm, fullrange = TRUE)
# hist(m1$residuals)
# qqnorm(m1$residuals)
# qqline(m1$residuals)
#
# # hits
# m2 <- lm(runs ~ hits, data = mlb11)
# cor(mlb11$hits, mlb11$runs)^2
# summary(m2)
# ggplot(mlb11, aes(y = runs, x = hits)) + geom_point() +
#     geom_smooth(method = lm, fullrange = TRUE)
# hist(m2$residuals)
# qqnorm(m2$residuals)
# qqline(m2$residuals)
#
# # homeruns
# m3 <- lm(runs ~ homeruns, data = mlb11)
# cor(mlb11$homeruns, mlb11$runs)^2
# summary(m3)
# ggplot(mlb11, aes(y = runs, x = homeruns)) + geom_point() +
#     geom_smooth(method = lm, fullrange = TRUE)
# hist(m3$residuals)
# qqnorm(m3$residuals)
# qqline(m3$residuals)

# batting average
m4 <- lm(runs ~ bat_avg, data = mlb11)
cor(mlb11$bat_avg, mlb11$runs)^2
## [1] 0.6560771
summary(m4)
## 
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -642.8      183.1  -3.511  0.00153 ** 
## bat_avg       5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08
ggplot(mlb11, aes(y = runs, x = bat_avg)) + geom_point() + geom_smooth(method = lm, 
    fullrange = TRUE)

hist(m4$residuals)

qqnorm(m4$residuals)
qqline(m4$residuals)

residuals <- resid(m4)
residuals <- data.frame(cbind(mlb11$bat_avg, residuals))
names(residuals) <- c("bat_avg", "residuals")
ggplot(residuals, aes(y = residuals, x = bat_avg)) + geom_point() + 
    geom_hline(yintercept = 0)

# # strikeouts
# m5 <- lm(runs ~ strikeouts, data = mlb11)
# cor(mlb11$strikeouts, mlb11$runs)^2
# summary(m5)
# ggplot(mlb11, aes(y = runs, x = strikeouts)) + geom_point() +
#     geom_smooth(method = lm, fullrange = TRUE)
# hist(m5$residuals)
# qqnorm(m5$residuals)
# qqline(m5$residuals)
#
# # stolen bases
# m6 <- lm(runs ~ stolen_bases, data = mlb11)
# cor(mlb11$stolen_bases, mlb11$runs)^2
# summary(m6)
# ggplot(mlb11, aes(y = runs, x = stolen_bases)) + geom_point() +
#     geom_smooth(method = lm, fullrange = TRUE)
# hist(m6$residuals)
# qqnorm(m6$residuals)
# qqline(m6$residuals)
#
# # wins
# m7 <- lm(runs ~ wins, data = mlb11)
# cor(mlb11$wins, mlb11$runs)^2
# summary(m7)
# ggplot(mlb11, aes(y = runs, x = wins)) + geom_point() +
#     geom_smooth(method = lm, fullrange = TRUE)
# hist(m7$residuals)
# qqnorm(m7$residuals)
# qqline(m7$residuals)

It appears that batting average is the traditional variable that best predicts runs with a linear regression model. The summary of this model is displayed above, and all other analyses have been commented out, as requested.
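Comparing several predictors does not require a separate block of code per variable. The following is a minimal sketch of one way to collect the r-squared of each simple model in a single pass; the `toy` data frame and the helper `r_squared_for` are hypothetical and stand in for mlb11 and its columns.

```r
# Hypothetical toy data; illustrative only, not from mlb11.
toy <- data.frame(
    runs       = c(610, 650, 700, 720, 760, 800),
    bat_avg    = c(0.240, 0.248, 0.255, 0.262, 0.266, 0.275),
    strikeouts = c(1200, 1050, 1180, 1100, 1220, 1080)
)

# R-squared of the simple linear model runs ~ <predictor>
r_squared_for <- function(predictor, data) {
    fit <- lm(reformulate(predictor, response = "runs"), data = data)
    summary(fit)$r.squared
}

# Every column except the response is treated as a candidate predictor
predictors <- setdiff(names(toy), "runs")
r_squared <- sapply(predictors, r_squared_for, data = toy)

# Highest R-squared first
sort(r_squared, decreasing = TRUE)
```

A ranking like this only narrows the choice; the residual plot, histogram, and normal probability plot still need to be checked for the variable that comes out on top.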

Question 4 - Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team’s success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?

Answer:

# on base
m8 <- lm(runs ~ new_onbase, data = mlb11)
cor(mlb11$new_onbase, mlb11$runs)^2
## [1] 0.8491053
summary(m8)
## 
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.270 -18.335   3.249  19.520  69.002 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1118.4      144.5  -7.741 1.97e-08 ***
## new_onbase    5654.3      450.5  12.552 5.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared:  0.8491, Adjusted R-squared:  0.8437 
## F-statistic: 157.6 on 1 and 28 DF,  p-value: 5.116e-13
ggplot(mlb11, aes(y = runs, x = new_onbase)) + geom_point() + 
    geom_smooth(method = lm, fullrange = TRUE)

hist(m8$residuals)

qqnorm(m8$residuals)
qqline(m8$residuals)

residuals <- resid(m8)
residuals <- data.frame(cbind(mlb11$new_onbase, residuals))
names(residuals) <- c("new_onbase", "residuals")
ggplot(residuals, aes(y = residuals, x = new_onbase)) + geom_point() + 
    geom_hline(yintercept = 0)

# slug
m9 <- lm(runs ~ new_slug, data = mlb11)
cor(mlb11$new_slug, mlb11$runs)^2
## [1] 0.8968704
summary(m9)
## 
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -45.41 -18.66  -0.91  16.29  52.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -375.80      68.71   -5.47 7.70e-06 ***
## new_slug     2681.33     171.83   15.61 2.42e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared:  0.8969, Adjusted R-squared:  0.8932 
## F-statistic: 243.5 on 1 and 28 DF,  p-value: 2.42e-15
ggplot(mlb11, aes(y = runs, x = new_slug)) + geom_point() + geom_smooth(method = lm, 
    fullrange = TRUE)

hist(m9$residuals)

qqnorm(m9$residuals)
qqline(m9$residuals)

residuals <- resid(m9)
residuals <- data.frame(cbind(mlb11$new_slug, residuals))
names(residuals) <- c("new_slug", "residuals")
ggplot(residuals, aes(y = residuals, x = new_slug)) + geom_point() + 
    geom_hline(yintercept = 0)

# obs
m10 <- lm(runs ~ new_obs, data = mlb11)
cor(mlb11$new_obs, mlb11$runs)^2
## [1] 0.9349271
summary(m10)
## 
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.456 -13.690   1.165  13.935  41.156 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -686.61      68.93  -9.962 1.05e-10 ***
## new_obs      1919.36      95.70  20.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared:  0.9349, Adjusted R-squared:  0.9326 
## F-statistic: 402.3 on 1 and 28 DF,  p-value: < 2.2e-16
ggplot(mlb11, aes(y = runs, x = new_obs)) + geom_point() + geom_smooth(method = lm, 
    fullrange = TRUE)

hist(m10$residuals)

qqnorm(m10$residuals)
qqline(m10$residuals)

residuals <- resid(m10)
residuals <- data.frame(cbind(mlb11$new_obs, residuals))
names(residuals) <- c("new_obs", "residuals")
ggplot(residuals, aes(y = residuals, x = new_obs)) + geom_point() + 
    geom_hline(yintercept = 0)

All three new statistics are better predictors of runs (using linear regression) than the best old variable. As can be seen from the summaries above, the r-squared values of the new variables (0.849 for new_onbase, 0.897 for new_slug, and 0.935 for new_obs) all exceed the 0.656 of the best old variable, bat_avg, and the residual plots show that a linear regression appears to be appropriate. The best predictor of runs appears to be new_obs (on-base plus slugging). This makes sense since it is a combination of on-base percentage and slugging, both of which have been shown to correlate highly with runs on their own. While combining the two is not guaranteed to yield a better correlation, the results show that it does in this instance.

Question 5 - Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.

Answer:

Model diagnostics have been provided as a response to question 4.