ANLY 502 Lab 9

Question 1

Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

Answer

plot(mlb11$hits, mlb11$runs)
m2 = (lm( mlb11$runs ~ mlb11$hits))
abline(m2)

summary(m2)

## 
## Call:
## lm(formula = mlb11$runs ~ mlb11$hits)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.718  -27.179   -5.233   19.322  140.693 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -375.5600   151.1806  -2.484   0.0192 *  
## mlb11$hits     0.7589     0.1071   7.085 1.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared:  0.6419, Adjusted R-squared:  0.6292 
## F-statistic:  50.2 on 1 and 28 DF,  p-value: 1.043e-07

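Yes. The scatterplot shows a positive, roughly linear relationship: runs increase with hits, and the fitted model explains about 64% of the variation in runs (R-squared = 0.6419).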

Question 2

How does this relationship compare to the relationship between runs and at_bats? Use the R² values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?

Answer

Both variables are significant predictors of runs, but the model with hits has a smaller residual standard error (50.23 vs. 66.47) and a higher R-squared (0.642 vs. 0.373). Hits therefore explains more of the variation in runs than at_bats does.
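The comparison can be reproduced by fitting both models and reading the two statistics directly from their summaries. The following is a minimal sketch: `fit_stats` is a hypothetical helper name, not part of the lab, and the example call assumes `mlb11` is already loaded in the workspace.

```r
# Hypothetical helper: fit response ~ predictor and return the residual
# standard error and R-squared taken from the model summary.
fit_stats <- function(data, response, predictor) {
  s <- summary(lm(reformulate(predictor, response = response), data = data))
  data.frame(predictor = predictor, sigma = s$sigma, r.squared = s$r.squared)
}

# Assumption: mlb11 has been loaded as elsewhere in this lab.
if (exists("mlb11")) {
  print(rbind(fit_stats(mlb11, "runs", "at_bats"),
              fit_stats(mlb11, "runs", "hits")))
}
```

The model with the larger `r.squared` and the smaller `sigma` is the better of the two fits.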

Question 3

Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we've discussed (for the sake of conciseness, only include output for the best variable, not all five).

Answer

# Regress runs on every other column; skip team (an identifier) and runs itself.
for (i in setdiff(names(mlb11), c("team", "runs"))) {
  s = summary(lm(mlb11$runs ~ mlb11[[i]]))
  print(c(i, s$sigma, s$r.squared))
}

## [1] "at_bats"           "66.4728377356063"  "0.372865390186805"
## [1] "hits"              "50.2276068343207"  "0.641938767239419"
## [1] "homeruns"          "51.2946624623989"  "0.626563569566283"
## [1] "bat_avg"           "49.225978762781"   "0.656077134646863"
## [1] "strikeouts"        "76.501648884868"   "0.169357932236313"
## [1] "stolen_bases"        "83.8166204175066"    "0.00291399266657394"
## [1] "wins"              "67.1002375324046"  "0.360971179446681"
## [1] "new_onbase"        "32.6062532720795"  "0.849105251446139"
## [1] "new_slug"          "26.9560074643486"  "0.896870368409638"
## [1] "new_obs"           "21.4123250562135"  "0.934927126351814"

m = (lm( mlb11$runs ~ mlb11$bat_avg))
summary(m)

## 
## Call:
## lm(formula = mlb11$runs ~ mlb11$bat_avg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -94.676 -26.303  -5.496  28.482 131.113 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -642.8      183.1  -3.511  0.00153 ** 
## mlb11$bat_avg   5242.2      717.3   7.308 5.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared:  0.6561, Adjusted R-squared:  0.6438 
## F-statistic: 53.41 on 1 and 28 DF,  p-value: 5.877e-08

plot(mlb11$bat_avg, mlb11$runs)
abline(m)

Of the traditional variables, batting average (bat_avg) is the best predictor of runs: it has the lowest residual standard error (49.23) and the highest R-squared (0.656).

Question 4

Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team's success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we've analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?

Answer

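According to the summary above, the three newer variables are more effective predictors than the traditional ones: each has a smaller residual standard error and a higher R-squared (0.849 for new_onbase, 0.897 for new_slug, and 0.935 for new_obs, against 0.656 for bat_avg, the best traditional variable). Of all ten variables, new_obs is the best predictor of runs.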
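For the graphical side of the evidence, one option is to draw the three scatterplots next to each other with their least-squares lines. This is only a sketch: `plot_fits` is a hypothetical helper introduced here for illustration, and the example call assumes `mlb11` is loaded.

```r
# Hypothetical helper: one scatterplot per predictor with the least-squares
# line overlaid. Returns the R-squared of each fit, named by predictor.
plot_fits <- function(data, response, predictors) {
  old <- par(mfrow = c(1, length(predictors)))
  on.exit(par(old))  # restore the previous plot layout on exit
  sapply(predictors, function(p) {
    m <- lm(reformulate(p, response = response), data = data)
    plot(data[[p]], data[[response]], xlab = p, ylab = response)
    abline(m)
    summary(m)$r.squared
  })
}

# Assumption: mlb11 has been loaded as elsewhere in this lab.
if (exists("mlb11")) {
  plot_fits(mlb11, "runs", c("new_onbase", "new_slug", "new_obs"))
}
```

Points that hug the fitted line more tightly correspond to the higher R-squared values reported in the numerical summary.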

Question 5

Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.

Answer

new_obs is the best predictor of runs, with the highest R-squared (0.935) and the lowest residual standard error (21.41) of all ten variables.
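The diagnostics themselves could be checked along the following lines. This is a sketch under stated assumptions, not output from the original analysis: `check_diagnostics` is a hypothetical helper, and the example call assumes `mlb11` is loaded. It draws a residuals-versus-fitted plot (linearity and constant variability), then a histogram and a normal Q-Q plot of the residuals (near-normality).

```r
# Hypothetical helper: three residual checks for a simple linear model.
check_diagnostics <- function(model) {
  res <- resid(model)
  old <- par(mfrow = c(1, 3))
  on.exit(par(old))  # restore the previous plot layout on exit
  plot(fitted(model), res, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 3)  # residuals should scatter evenly around zero
  hist(res, main = "", xlab = "Residuals")
  qqnorm(res, main = "")
  qqline(res)             # points close to the line suggest normal residuals
  invisible(res)
}

# Assumption: mlb11 has been loaded as elsewhere in this lab.
if (exists("mlb11")) {
  check_diagnostics(lm(runs ~ new_obs, data = mlb11))
}
```

A curved pattern or a fan shape in the first panel, or strong departures from the line in the Q-Q plot, would indicate that the conditions for the linear model are not met.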

Nischal Bondalapati