download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
summary(mlb11)
## team runs at_bats hits
## Arizona Diamondbacks: 1 Min. :556.0 Min. :5417 Min. :1263
## Atlanta Braves : 1 1st Qu.:629.0 1st Qu.:5448 1st Qu.:1348
## Baltimore Orioles : 1 Median :705.5 Median :5516 Median :1394
## Boston Red Sox : 1 Mean :693.6 Mean :5524 Mean :1409
## Chicago Cubs : 1 3rd Qu.:734.0 3rd Qu.:5575 3rd Qu.:1441
## Chicago White Sox : 1 Max. :875.0 Max. :5710 Max. :1600
## (Other) :24
## homeruns bat_avg strikeouts stolen_bases
## Min. : 91.0 Min. :0.2330 Min. : 930 Min. : 49.00
## 1st Qu.:118.0 1st Qu.:0.2447 1st Qu.:1085 1st Qu.: 89.75
## Median :154.0 Median :0.2530 Median :1140 Median :107.00
## Mean :151.7 Mean :0.2549 Mean :1150 Mean :109.30
## 3rd Qu.:172.8 3rd Qu.:0.2602 3rd Qu.:1248 3rd Qu.:130.75
## Max. :222.0 Max. :0.2830 Max. :1323 Max. :170.00
##
## wins new_onbase new_slug new_obs
## Min. : 56.00 Min. :0.2920 Min. :0.3480 Min. :0.6400
## 1st Qu.: 72.00 1st Qu.:0.3110 1st Qu.:0.3770 1st Qu.:0.6920
## Median : 80.00 Median :0.3185 Median :0.3985 Median :0.7160
## Mean : 80.97 Mean :0.3205 Mean :0.3988 Mean :0.7191
## 3rd Qu.: 90.00 3rd Qu.:0.3282 3rd Qu.:0.4130 3rd Qu.:0.7382
## Max. :102.00 Max. :0.3490 Max. :0.4610 Max. :0.8100
##
I'd use a scatter plot, since both runs and at_bats are numerical variables.
library(ggplot2)
ggplot(mlb11, aes(x = at_bats, y = runs)) + geom_point(colour = "darkred")
These data look linear, but with a decent bit of noise.
cor(x=mlb11$at_bats,y=mlb11$runs)
## [1] 0.610627
The correlation coefficient is moderately high for a data set of this size (30 teams), and the relationship is positive.
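For a simple linear regression with a single predictor, R-squared equals the square of the correlation coefficient, so the correlation printed above already indicates how much of the variability in runs is explained by at_bats. A quick check, using only the printed value (the variable names `r` and `r_squared` are illustrative):

```r
# Correlation between at_bats and runs, copied from the output above
r <- 0.610627

# In simple linear regression, R-squared is the squared correlation
r_squared <- r^2
round(r_squared, 3)  # 0.373
```

So at_bats on its own accounts for roughly 37% of the variability in runs.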
The smallest sum of squares I got was 149683.9
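A line chosen by eye can be compared against the least-squares line, which minimizes the sum of squared residuals by construction. The sketch below assumes `mlb11` has been loaded as above. The helper `sum_sq_resid` and the object `fit_at_bats` are illustrative names, not part of the lab.

```r
# Sum of squared residuals for a fitted linear model
sum_sq_resid <- function(fit) {
  sum(residuals(fit)^2)
}

# Applied to the lab data, if mlb11 has been loaded as above
if (exists("mlb11")) {
  fit_at_bats <- lm(runs ~ at_bats, data = mlb11)
  print(sum_sq_resid(fit_at_bats))
}
```

Any hand-picked line should give a sum of squares at least as large as the value this prints.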
m1 <- lm(runs ~ at_bats, data = mlb11)
m2<-lm(runs~homeruns,data=mlb11)
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
Predicted runs = 415.24 + 1.8345 * homeruns
The slope is positive: each additional home run is associated with about 1.83 more runs on average.
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)
Predicted runs = -2789.2 + 0.6305 * 5578 = 727.7 runs
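The hand calculation above can be reproduced in R. The first part uses only the rounded coefficients quoted in the text. The second part is a sketch that assumes `mlb11` is loaded and refits the model under the illustrative name `fit_at_bats`.

```r
# Rounded coefficients of the runs ~ at_bats fit, as used in the hand calculation
intercept <- -2789.2
slope <- 0.6305

# Predicted runs for a team with 5,578 at-bats
predicted_runs <- intercept + slope * 5578
round(predicted_runs, 1)  # 727.7

# The same prediction from a fitted model, if mlb11 is loaded
if (exists("mlb11")) {
  fit_at_bats <- lm(runs ~ at_bats, data = mlb11)
  print(predict(fit_at_bats, newdata = data.frame(at_bats = 5578)))
}
```

The `predict()` result may differ slightly in the decimals because it uses unrounded coefficients.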
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)
abline(v=5578)
The residual is about -20, so the model overestimates runs at 5,578 at-bats by roughly 20.
There doesn't appear to be a pattern in the residual plot, which is consistent with a linear relationship plus noise.
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals)
Nearly normal, with perhaps some stepping in the Q-Q plot. The nearly normal residuals condition appears to be met.
The constant variability condition is judged from the residual plot rather than the Q-Q plot. It appears to be met, as the spread of the residuals is similar across the range of at_bats. Nothing extreme or noteworthy.
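Constant variability is usually checked with a plot of residuals against fitted values (or against the predictor), looking for a band of roughly even width. A minimal sketch, assuming `mlb11` is loaded; `plot_resid_vs_fitted` is an illustrative helper, not a function from the lab.

```r
# Plot residuals against fitted values to judge constant variability
plot_resid_vs_fitted <- function(fit) {
  plot(fitted(fit), residuals(fit),
       xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 3)  # dotted reference line at zero
  invisible(fit)
}

# Applied to the at_bats model, if mlb11 has been loaded as above
if (exists("mlb11")) {
  plot_resid_vs_fitted(lm(runs ~ at_bats, data = mlb11))
}
```

A fan or funnel shape in this plot would suggest the condition is violated.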
Batting average
plot(x=mlb11$bat_avg,y=mlb11$runs)
At first glance this appears very linear.
m3<-lm(runs~bat_avg,data=mlb11)
summary(m3)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
R^2 = 0.656, so this model explains more of the variability in runs than the at_bats model (R^2 of about 0.37).
m4<-lm(runs~hits,data=mlb11)
summary(m4)
##
## Call:
## lm(formula = runs ~ hits, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.718 -27.179 -5.233 19.322 140.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.5600 151.1806 -2.484 0.0192 *
## hits 0.7589 0.1071 7.085 1.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared: 0.6419, Adjusted R-squared: 0.6292
## F-statistic: 50.2 on 1 and 28 DF, p-value: 1.043e-07
m5<-lm(runs~strikeouts,data=mlb11)
summary(m5)
##
## Call:
## lm(formula = runs ~ strikeouts, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -132.27 -46.95 -11.92 55.14 169.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1054.7342 151.7890 6.949 1.49e-07 ***
## strikeouts -0.3141 0.1315 -2.389 0.0239 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76.5 on 28 degrees of freedom
## Multiple R-squared: 0.1694, Adjusted R-squared: 0.1397
## F-statistic: 5.709 on 1 and 28 DF, p-value: 0.02386
m6<-lm(runs~wins,data=mlb11)
summary(m6)
##
## Call:
## lm(formula = runs ~ wins, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -145.450 -47.506 -7.482 47.346 142.186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 342.121 89.223 3.834 0.000654 ***
## wins 4.341 1.092 3.977 0.000447 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.1 on 28 degrees of freedom
## Multiple R-squared: 0.361, Adjusted R-squared: 0.3381
## F-statistic: 15.82 on 1 and 28 DF, p-value: 0.0004469
m7<-lm(runs~stolen_bases,data=mlb11)
summary(m7)
##
## Call:
## lm(formula = runs ~ stolen_bases, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -139.94 -62.87 10.01 38.54 182.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 677.3074 58.9751 11.485 4.17e-12 ***
## stolen_bases 0.1491 0.5211 0.286 0.777
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 83.82 on 28 degrees of freedom
## Multiple R-squared: 0.002914, Adjusted R-squared: -0.0327
## F-statistic: 0.08183 on 1 and 28 DF, p-value: 0.7769
R^2 by predictor: at_bats = 0.37; bat_avg = 0.66; homeruns = 0.63; hits = 0.64; strikeouts = 0.17; wins = 0.36; stolen_bases = 0.003
Batting average best predicts runs, since it has the highest R^2 of the traditional variables.
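Fitting one model per predictor by hand is repetitive, so the R-squared comparison can also be done in a single pass. This is a sketch, assuming `mlb11` is loaded. The helper `r_squared_by_predictor` and the vector `traditional` are illustrative names.

```r
# R-squared of a simple linear regression of `response` on each predictor
r_squared_by_predictor <- function(data, response, predictors) {
  sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$r.squared
  })
}

# Applied to the traditional statistics, if mlb11 has been loaded as above
if (exists("mlb11")) {
  traditional <- c("at_bats", "hits", "homeruns", "bat_avg",
                   "strikeouts", "stolen_bases", "wins")
  print(sort(r_squared_by_predictor(mlb11, "runs", traditional),
             decreasing = TRUE))
}
```

Sorting in decreasing order puts the strongest single predictor first.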
plot(mlb11$runs ~ mlb11$bat_avg)
abline(m3)
New variables
m8<-lm(runs~new_onbase,data=mlb11)
summary(m8)
##
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.270 -18.335 3.249 19.520 69.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1118.4 144.5 -7.741 1.97e-08 ***
## new_onbase 5654.3 450.5 12.552 5.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared: 0.8491, Adjusted R-squared: 0.8437
## F-statistic: 157.6 on 1 and 28 DF, p-value: 5.116e-13
plot(mlb11$runs ~ mlb11$new_onbase)
abline(m8)
m9<-lm(runs~new_slug,data=mlb11)
summary(m9)
##
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.41 -18.66 -0.91 16.29 52.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.80 68.71 -5.47 7.70e-06 ***
## new_slug 2681.33 171.83 15.61 2.42e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared: 0.8969, Adjusted R-squared: 0.8932
## F-statistic: 243.5 on 1 and 28 DF, p-value: 2.42e-15
plot(mlb11$runs ~ mlb11$new_slug)
abline(m9)
m10<-lm(runs~new_obs,data=mlb11)
summary(m10)
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
plot(mlb11$runs ~ mlb11$new_obs)
abline(m10)
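new_onbase: R^2 = 0.85; new_slug: R^2 = 0.90; new_obs: R^2 = 0.93
These variables explain much more of the variability in runs than the traditional statistics, whose best R^2 was about 0.66 (bat_avg). Based on the plots and R^2, they are better predictors, and new_obs is the best of the three.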
Model diagnostics for OBS:
hist(m10$residuals)
qqnorm(m10$residuals)
qqline(m10$residuals)
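The histogram is nearly normal, and the points on the Q-Q plot stay close to the line, so the nearly normal residuals condition appears to be met. The Q-Q plot does not address constant variability; that condition would need to be checked with a plot of the residuals against new_obs.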