download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
mlb11
## team runs at_bats hits homeruns bat_avg strikeouts
## 1 Texas Rangers 855 5659 1599 210 0.283 930
## 2 Boston Red Sox 875 5710 1600 203 0.280 1108
## 3 Detroit Tigers 787 5563 1540 169 0.277 1143
## 4 Kansas City Royals 730 5672 1560 129 0.275 1006
## 5 St. Louis Cardinals 762 5532 1513 162 0.273 978
## 6 New York Mets 718 5600 1477 108 0.264 1085
## 7 New York Yankees 867 5518 1452 222 0.263 1138
## 8 Milwaukee Brewers 721 5447 1422 185 0.261 1083
## 9 Colorado Rockies 735 5544 1429 163 0.258 1201
## 10 Houston Astros 615 5598 1442 95 0.258 1164
## 11 Baltimore Orioles 708 5585 1434 191 0.257 1120
## 12 Los Angeles Dodgers 644 5436 1395 117 0.257 1087
## 13 Chicago Cubs 654 5549 1423 148 0.256 1202
## 14 Cincinnati Reds 735 5612 1438 183 0.256 1250
## 15 Los Angeles Angels 667 5513 1394 155 0.253 1086
## 16 Philadelphia Phillies 713 5579 1409 153 0.253 1024
## 17 Chicago White Sox 654 5502 1387 154 0.252 989
## 18 Cleveland Indians 704 5509 1380 154 0.250 1269
## 19 Arizona Diamondbacks 731 5421 1357 172 0.250 1249
## 20 Toronto Blue Jays 743 5559 1384 186 0.249 1184
## 21 Minnesota Twins 619 5487 1357 103 0.247 1048
## 22 Florida Marlins 625 5508 1358 149 0.247 1244
## 23 Pittsburgh Pirates 610 5421 1325 107 0.244 1308
## 24 Oakland Athletics 645 5452 1330 114 0.244 1094
## 25 Tampa Bay Rays 707 5436 1324 172 0.244 1193
## 26 Atlanta Braves 641 5528 1345 173 0.243 1260
## 27 Washington Nationals 624 5441 1319 154 0.242 1323
## 28 San Francisco Giants 570 5486 1327 121 0.242 1122
## 29 San Diego Padres 593 5417 1284 91 0.237 1320
## 30 Seattle Mariners 556 5421 1263 109 0.233 1280
## stolen_bases wins new_onbase new_slug new_obs
## 1 143 96 0.340 0.460 0.800
## 2 102 90 0.349 0.461 0.810
## 3 49 95 0.340 0.434 0.773
## 4 153 71 0.329 0.415 0.744
## 5 57 90 0.341 0.425 0.766
## 6 130 77 0.335 0.391 0.725
## 7 147 97 0.343 0.444 0.788
## 8 94 96 0.325 0.425 0.750
## 9 118 73 0.329 0.410 0.739
## 10 118 56 0.311 0.374 0.684
## 11 81 69 0.316 0.413 0.729
## 12 126 82 0.322 0.375 0.697
## 13 69 71 0.314 0.401 0.715
## 14 97 79 0.326 0.408 0.734
## 15 135 86 0.313 0.402 0.714
## 16 96 102 0.323 0.395 0.717
## 17 81 79 0.319 0.388 0.706
## 18 89 80 0.317 0.396 0.714
## 19 133 94 0.322 0.413 0.736
## 20 131 81 0.317 0.413 0.730
## 21 92 63 0.306 0.360 0.666
## 22 95 72 0.318 0.388 0.706
## 23 108 72 0.309 0.368 0.676
## 24 117 74 0.311 0.369 0.680
## 25 155 91 0.322 0.402 0.724
## 26 77 89 0.308 0.387 0.695
## 27 106 80 0.309 0.383 0.691
## 28 85 86 0.303 0.368 0.671
## 29 170 71 0.305 0.349 0.653
## 30 125 67 0.292 0.348 0.640
I would use a scatterplot to display the relationship between runs and one of the other numerical variables.
plot(mlb11$runs ~ mlb11$at_bats, main = "Relationship between Runs and atBats", xlab = "At Bats", ylab = "Runs")
The relationship looks moderately linear, but it is not strong enough to comfortably use a linear model to predict the number of runs.
Since the relationship is roughly linear, we can quantify its strength with the correlation coefficient.
cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627
The relationship between runs and at bats is positive but only moderately strong, as the correlation coefficient of 0.610627 is well below +1.
We can also spot several potential outliers in the plot, such as the teams with 5518 and 5600 at bats.
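As a rough check on the coefficient reported above, the correlation can also be computed by hand from its definition. The sketch below re-enters the runs and at_bats columns from the printed table so that it does not depend on the downloaded file; the names r_manual and r_builtin are only illustrative.

```r
# runs and at_bats copied from the printed mlb11 table, in row order
runs <- c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615,
          708, 644, 654, 735, 667, 713, 654, 704, 731, 743,
          619, 625, 610, 645, 707, 641, 624, 570, 593, 556)
at_bats <- c(5659, 5710, 5563, 5672, 5532, 5600, 5518, 5447, 5544, 5598,
             5585, 5436, 5549, 5612, 5513, 5579, 5502, 5509, 5421, 5559,
             5487, 5508, 5421, 5452, 5436, 5528, 5441, 5486, 5417, 5421)

n <- length(runs)

# Sample correlation: sum of cross-deviations divided by (n - 1) * sd(x) * sd(y)
r_manual <- sum((at_bats - mean(at_bats)) * (runs - mean(runs))) /
  ((n - 1) * sd(at_bats) * sd(runs))

# Built-in equivalent, for comparison
r_builtin <- cor(runs, at_bats)

c(manual = r_manual, builtin = r_builtin)
```

Both values should agree with each other and with the cor() output shown above.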
The smallest sum of squares that I got after running the plot_ss function several times is 125153, with a slope of 0.5882 and an intercept of -2549.4628.
The neighboring values deviate from this smallest value by around 4000 to 5000.
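To make the idea behind plot_ss concrete, here is a minimal sketch that computes the sum of squared residuals for any candidate line and compares it with the least squares fit from lm(). The helper sum_sq_resid is hypothetical (it is not part of the lab's workspace), and the data are re-entered from the printed table.

```r
runs <- c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615,
          708, 644, 654, 735, 667, 713, 654, 704, 731, 743,
          619, 625, 610, 645, 707, 641, 624, 570, 593, 556)
at_bats <- c(5659, 5710, 5563, 5672, 5532, 5600, 5518, 5447, 5544, 5598,
             5585, 5436, 5549, 5612, 5513, 5579, 5502, 5509, 5421, 5559,
             5487, 5508, 5421, 5452, 5436, 5528, 5441, 5486, 5417, 5421)

# Hypothetical helper: sum of squared residuals for the line intercept + slope * x
sum_sq_resid <- function(intercept, slope, x, y) {
  sum((y - (intercept + slope * x))^2)
}

# The hand-picked line reported above
ss_guess <- sum_sq_resid(-2549.4628, 0.5882, at_bats, runs)

# The least squares line minimises this quantity over all lines
fit <- lm(runs ~ at_bats)
ss_fit <- sum(residuals(fit)^2)

c(guess = ss_guess, least_squares = ss_fit)
coef(fit)
```

No hand-picked line can give a smaller sum of squares than the lm() fit, which is why repeated runs of plot_ss only approach the minimum.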
plot(mlb11$runs ~ mlb11$homeruns, main = "Relationship between Runs and Home Runs", xlab = "Home Runs", ylab = "Runs")
Correlation Coefficient
cor(mlb11$runs, mlb11$homeruns)
## [1] 0.7915577
m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
Equation of the regression line for the relationship between runs and home runs:
y^ = 415.2389 + 1.8345 * homeruns
Looking at the plot, we can say that the relationship between runs and home runs is linear, positive, and relatively strong, as the correlation coefficient of 0.7916 is close to +1.
Least squares regression line for runs vs at_bats:
y^ = -2789.2429 + 0.6305 * at_bats
If at_bats is 5,578, the predicted number of runs is
y^ = -2789.2429 + 0.6305 * 5578 = 727.6861
The estimated number of runs for 5578 at bats, based on the regression equation above, is about 728. No team in the data has exactly 5578 at bats, but the Philadelphia Phillies had 5579 at bats and scored 713 runs. Using the Phillies as the closest comparison, the model overestimates the runs for a team with 5578 at bats by about 728 - 713 = 15 runs, which corresponds to a residual of about -15.
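The prediction and residual above can be reproduced with a few lines of arithmetic. This is a small sketch using the coefficients quoted in the text; predict_runs is a hypothetical helper, and the Phillies (5579 at bats) are used as the nearest observed team, as in the discussion above.

```r
# Coefficients of the runs ~ at_bats line as quoted in the text
intercept <- -2789.2429
slope <- 0.6305

# Hypothetical helper: predicted runs for a given number of at bats
predict_runs <- function(at_bats) intercept + slope * at_bats

predicted <- predict_runs(5578)

# Philadelphia Phillies: 5579 at bats, 713 runs (closest team in the data)
observed <- 713

# Residual = observed - predicted; a negative value means an overestimate
residual <- observed - predicted

round(c(predicted = predicted, residual = residual), 4)
```

With the fitted model object available, predict(m1, data.frame(at_bats = 5578)) would give the same prediction up to rounding of the coefficients.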
m1 <- lm(runs ~ at_bats, data = mlb11)
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)
Based on the plot, there is no apparent pattern in the residuals: the points are scattered around the dashed line without any obvious curvature, although the spread is somewhat uneven and slightly skewed. The relationship can therefore be treated as linear.
m1 <- lm(runs ~ at_bats, data = mlb11)
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals)
Looking at the histogram and the normal probability plot, I would say that the nearly normal residuals condition has been met.
The variation of points around the least squares line appears to be reasonably constant, so the constant variability condition has also been met.
Let us take bat_avg as the predictor variable, as I think it might also be a good predictor of runs.
plot(mlb11$runs ~ mlb11$bat_avg, main = "Relationship between Runs and Batting Avg", xlab = "Batting Avg", ylab = "Runs")
m3 <- lm(runs ~ bat_avg, data = mlb11)
abline(m3)
Correlation Coefficient:
cor(mlb11$runs, mlb11$bat_avg)
## [1] 0.8099859
Linear Regression Line Formula: y^ = -642.8 + 5242.2 * bat_avg
Based on the plot, the linear model statistics, and the correlation coefficient for the relationship between runs and batting average, it is evident that the relationship is positive, linear, and relatively strong.
R2 is the percentage of the variance in the dependent variable that is explained by the linear model. R2 always lies between 0% and 100%: the higher the value, the better the linear model explains the dependent variable, and the lower the value, the weaker its predictive power.
Let m1 be the model for the relationship between runs and at bats, which produces an R2 of 37.29%. Let m3 be the model for the relationship between runs and bat_avg, which produces an R2 of 65.61%.
Looking at the R2 values of both models, the R2 of m3 is far greater than that of m1, so it is clear that bat_avg predicts runs better than at_bats.
After running the summary statistics for all the variables examined so far, the one that best predicts runs based on R2 turned out to be bat_avg.
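The comparison of predictors by R2 can be sketched as a small loop. The example below is only illustrative: it re-enters four columns from the printed table (bat_avg as displayed, to three decimals) and uses a hypothetical helper r_squared; with the full mlb11 data frame loaded, the same helper could be applied to every candidate column.

```r
# Columns copied from the printed mlb11 table, in row order
mlb <- data.frame(
  runs = c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615,
           708, 644, 654, 735, 667, 713, 654, 704, 731, 743,
           619, 625, 610, 645, 707, 641, 624, 570, 593, 556),
  at_bats = c(5659, 5710, 5563, 5672, 5532, 5600, 5518, 5447, 5544, 5598,
              5585, 5436, 5549, 5612, 5513, 5579, 5502, 5509, 5421, 5559,
              5487, 5508, 5421, 5452, 5436, 5528, 5441, 5486, 5417, 5421),
  homeruns = c(210, 203, 169, 129, 162, 108, 222, 185, 163, 95,
               191, 117, 148, 183, 155, 153, 154, 154, 172, 186,
               103, 149, 107, 114, 172, 173, 154, 121, 91, 109),
  bat_avg = c(0.283, 0.280, 0.277, 0.275, 0.273, 0.264, 0.263, 0.261,
              0.258, 0.258, 0.257, 0.257, 0.256, 0.256, 0.253, 0.253,
              0.252, 0.250, 0.250, 0.249, 0.247, 0.247, 0.244, 0.244,
              0.244, 0.243, 0.242, 0.242, 0.237, 0.233)
)

# Hypothetical helper: R-squared of the simple regression runs ~ predictor
r_squared <- function(predictor, data) {
  fit <- lm(reformulate(predictor, response = "runs"), data = data)
  summary(fit)$r.squared
}

predictors <- c("at_bats", "homeruns", "bat_avg")
r2 <- sapply(predictors, r_squared, data = mlb)

# Predictors ranked from best to worst by R-squared
round(sort(r2, decreasing = TRUE), 4)
```

For a single predictor, R2 is simply the square of the correlation coefficient, so ranking by R2 and ranking by the absolute correlation give the same order.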
plot(mlb11$runs ~ mlb11$bat_avg, main = "Relationship between Runs and Batting Avg", xlab = "Batting Avg", ylab = "Runs")
m4 <- lm(runs ~ bat_avg, data = mlb11)
abline(m4)
hist(m4$residuals)
qqnorm(m4$residuals)
qqline(m4$residuals)
Correlation Coefficient:
cor(mlb11$runs, mlb11$bat_avg)
## [1] 0.8099859
Summary statistics:
summary(m4)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
Examining the three newer variables: new_onbase, new_slug and new_obs
cor(mlb11$runs, mlb11$new_onbase)
## [1] 0.9214691
cor(mlb11$runs, mlb11$new_slug)
## [1] 0.9470324
cor(mlb11$runs, mlb11$new_obs)
## [1] 0.9669163
summary(lm(runs ~ new_onbase, data = mlb11))
##
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.270 -18.335 3.249 19.520 69.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1118.4 144.5 -7.741 1.97e-08 ***
## new_onbase 5654.3 450.5 12.552 5.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared: 0.8491, Adjusted R-squared: 0.8437
## F-statistic: 157.6 on 1 and 28 DF, p-value: 5.116e-13
summary(lm(runs ~ new_slug, data = mlb11))
##
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.41 -18.66 -0.91 16.29 52.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.80 68.71 -5.47 7.70e-06 ***
## new_slug 2681.33 171.83 15.61 2.42e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared: 0.8969, Adjusted R-squared: 0.8932
## F-statistic: 243.5 on 1 and 28 DF, p-value: 2.42e-15
summary(lm(runs ~ new_obs, data = mlb11))
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
plot(mlb11$runs ~ mlb11$new_obs, main = "Relationship between Runs and new_obs", xlab = "new_obs", ylab = "Runs")
m5 <- lm(runs ~ new_obs, data = mlb11)
abline(m5)
After examining the summary statistics and correlation coefficients of all three new predictors (new_onbase, new_slug and new_obs), the relationship between runs and new_obs has the highest R2 and correlation coefficient, so new_obs appears to be the best and most effective predictor of runs.
Model diagnostics for the regression model with the best predictor of runs, new_obs
m5 <- lm(runs ~ new_obs, data = mlb11)
(1) Linearity:
Based on the residual plot, the relationship looks linear: the residuals are scattered around zero and show no sign of curvature.
plot(m5$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3)
(2) Nearly normal residuals:
If the residuals are approximately normally distributed, then the normal quantile-quantile plot of the residuals will follow an approximately straight line.
The normal quantile-quantile plot of the residuals follows a fairly straight line, so we can say that the residuals are approximately normally distributed and the model meets the nearly normal residuals condition.
hist(m5$residuals)
qqnorm(m5$residuals)
qqline(m5$residuals)
(3) Constant variability:
Based on the residual plot, the variability of points around the least squares line remains roughly constant, so the constant variability condition has been met.
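As an informal numeric companion to the plots, the two residual conditions can also be summarised with a couple of statistics. This is only a sketch, not part of the original analysis: the data are re-entered from the printed table (new_obs to three decimals, so the fit may differ slightly from the summary output), and qq_cor and spread_ratio are ad hoc measures chosen for illustration rather than formal tests.

```r
runs <- c(855, 875, 787, 730, 762, 718, 867, 721, 735, 615,
          708, 644, 654, 735, 667, 713, 654, 704, 731, 743,
          619, 625, 610, 645, 707, 641, 624, 570, 593, 556)
new_obs <- c(0.800, 0.810, 0.773, 0.744, 0.766, 0.725, 0.788, 0.750,
             0.739, 0.684, 0.729, 0.697, 0.715, 0.734, 0.714, 0.717,
             0.706, 0.714, 0.736, 0.730, 0.666, 0.706, 0.676, 0.680,
             0.724, 0.695, 0.691, 0.671, 0.653, 0.640)

fit <- lm(runs ~ new_obs)
res <- residuals(fit)

# Nearly normal residuals: correlation between the sample quantiles and the
# theoretical normal quantiles; values close to 1 mean a nearly straight Q-Q plot
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)

# Constant variability: ratio of the residual spread for teams at or below the
# median new_obs to the spread for teams above it; values near 1 suggest
# similar variability across the range of the predictor
low <- new_obs <= median(new_obs)
spread_ratio <- sd(res[low]) / sd(res[!low])

round(c(qq_cor = qq_cor, spread_ratio = spread_ratio), 3)
```

These numbers do not replace the residual plot, histogram, and Q-Q plot; they only give a quick way to compare models on the same conditions.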