download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
summary(mlb11)
team runs at_bats hits
Arizona Diamondbacks: 1 Min. :556 Min. :5417 Min. :1263
Atlanta Braves : 1 1st Qu.:629 1st Qu.:5448 1st Qu.:1348
Baltimore Orioles : 1 Median :706 Median :5516 Median :1394
Boston Red Sox : 1 Mean :694 Mean :5524 Mean :1409
Chicago Cubs : 1 3rd Qu.:734 3rd Qu.:5575 3rd Qu.:1441
Chicago White Sox : 1 Max. :875 Max. :5710 Max. :1600
(Other) :24
homeruns bat_avg strikeouts stolen_bases
Min. : 91 Min. :0.233 Min. : 930 Min. : 49.0
1st Qu.:118 1st Qu.:0.245 1st Qu.:1085 1st Qu.: 89.8
Median :154 Median :0.253 Median :1140 Median :107.0
Mean :152 Mean :0.255 Mean :1150 Mean :109.3
3rd Qu.:173 3rd Qu.:0.260 3rd Qu.:1248 3rd Qu.:130.8
Max. :222 Max. :0.283 Max. :1323 Max. :170.0
wins new_onbase new_slug new_obs
Min. : 56 Min. :0.292 Min. :0.348 Min. :0.640
1st Qu.: 72 1st Qu.:0.311 1st Qu.:0.377 1st Qu.:0.692
Median : 80 Median :0.318 Median :0.399 Median :0.716
Mean : 81 Mean :0.321 Mean :0.399 Mean :0.719
3rd Qu.: 90 3rd Qu.:0.328 3rd Qu.:0.413 3rd Qu.:0.738
Max. :102 Max. :0.349 Max. :0.461 Max. :0.810
I would use a scatterplot to present two numerical variables simultaneously because it makes the relationship between them easy to examine. The relationship is positive but only moderately strong, so I would not be very comfortable using a linear model to predict the number of runs.
plot(mlb11$runs ~ mlb11$at_bats, main = "Relationship")
The moderately strong relationship between these two variables shows in the calculated correlation coefficient of 0.6106. A strong positive relationship would put the correlation near +1, and 0.6106 is not that close. I am also seeing a few positive outliers in the plot that could skew the line. Most teams have about 5,600 or fewer at-bats.
cor(mlb11$runs, mlb11$at_bats)
[1] 0.6106
I ran plot_ss five times, and the best sum of squares I got was 127,559. Comparing that with the R-generated least-squares result below, the two are not terribly far apart.
Call:
lm(formula = y ~ x, data = pts)

Coefficients:
(Intercept) x
-2044.9172 0.4951

Sum of Squares: 127559.4
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.243 0.631
##
## Sum of Squares: 123722
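As a sanity check on plot_ss, the minimized sum of squares can also be computed directly from the lm() fit; a minimal sketch:

m1 <- lm(runs ~ at_bats, data = mlb11)
sum(resid(m1)^2) # ~123722, matching the least-squares result above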
y = b0 + b1 * x = 415.2389 + 1.8345 * homeruns
In terms of the relationship between a team's success and its home runs, it seems that for every additional home run a team hits, the average number of total runs increases by about 1.83. This is a positive relationship with a correlation coefficient of 0.7916, which is relatively strong.
cor(mlb11$runs, mlb11$homeruns)
[1] 0.7916
plot_ss(x = mlb11$homeruns, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 415.24 1.83
##
## Sum of Squares: 73672
m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)
Call:
lm(formula = runs ~ homeruns, data = mlb11)

Residuals:
Min 1Q Median 3Q Max
-91.61 -33.41 3.23 24.29 104.63

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 415.239 41.678 9.96 1.0e-10 ***
homeruns 1.835 0.268 6.85 1.9e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 51.3 on 28 degrees of freedom
Multiple R-squared: 0.627, Adjusted R-squared: 0.613
F-statistic: 47 on 1 and 28 DF, p-value: 1.9e-07
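To make the slope concrete, the fitted model can be used to predict total runs for a hypothetical team; the 150 home runs below is an illustrative value of my own, not from the data:

predict(m2, newdata = data.frame(homeruns = 150)) # 415.24 + 1.83 * 150, about 690 runs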
Based on the least-squares regression line for the linear model below, the estimated runs for a team with 5,578 at-bats is 730.5. Looking at the actual observed data, no team has exactly 5,578 at-bats, but the Philadelphia Phillies had 5,579 at-bats with 713 runs. Using these two numbers, the model overestimated the runs by 730.5 - 713 = 17.5.
b0 <- -2789.243
b1 <- 0.631
x <- 5578
Yhat <- b0 + b1*x
Yhat
[1] 730.5
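The same prediction can be obtained with predict() instead of hand-coding the coefficients; a quick sketch:

m1 <- lm(runs ~ at_bats, data = mlb11)
predict(m1, newdata = data.frame(at_bats = 5578)) # ~730.5, matching the hand computation
730.5 - 713 # overestimate relative to the Phillies' 713 observed runs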
mlb11[order(mlb11$runs,mlb11$at_bats),]
team runs at_bats hits homeruns bat_avg strikeouts
30 Seattle Mariners 556 5421 1263 109 0.233 1280
28 San Francisco Giants 570 5486 1327 121 0.242 1122
29 San Diego Padres 593 5417 1284 91 0.237 1320
23 Pittsburgh Pirates 610 5421 1325 107 0.244 1308
10 Houston Astros 615 5598 1442 95 0.258 1164
21 Minnesota Twins 619 5487 1357 103 0.247 1048
27 Washington Nationals 624 5441 1319 154 0.242 1323
22 Florida Marlins 625 5508 1358 149 0.247 1244
26 Atlanta Braves 641 5528 1345 173 0.243 1260
12 Los Angeles Dodgers 644 5436 1395 117 0.257 1087
24 Oakland Athletics 645 5452 1330 114 0.244 1094
17 Chicago White Sox 654 5502 1387 154 0.252 989
13 Chicago Cubs 654 5549 1423 148 0.256 1202
15 Los Angeles Angels 667 5513 1394 155 0.253 1086
18 Cleveland Indians 704 5509 1380 154 0.250 1269
25 Tampa Bay Rays 707 5436 1324 172 0.244 1193
11 Baltimore Orioles 708 5585 1434 191 0.257 1120
16 Philadelphia Phillies 713 5579 1409 153 0.253 1024
6 New York Mets 718 5600 1477 108 0.264 1085
8 Milwaukee Brewers 721 5447 1422 185 0.261 1083
4 Kansas City Royals 730 5672 1560 129 0.275 1006
19 Arizona Diamondbacks 731 5421 1357 172 0.250 1249
9 Colorado Rockies 735 5544 1429 163 0.258 1201
14 Cincinnati Reds 735 5612 1438 183 0.256 1250
20 Toronto Blue Jays 743 5559 1384 186 0.249 1184
5 St. Louis Cardinals 762 5532 1513 162 0.273 978
3 Detroit Tigers 787 5563 1540 169 0.277 1143
1 Texas Rangers 855 5659 1599 210 0.283 930
7 New York Yankees 867 5518 1452 222 0.263 1138
2 Boston Red Sox 875 5710 1600 203 0.280 1108
stolen_bases wins new_onbase new_slug new_obs
30 125 67 0.292 0.348 0.640
28 85 86 0.303 0.368 0.671
29 170 71 0.305 0.349 0.653
23 108 72 0.309 0.368 0.676
10 118 56 0.311 0.374 0.684
21 92 63 0.306 0.360 0.666
27 106 80 0.309 0.383 0.691
22 95 72 0.318 0.388 0.706
26 77 89 0.308 0.387 0.695
12 126 82 0.322 0.375 0.697
24 117 74 0.311 0.369 0.680
17 81 79 0.319 0.388 0.706
13 69 71 0.314 0.401 0.715
15 135 86 0.313 0.402 0.714
18 89 80 0.317 0.396 0.714
25 155 91 0.322 0.402 0.724
11 81 69 0.316 0.413 0.729
16 96 102 0.323 0.395 0.717
6 130 77 0.335 0.391 0.725
8 94 96 0.325 0.425 0.750
4 153 71 0.329 0.415 0.744
19 133 94 0.322 0.413 0.736
9 118 73 0.329 0.410 0.739
14 97 79 0.326 0.408 0.734
20 131 81 0.317 0.413 0.730
5 57 90 0.341 0.425 0.766
3 49 95 0.340 0.434 0.773
1 143 96 0.340 0.460 0.800
7 147 97 0.343 0.444 0.788
2 102 90 0.349 0.461 0.810
m1 <- lm(runs ~ at_bats, data = mlb11)
plot(mlb11$runs ~ mlb11$at_bats, main = "Relationship")
abline(m1)
summary(m1)
Call:
lm(formula = runs ~ at_bats, data = mlb11)
Residuals:
Min 1Q Median 3Q Max
-125.6 -47.0 -16.6 54.4 176.9
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2789.243 853.696 -3.27 0.00287 **
at_bats 0.631 0.155 4.08 0.00034 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 66.5 on 28 degrees of freedom
Multiple R-squared: 0.373, Adjusted R-squared: 0.35
F-statistic: 16.6 on 1 and 28 DF, p-value: 0.000339
The residuals show no obvious patterns and appear to be scattered randomly around the dashed line that represents 0. I would say that the relationship is linear.
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
Based on the histogram and normal probability plot below, the residuals look nearly normal to me.
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals) # adds diagonal line to the normal prob plot
The constant variability condition requires that the variability of points around the least-squares line remain roughly constant. Based on the plots above, this condition appears to be met.
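As a further check on constant variability, the residuals can be plotted against the fitted values; the vertical spread should stay roughly even across the range:

plot(m1$fitted.values, m1$residuals, xlab = "Fitted runs", ylab = "Residuals")
abline(h = 0, lty = 3) # horizontal reference line at y = 0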
### 1. Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?
Since we already looked at the relationship between runs and homeruns and between runs and at_bats, I chose runs and bat_avg to see if it is a good predictor. From the plot and summary statistics below, it looks to me that the two variables fit a linear model.

y = b0 + b1 * x = -642.8 + 5242.2 * bat_avg
m3 <- lm(runs ~ bat_avg, data = mlb11)
plot(mlb11$runs ~ mlb11$bat_avg, main = "Relationship3")
abline(m3)
summary(m3)
Call:
lm(formula = runs ~ bat_avg, data = mlb11)
Residuals:
Min 1Q Median 3Q Max
-94.7 -26.3 -5.5 28.5 131.1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -643 183 -3.51 0.0015 **
bat_avg 5242 717 7.31 5.9e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 49.2 on 28 degrees of freedom
Multiple R-squared: 0.656, Adjusted R-squared: 0.644
F-statistic: 53.4 on 1 and 28 DF, p-value: 5.88e-08
R-squared measures how close the data are to the least-squares line: 0% indicates that the model explains none of the variability of the response around its mean, and 100% indicates that it explains all of it. Comparing the R-squared values for runs vs. at_bats and runs vs. bat_avg, the latter predicts runs better: R-squared is 0.6561 for bat_avg vs. 0.3729 for at_bats, meaning the bat_avg model explains 65.61% of the variability in runs.

After running summary statistics for all the other traditional variables, the best variable for predicting runs turns out to be bat_avg; it has the highest R-squared value.
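A compact way to run that comparison, assuming the "traditional" variables are the non-new columns, is to fit each simple regression in a loop and extract its R-squared:

traditional <- c("at_bats", "hits", "homeruns", "bat_avg", "strikeouts", "stolen_bases", "wins")
r2 <- sapply(traditional, function(v) summary(lm(reformulate(v, response = "runs"), data = mlb11))$r.squared)
sort(r2, decreasing = TRUE) # bat_avg should come out on top (~0.656)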
m3 <- lm(runs ~ bat_avg, data = mlb11)
hist(m3$residuals)
qqnorm(m3$residuals)
qqline(m3$residuals) # adds diagonal line to the normal prob plot
If I knew nothing about baseball and had only the following summary statistics to pick the new variable most effective at predicting runs, I would pick new_obs: its R-squared is a high 93.5%!
names(mlb11)
[1] "team" "runs" "at_bats" "hits"
[5] "homeruns" "bat_avg" "strikeouts" "stolen_bases"
[9] "wins" "new_onbase" "new_slug" "new_obs"
mnew_obs <- lm(runs ~ new_obs, data = mlb11)
mnew_slug <- lm(runs ~ new_slug, data = mlb11)
mnew_onbase <- lm(runs ~ new_onbase, data = mlb11)
summary(mnew_obs)
Call:
lm(formula = runs ~ new_obs, data = mlb11)
Residuals:
Min 1Q Median 3Q Max
-43.46 -13.69 1.16 13.94 41.16
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -686.6 68.9 -9.96 1e-10 ***
new_obs 1919.4 95.7 20.06 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.4 on 28 degrees of freedom
Multiple R-squared: 0.935, Adjusted R-squared: 0.933
F-statistic: 402 on 1 and 28 DF, p-value: <2e-16
summary(mnew_slug)
Call:
lm(formula = runs ~ new_slug, data = mlb11)
Residuals:
Min 1Q Median 3Q Max
-45.41 -18.66 -0.91 16.29 52.29
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -375.8 68.7 -5.47 7.7e-06 ***
new_slug 2681.3 171.8 15.60 2.4e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 27 on 28 degrees of freedom
Multiple R-squared: 0.897, Adjusted R-squared: 0.893
F-statistic: 244 on 1 and 28 DF, p-value: 2.42e-15
summary(mnew_onbase)
Call:
lm(formula = runs ~ new_onbase, data = mlb11)
Residuals:
Min 1Q Median 3Q Max
-58.27 -18.33 3.25 19.52 69.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1118 144 -7.74 2.0e-08 ***
new_onbase 5654 450 12.55 5.1e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 32.6 on 28 degrees of freedom
Multiple R-squared: 0.849, Adjusted R-squared: 0.844
F-statistic: 158 on 1 and 28 DF, p-value: 5.12e-13
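The same one-line comparison across the three new variables confirms the ranking described above:

sapply(c("new_onbase", "new_slug", "new_obs"), function(v) summary(lm(reformulate(v, response = "runs"), data = mlb11))$r.squared) # new_obs is highest at ~0.935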
mnew_obs <- lm(runs ~ new_obs, data = mlb11)
hist(mnew_obs$residuals)
qqnorm(mnew_obs$residuals)
qqline(mnew_obs$residuals) # adds diagonal line to the normal prob plot
avglunch <- 0.308 # mean percent of children receiving reduced-fee lunches
avghelmet <- 0.388 # mean percent of bike riders wearing helmets
sdlunch <- 0.267 # standard deviation of the lunch percentages
sdhelmet <- 0.169 # standard deviation of the helmet percentages
data <- data.frame(avglunch, avghelmet, sdlunch, sdhelmet)
data
avglunch avghelmet sdlunch sdhelmet
1 0.308 0.388 0.267 0.169
If R-squared for the least-squares regression line is 72%, the correlation coefficient is about -0.85; we take the negative square root because the relationship is negative. This indicates a strong negative linear relationship between the explanatory and response variables: 72% of the variation in the average percentage of bike riders wearing helmets can be explained by the average percentage of children receiving reduced-fee lunches.
R2 <- 0.72
-sqrt(R2) # negative root, since the relationship is negative
[1] -0.8485
The slope and intercept give the least-squares regression line y = 0.5534 - 0.5371x.
Sx <- 0.267 # standard deviation of x (reduced-fee lunch percentages)
Sy <- 0.169 # standard deviation of y (helmet percentages)
R <- -sqrt(0.72) # correlation, negative root
beta1 <- R * Sy / Sx # slope: b1 = R * (Sy / Sx)
beta1
[1] -0.5371
Meanx <- 0.308
Meany <- 0.388
beta0 <- Meany - beta1 * Meanx # intercept: b0 = mean(y) - b1 * mean(x)
beta0
[1] 0.5534
The intercept is the expected mean value of y when x = 0. In this context, the model predicts that 55.34% of bikers wear helmets when the average percentage of children receiving reduced-fee lunches is 0. The y-intercept is relevant here, since there are neighborhoods where no children receive reduced-fee lunches.

The slope is -0.5371, which means that for every increase of 1 percentage point in the share of children receiving reduced-fee lunches, the share of bike riders wearing helmets is expected to decrease by about 0.54 percentage points.
We first compute the predicted value of y (yhat), then use the residual formula e = y - yhat. Based on this computation, the linear model estimates that 33.86% of bike riders wear helmets when 40% of the children receive reduced-fee lunches. The residual, the difference between the observed and predicted values, is 0.0614 in this context, so the model underestimates the observed helmet percentage by about 6.1 percentage points.
x <- 0.40 # 40% of children receive reduced-fee lunches
y <- 0.40 # observed: 40% of bike riders wear helmets
beta0
[1] 0.5534
beta1
[1] -0.5371
yhat <- beta0 + beta1 * x # predicted proportion wearing helmets
yhat
[1] 0.3386
ex <- y - yhat # residual: observed minus predicted
ex
[1] 0.06141
We can see the strong positive relationship between these two variables by visually inserting a least-squares line among the data. There are very few positive outliers in the plot that may or may not skew the line.
y = beta0 + beta1 * x = -105.0113 + 1.0176 * x

The y-intercept is -105.0113, the predicted value of y when x is 0. In this context I do not think the y-intercept is meaningful, because a person cannot have a height of 0 centimeters. The slope of the model is 1.0176, which means that for every additional centimeter of height, the predicted weight goes up by 1.0176.
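To see the slope in action, the coefficients above can be plugged into the line to predict weight at a given height; the 170 cm here is an illustrative value of my own, and I am assuming the units are kilograms and centimeters:

beta0_hw <- -105.0113 # intercept from the model above
beta1_hw <- 1.0176 # slope: predicted weight change per additional cm
beta0_hw + beta1_hw * 170 # about 68, the predicted weight of a 170 cm person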
The sample statistics given show a p-value that rounds to 0.0000. The p-value is so small that we reject the null hypothesis: the data provide strong evidence that an increase in height is associated with an increase in weight.
R-squared is 0.5184, so 51.84% of the variation in the response variable (weight) can be explained by the explanatory variable (height).
R <- 0.72 # given correlation between height and weight
R2 <- R^2
R2
[1] 0.5184