load("more/mlb11.RData")
runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?
I would use a scatterplot to display the relationship between two numeric variables.
plot(mlb11$at_bats, mlb11$runs)
abline(lm(runs ~ at_bats, data = mlb11), col = "red")
The relationship looks weakly linear. Using a linear model to predict runs from at-bats would be reasonable, though the predictions would not be very precise given how weak the relationship is.
There is a weak positive linear relationship between at-bats and runs. One team stands out as a possible outlier, with over 850 runs on fewer than 5,550 at-bats.
plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
The smallest sum of squares is 123721.9.
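The coefficients printed above (-2789.2429 and 0.6305) match the least-squares fit of runs on at_bats, so this sum of squares should be the smallest one achievable. As a rough consistency check, the sketch below converts the sum of squares into a residual standard error, assuming 28 residual degrees of freedom as reported in the model summary; the variable names are illustrative only.

```r
# Values copied from the output in this report.
ss <- 123721.9   # sum of squares reported by plot_ss
df <- 28         # residual degrees of freedom from the lm summary

# Residual standard error = sqrt(sum of squared residuals / df).
rse <- sqrt(ss / df)
round(rse, 2)    # compare with the residual standard error in summary(m1)
```

The result agrees, up to rounding, with the residual standard error of 66.47 reported for the at_bats model.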
homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?
m1 <- lm(runs ~ at_bats, data = mlb11)
m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
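\[ \widehat{\text{runs}} = 415.2389 + 1.8345 \times \text{homeruns} \]
According to the slope, each additional home run is associated with an increase of about 1.83 runs in a team's predicted total, on average. In other words, teams that hit more home runs tend to score more runs.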
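To make the slope interpretation concrete, here is a minimal sketch that evaluates the fitted line by hand. The coefficients are copied from the summary(m2) output; the helper name predict_runs is hypothetical and not part of the lab code.

```r
# Coefficients copied from the summary(m2) output.
intercept <- 415.2389
slope     <- 1.8345

# Hypothetical helper: predicted runs for a given number of home runs.
predict_runs <- function(homeruns) intercept + slope * homeruns

# Prediction for a team with 153 home runs.
predict_runs(153)

# One extra home run changes the prediction by exactly the slope.
predict_runs(151) - predict_runs(150)
```

For 153 home runs the line predicts roughly 696 runs, and the difference between any two consecutive home-run totals is the slope, 1.8345.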
unname(m1$coefficients[1] + m1$coefficients[2] * 5578)
## [1] 727.965
mlb11[mlb11$at_bats == 5579,]
## team runs at_bats hits homeruns bat_avg strikeouts
## 16 Philadelphia Phillies 713 5579 1409 153 0.253 1024
## stolen_bases wins new_onbase new_slug new_obs
## 16 96 102 0.323 0.395 0.717
The manager would predict about 728 runs (727.965) for a team with 5,578 at-bats. The team with the closest number of at-bats in this dataset is the Philadelphia Phillies, with 5,579. The Phillies scored 713 runs, so the model overestimates their total. The residual (observed minus predicted) is 713 - 727.965 = -14.965.
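The residual arithmetic can be sketched as follows, using the usual convention residual = observed - predicted. Both numbers are taken from the output above; note that the observed value belongs to a team with 5,579 at-bats, so the comparison is approximate.

```r
observed  <- 713      # Phillies' runs (5,579 at-bats)
predicted <- 727.965  # model prediction at 5,578 at-bats

# A negative residual means the model overestimated the observed value.
residual <- observed - predicted
residual
```

An overestimate always gives a negative residual under this convention, which is why the sign matters when reporting it.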
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3) # adds a horizontal dotted line at y = 0
There does not appear to be any particular pattern in the residual plot, which suggests that a linear model is appropriate for the relationship between runs and at-bats.
hist(m1$residuals)
qqnorm(m1$residuals)
qqline(m1$residuals) # adds diagonal line to the normal prob plot
The histogram looks approximately normal, and the points do not deviate much from the QQ line, so the nearly normal residuals condition appears to be met.
Since the spread of the residuals stays roughly constant as the number of at-bats changes, the constant variability condition appears to be met.
mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?
m3 <- lm(runs ~ bat_avg, data = mlb11)
summary(m3)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
plot(mlb11$bat_avg, mlb11$runs)
abline(m3, col = "red")
There seems to be a linear relationship between batting average and runs.
runs and at_bats? Use the R\(^2\) values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
summary(m3)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
There appears to be a stronger relationship between batting average and runs compared to at-bats and runs. The R-squared value of model 3 is 0.6561, while the value for model 1 is 0.3729. Batting average therefore explains about 66% of the variability in runs, versus about 37% for at-bats, which makes batting average the better predictor of runs.
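In a simple linear regression, R-squared is the square of the correlation between the predictor and the response, so the two R-squared values can be converted back into correlations. The sketch below does this with the values copied from the two summaries; taking the positive root assumes a positive slope, which holds for both models here.

```r
# R-squared values copied from summary(m1) and summary(m3).
r_squared <- c(at_bats = 0.3729, bat_avg = 0.6561)

# With one predictor, |r| = sqrt(R^2); both fitted slopes are positive,
# so the correlations are positive as well.
r <- sqrt(r_squared)
round(r, 2)

# Name of the predictor with the stronger linear association.
names(which.max(r))
```

This gives correlations of roughly 0.61 for at-bats and 0.81 for batting average, consistent with the conclusion drawn from R-squared.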
runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).
library(corrplot)
## corrplot 0.84 loaded
corrplot.mixed(cor(mlb11[,2:12]))
hist(m3$residuals)
plot(m3$residuals)
abline(h = 0)
qqnorm(m3$residuals)
qqline(m3$residuals)
Out of all the traditional variables, batting average best predicts runs: its correlation with runs is the highest among them. The residuals for this model appear approximately normal, show no clear pattern, and have roughly constant variance.
m4 = lm(runs~new_obs, data = mlb11)
summary(m4)
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
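This model fits the data well. new_obs explains about 93% of the variability in runs (R-squared 0.9349) with a residual standard error of 21.41, compared with an R-squared of 0.6561 and a residual standard error of 49.23 for the batting-average model.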
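One way to compare the four fitted models side by side is to collect their summary statistics in a small table. The sketch below hard-codes the R-squared and residual standard error values printed in the summaries above; the data frame name models is illustrative only.

```r
# Summary statistics copied from the four model summaries in this report.
models <- data.frame(
  predictor = c("at_bats", "homeruns", "bat_avg", "new_obs"),
  r_squared = c(0.3729, 0.6266, 0.6561, 0.9349),
  rse       = c(66.47, 51.29, 49.23, 21.41)
)

# Sort from best to worst fit by R-squared.
ranked <- models[order(-models$r_squared), ]
ranked
```

In this table the ordering by R-squared is the reverse of the ordering by residual standard error, so both measures point to new_obs as the strongest single predictor among the models fitted here.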