load("more/mlb11.RData")
runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?
Scatter plot. Based on the plot below, the relationship looks linear. I would be comfortable using a linear model to predict the number of runs.
plot(x = mlb11$at_bats, y = mlb11$runs)
If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.
cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627
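For a simple linear regression with a single predictor, R\(^2\) equals the square of the correlation coefficient, so the value above also indicates how much of the variability in runs a model on at_bats could explain. A minimal sketch of that arithmetic, using only the correlation printed above (the variable names are illustrative):

```r
# Correlation between runs and at_bats, copied from the output above
r <- 0.610627

# In simple linear regression, R-squared is the squared correlation
r_squared <- r^2
print(round(r_squared, 4))
```

This comes to roughly 0.37, so at_bats alone accounts for a bit over a third of the variability in runs.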
There is a positive linear relationship between the two variables: as at-bats increase, the number of runs scored also increases. There are a few unusual observations; for example, the team with 5510 at-bats scored 860 runs. I have highlighted in the picture a few points that are outliers or otherwise unusual.
plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?
I ran the function many times. The smallest value I got was 28800. As seen in the attached picture, the highlighted points affect the calculated value.
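The quantity being minimized here is the sum of squared residuals for a candidate line. The sketch below shows that calculation on a small made-up data set; `sum_sq` is a hypothetical helper written for illustration and is not part of the lab's code. It also shows that the least squares line from `lm` never has a larger sum of squares than a hand-picked line.

```r
# Hypothetical helper: sum of squared residuals for the line y = b0 + b1 * x
sum_sq <- function(x, y, b0, b1) {
  sum((y - (b0 + b1 * x))^2)
}

# Made-up toy data, only to illustrate the calculation
x <- c(1, 2, 3, 4)
y <- c(2.1, 3.9, 6.2, 7.8)

# Sum of squares for a line chosen by eye
guess <- sum_sq(x, y, b0 = 0, b1 = 2)

# Sum of squares for the least squares line
fit <- lm(y ~ x)
best <- sum(residuals(fit)^2)

print(c(guess = guess, best = best))
```

Any line picked by clicking two points can at best match the least squares value, which is why repeated attempts tend to level off at some minimum.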
m1 <- lm(runs ~ at_bats, data = mlb11)
homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?
plot_ss(x = mlb11$homeruns, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 415.239 1.835
##
## Sum of Squares: 73671.99
hr_run <- lm(runs ~ homeruns, data = mlb11)
summary(hr_run)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
\[ \hat{y} = 415.2389 + 1.8345 \times \text{homeruns} \]
The relationship between homeruns and runs is positive and linear. For each additional home run, the model predicts about 1.83 additional runs.
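One way to read the slope is to compare predictions for two teams that differ only in home runs. The sketch below does this with the coefficients from the summary above; `predict_runs` is a hypothetical helper, and the home run totals 150 and 160 are arbitrary values chosen for illustration.

```r
# Hypothetical helper built from the fitted coefficients shown above
predict_runs <- function(homeruns) {
  415.2389 + 1.8345 * homeruns
}

# Difference in predicted runs for two teams 10 home runs apart
diff_10 <- predict_runs(160) - predict_runs(150)
print(diff_10)
```

The difference is ten times the slope, about 18.3 runs, whichever pair of totals ten apart is used.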
I would say the manager expects around 700+ runs. The predicted value is -2789.2429 + 0.6305 * 5578, which is about 727.69. The observed point has a negative residual: it lies below the regression line, so the prediction is an overestimate.
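The prediction can be checked directly from the coefficients quoted above. This is only a sketch of the arithmetic; the names `b0`, `b1` and `predicted` are illustrative.

```r
# Intercept and slope of the runs ~ at_bats model, as quoted above
b0 <- -2789.2429
b1 <- 0.6305

# Predicted runs for a team with 5578 at-bats
predicted <- b0 + b1 * 5578
print(round(predicted, 2))
```

With the fitted model available, `predict(m1, newdata = data.frame(at_bats = 5578))` should give the same value up to rounding of the coefficients.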
There is no apparent pattern in the residual plot. The points are spread evenly above and below zero, so the relationship between runs and at-bats appears to be linear.
Yes, based on the histogram and the normal Q-Q plot. The Q-Q plot shows that the points lie close to the line.
Constant variability:
Yes, except for one or two points. In the residual plot, the points are spread evenly above and below zero.
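The three checks discussed above (linearity, nearly normal residuals, constant variability) are usually made from a residual plot, a histogram and a normal Q-Q plot. A minimal sketch follows. It uses simulated data so that it runs on its own; the coefficients and noise level are assumptions for illustration, not the mlb11 fit.

```r
# Simulated stand-in data (assumed values, not mlb11)
set.seed(1)
x <- runif(30, min = 5400, max = 5700)
y <- -2789 + 0.63 * x + rnorm(30, sd = 60)

fit <- lm(y ~ x)
res <- residuals(fit)

# Linearity and constant variability: residuals against the predictor
plot(res ~ x)
abline(h = 0, lty = 3)

# Nearly normal residuals: histogram and normal Q-Q plot
hist(res)
qqnorm(res)
qqline(res)
```

For the lab data, the same calls would be applied to `m1$residuals` and `mlb11$at_bats`.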
1 Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?
I have selected hits as the variable.
The relationship between hits and runs is linear, as seen in the plots below.
plot(x = mlb11$hits, y = mlb11$runs)
plot_ss(x = mlb11$hits, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -375.5600 0.7589
##
## Sum of Squares: 70638.75
onbase_run <- lm(runs ~ hits, data = mlb11)
summary(onbase_run)
##
## Call:
## lm(formula = runs ~ hits, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.718 -27.179 -5.233 19.322 140.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.5600 151.1806 -2.484 0.0192 *
## hits 0.7589 0.1071 7.085 1.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared: 0.6419, Adjusted R-squared: 0.6292
## F-statistic: 50.2 on 1 and 28 DF, p-value: 1.043e-07
2 How does this relationship compare to the relationship between runs and at_bats? Use the R\(^2\) values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?
Both relationships are positive and linear, but the one between hits and runs is stronger.
hits predicts runs better than at_bats because its R\(^2\) is higher.
R\(^2\) runs ~ hits = 0.6419
R\(^2\) runs ~ at_bats = 0.3729 (the square of the correlation 0.610627 computed above)
3 Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).
Based on the R-squared values, bat_avg is the best predictor of runs, followed by hits.
hits multiple R-squared: 0.6419
at_bats multiple R-squared: 0.3729
strikeouts multiple R-squared: 0.1694
stolen_bases multiple R-squared: 0.002914
bat_avg multiple R-squared: 0.6561
homeruns multiple R-squared: 0.6266
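A list like the one above can be produced in one pass rather than by fitting each model by hand. The sketch below assumes a hypothetical helper, `r2_table`, and demonstrates it on a small made-up data frame; the numbers in `toy` are invented for illustration and are not the mlb11 values.

```r
# Hypothetical helper: R-squared of <response> ~ <predictor> for each predictor,
# sorted from best to worst
r2_table <- function(data, predictors, response = "runs") {
  r2 <- sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$r.squared
  })
  sort(r2, decreasing = TRUE)
}

# Made-up toy data, only to show the call
toy <- data.frame(
  runs         = c(610, 650, 700, 720, 760, 800),
  hits         = c(1300, 1350, 1400, 1390, 1450, 1500),
  stolen_bases = c(90, 120, 60, 130, 80, 100)
)

r2 <- r2_table(toy, c("hits", "stolen_bases"))
print(r2)
```

With the lab data loaded, the same helper could be called as `r2_table(mlb11, c("at_bats", "hits", "homeruns", "bat_avg", "strikeouts", "stolen_bases"))`.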
plot(x = mlb11$hits, y = mlb11$runs)
onbase_run <- lm(runs ~ hits, data = mlb11)
summary(onbase_run)
##
## Call:
## lm(formula = runs ~ hits, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.718 -27.179 -5.233 19.322 140.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.5600 151.1806 -2.484 0.0192 *
## hits 0.7589 0.1071 7.085 1.04e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.23 on 28 degrees of freedom
## Multiple R-squared: 0.6419, Adjusted R-squared: 0.6292
## F-statistic: 50.2 on 1 and 28 DF, p-value: 1.043e-07
plot(onbase_run$residuals ~ mlb11$hits)
abline(h = 0, lty = 3)
4 Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team’s success. In general, are they more or less effective at predicting runs than the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?
new_obs is the best predictor of runs based on the R-squared values. This makes sense: a team that has a high on-base percentage and good sluggers will score more runs.
new_onbase multiple R-squared: 0.8491
new_slug multiple R-squared: 0.8969
new_obs multiple R-squared: 0.9349
plot(x = mlb11$new_obs, y = mlb11$runs)
onbase_run <- lm(runs ~ new_obs, data = mlb11)
summary(onbase_run)
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
plot(onbase_run$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3)
5 Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.
plot(x = mlb11$hits, y = mlb11$runs)
plot_ss(x = mlb11$hits, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -375.5600 0.7589
##
## Sum of Squares: 70638.75
onbase_run <- lm(runs ~ hits, data = mlb11)
plot(onbase_run$residuals ~ mlb11$hits)
abline(h = 0, lty = 3)
hist(onbase_run$residuals)
qqnorm(onbase_run$residuals)
qqline(onbase_run$residuals)
Constant variability: Based on the residual plot, the points are evenly spread above and below zero without any apparent pattern.
Normal: Based on the histogram and Q-Q plot, the residuals are nearly normally distributed.
Linearity: Based on the scatterplot, the relationship between the variables is positive and linear. The residual plot doesn't show any pattern.
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by the faculty and TAs of UCLA Statistics.