Let’s load up the data for the 2011 season.
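A minimal loading sketch, assuming the OpenIntro lab's mlb11 .RData file and the ggplot2 package used for the plots below; adjust the URL or file path to wherever your copy of the data lives:

download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")  # creates the mlb11 data frame
library(ggplot2)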
Exercise: What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team's at_bats, would you be comfortable using a linear model to predict the number of runs?

ggplot(mlb11, aes(x = at_bats, y = runs)) +
  geom_point(color = 'red') +
  xlab('Number of at bats') +
  ylab('Number of runs') +
  ggtitle('runs vs at_bats')

Answer: I used a scatterplot. There appears to be a linear relationship between at_bats and runs, so I would be comfortable using a linear model to predict the number of runs from a team's at_bats.
If the relationship looks linear, we can quantify the strength of the relationship with the correlation coefficient.
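In R this is a single call (using the mlb11 data frame loaded above):

cor(mlb11$runs, mlb11$at_bats)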
## [1] 0.610627
Answer:
- There is a positive relationship between runs and at_bats.
- A correlation of 0.610627 indicates a positive but only moderately strong linear association.
- There is a great deal of scatter among the teams; the points are not tightly clustered around a single line.
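The interactive output below comes from the lab's plot_ss helper function; a call along these lines produces it (the exact signature depends on the version of the lab materials):

plot_ss(x = mlb11$at_bats, y = mlb11$runs)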
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
After running this command, you’ll be prompted to click two points on the plot to define a line. Once you’ve done that, the line you specified will be shown in black and the residuals in blue. Note that there are 30 residuals, one for each of the 30 observations. Recall that the residuals are the difference between the observed values and the values predicted by the line:
\[ e_i = y_i - \hat{y}_i \]
The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, you can rerun the plot command and add the argument showSquares = TRUE.
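A sketch of that call, under the same assumption about the plot_ss signature as above:

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)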
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
Note that the output from the plot_ss function provides you with the slope and intercept of your line as well as the sum of squares.
Exercise: Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?

Answer: The smallest sum of squares I obtained was 123721.9.
It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. Instead we can use the lm function in R to fit the linear model (a.k.a. regression line).
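The summary output below comes from a fit along these lines (m1 is the name this document uses for the at_bats model later on):

m1 <- lm(runs ~ at_bats, data = mlb11)
summary(m1)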
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
Let’s consider this output piece by piece. First, the formula used to describe the model is shown at the top. After the formula you find the five-number summary of the residuals. The “Coefficients” table shown next is key; its first column displays the linear model’s y-intercept and the coefficient of at_bats. With this table, we can write down the least squares regression line for the linear model:
\[ \hat{y} = -2789.2429 + 0.6305 \times at\_bats \]
Exercise: Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between the success of a team and its home runs?
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
\[ \hat{y} = -2789.2429 + 0.6305 \times at\_bats \]
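For comparison, the home-runs model named in the exercise would be fit analogously; a minimal sketch, assuming the predictor column is called homeruns in mlb11 (output not reproduced here):

m_hr <- lm(runs ~ homeruns, data = mlb11)  # m_hr is a hypothetical name
summary(m_hr)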
Let’s create a scatterplot with the least squares line laid on top.
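One way to produce that plot, letting geom_smooth add the least squares line:

ggplot(mlb11, aes(x = at_bats, y = runs)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE)  # least squares line, no confidence band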
Exercise: If a team manager saw the least squares regression line and not the actual data, how many runs would the manager predict for a team with 5,578 at-bats?

-2789.2429 + 0.6305 * 5578
## [1] 727.6861

Answer: The model predicts about 727.7 runs for a team with 5,578 at-bats.
To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
Linearity: You already checked if the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. at-bats. Recall that any code following a # is intended to be a comment that helps understand the code but is ignored by R.
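A sketch of that check, reusing the m1 fit from above:

ggplot(mlb11, aes(x = at_bats, y = m1$residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 3)  # dashed reference line at y = 0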
Nearly normal residuals: To check this condition, we can look at a histogram
or a normal probability plot of the residuals.
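Sketches of both checks on the m1 residuals (the binwidth here is an arbitrary choice for this data):

ggplot(mlb11, aes(x = m1$residuals)) +
  geom_histogram(binwidth = 25)              # histogram of residuals
ggplot(mlb11, aes(sample = m1$residuals)) +
  stat_qq() + stat_qq_line()                 # normal probability plot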
Constant variability: Based on the residuals plot, the spread of the residuals appears roughly constant across the range of at-bats, so this condition seems reasonably met.
Exercise: Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?
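I chose bat_avg. A sketch of the fit and scatterplot that produce the summary below (m2 is an assumed name, reused in a later exercise):

m2 <- lm(runs ~ bat_avg, data = mlb11)
ggplot(mlb11, aes(x = bat_avg, y = runs)) +
  geom_point(color = 'red') +
  geom_smooth(method = 'lm', se = FALSE)
summary(m2)
##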
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
cor(mlb11$runs, mlb11$bat_avg)
## [1] 0.8099859
Answer: Yes, there is a strong positive linear relationship between bat_avg and runs (correlation 0.81).
Exercise: How does this relationship compare to the relationship between runs and at_bats? Use the R\(^2\) values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?

m1 = lm(runs ~ at_bats, data = mlb11)
m3 = lm(runs ~ stolen_bases, data = mlb11)
rSquaredAtBats = summary(m1)$r.squared
rSquaredStBases = summary(m3)$r.squared
rSquaredAtBats
## [1] 0.3728654
rSquaredStBases
## [1] 0.002913993

The R\(^2\) value for at_bats is 0.3728654 and the R\(^2\) value for stolen_bases is 0.002913993. Since the R\(^2\) for at_bats is higher, at_bats is the better predictor of runs of the two.
Exercise: Now investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we've discussed (for the sake of conciseness, only include output for the best variable, not all five).

summary(m2)$r.squared
## [1] 0.6560771

Answer: Among the traditional variables, bat_avg best predicts runs; its R\(^2\) of 0.6560771 is the highest of the five.
Exercise: Now examine the three newer variables. In general, are they more or less effective at predicting runs than the traditional variables? Of all the variables analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?

model1 <- lm(runs ~ new_onbase, data = mlb11)
ggplot(mlb11, aes(x = new_onbase, y = runs)) +
  geom_point(color = "red") +
  geom_line(aes(y = predict(model1, newdata = mlb11)), color = "blue")
summary(model1)
##
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.270 -18.335 3.249 19.520 69.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1118.4 144.5 -7.741 1.97e-08 ***
## new_onbase 5654.3 450.5 12.552 5.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared: 0.8491, Adjusted R-squared: 0.8437
## F-statistic: 157.6 on 1 and 28 DF, p-value: 5.116e-13
new_model1 <- lm(runs ~ new_slug, data = mlb11)
ggplot(mlb11, aes(x = new_slug, y = runs)) +
  geom_point(color = "red") +
  geom_line(aes(y = predict(new_model1, newdata = mlb11)), color = "blue")
summary(new_model1)
##
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.41 -18.66 -0.91 16.29 52.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.80 68.71 -5.47 7.70e-06 ***
## new_slug 2681.33 171.83 15.61 2.42e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared: 0.8969, Adjusted R-squared: 0.8932
## F-statistic: 243.5 on 1 and 28 DF, p-value: 2.42e-15
obs_model <- lm(runs ~ new_obs, data = mlb11)
ggplot(mlb11, aes(x = new_obs, y = runs)) +
  geom_point(color = "red") +
  geom_line(aes(y = predict(obs_model, newdata = mlb11)), color = "blue")
summary(obs_model)
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
Of these three new variables, new_obs has the highest R-squared (0.9349) and the smallest p-value (below 2.2e-16). All three new variables predict runs better than the traditional ones, and the best predictor of runs among all the variables examined is new_obs.
To check the model diagnostics for new_obs, plot the residuals against the predictor:

new_obs_model <- lm(runs ~ new_obs, data = mlb11)
ggplot(mlb11, aes(x = new_obs, y = new_obs_model$residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 3)  # dashed reference line at y = 0

The residuals plot does not show any curvature or other abnormalities; the relationship looks linear.
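The remaining conditions can be checked as before; a sketch reusing the new_obs_model residuals (the binwidth is again an arbitrary choice):

ggplot(mlb11, aes(x = new_obs_model$residuals)) +
  geom_histogram(binwidth = 15)                      # nearly normal residuals
ggplot(mlb11, aes(sample = new_obs_model$residuals)) +
  stat_qq() + stat_qq_line()                         # normal probability plot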