In this lab we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.
download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
In addition to runs scored, the data set contains seven traditionally used variables: at-bats, hits, home runs, batting average, strikeouts, stolen bases, and wins. There are also three newer variables: on-base percentage, slugging percentage, and on-base plus slugging. For the first portion of the analysis we’ll consider the seven traditional variables.
Q. What type of plot would you use to display the relationship between runs and one of the other numerical variables?
A. A scatterplot is the best choice for displaying the relationship between two numerical variables.
Q. Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs?
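A. Runs tend to increase as at_bats increases. The scatter makes it hard to judge from the plot alone, but the relationship looks roughly linear, so a linear model is a reasonable, if imprecise, way to predict runs.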
cor(mlb11$runs, mlb11$at_bats)
## [1] 0.610627
plot(mlb11$runs ~ mlb11$at_bats, main = "Runs and at_bats", ylab = "Runs", xlab = "at_bats")
In describing a distribution of a single variable we must consider characteristics such as center, spread, and shape.
Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.
Just as we used the mean and standard deviation to summarize a single variable, we can summarize the relationship between these two variables by finding the line that best follows their association.
Q. Use the following interactive function to select the line that you think does the best job of going through the cloud of points.
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
-The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, you can rerun the plot command and add the argument showSquares = TRUE.
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
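To make the quantity reported as `Sum of Squares` concrete, here is a minimal sketch that computes it by hand. The vectors `x` and `y` are made-up toy values (not taken from `mlb11`), and `sum_sq` is a hypothetical helper name. The point is only that the line fitted by `lm` gives a sum of squared residuals no larger than a hand-picked line.

```r
# Toy data, invented purely for illustration (not from mlb11)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Hypothetical helper: sum of squared residuals for the line y = b0 + b1 * x
sum_sq <- function(b0, b1, x, y) {
  residuals <- y - (b0 + b1 * x)
  sum(residuals^2)
}

# A hand-picked line versus the least squares line from lm()
ss_guess <- sum_sq(b0 = 0.5, b1 = 1.8, x, y)
fit <- lm(y ~ x)
ss_ls <- sum_sq(coef(fit)[1], coef(fit)[2], x, y)

ss_guess
ss_ls  # never larger than ss_guess
```

Trying other values for `b0` and `b1` mimics clicking different lines in `plot_ss`: no choice should beat `ss_ls`.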
Q. Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?
A. The code above was run a total of 8 times; the smallest sum of squares it generated was 123,721.9. How this compares to neighbors' results remains open.
-We can use the lm function in R to fit the linear model (a.k.a. regression line).
m1 <- lm(runs ~ at_bats, data = mlb11)
m1
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Coefficients:
## (Intercept) at_bats
## -2789.2429 0.6305
# The output of lm is an object that contains all of the information we need about the linear model
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
ŷ = −2789.2429 + 0.6305 × at_bats
The output also reports the R-squared value, which represents the proportion of variability in the response variable that is explained by the explanatory variable.
R-squared = 37.3%: about 37.3% of the variability in runs is explained by at_bats.
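For a regression with a single predictor, R-squared equals the square of the correlation coefficient. The following quick check uses only the correlation reported by `cor()` earlier in this report; the variable names are chosen just for illustration.

```r
# Correlation between runs and at_bats, as reported by cor() earlier
r <- 0.610627

# In simple linear regression, R-squared is the squared correlation
r_squared <- r^2
round(r_squared, 4)  # 0.3729, matching "Multiple R-squared" in summary(m1)
```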
Q. Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?
m2 <- lm(runs ~ homeruns, data = mlb11)
m2
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Coefficients:
## (Intercept) homeruns
## 415.239 1.835
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
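ŷ = 415.239 + 1.835 × homeruns
The slope indicates that each additional home run is associated with about 1.83 more runs scored in a season, on average, so teams that hit more home runs tend to score more runs.
R-squared = 62.66%: about 62.66% of the variability in runs is explained by homeruns.
# Scatterplot with the least squares line laid on top (runs and at_bats)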
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)
This line can be used to predict y at any value of x. When predictions are made for values of x that are beyond the range of the observed data, it is referred to as extrapolation and is not usually recommended. However, predictions made within the range of the data are more reliable. They’re also used to compute the residuals.
Q. If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats?
atbat1<-5578
mr1<-(-2789.2429+0.6305* atbat1)
mr1
## [1] 727.6861
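Q. Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?
A. The closest observation is the Philadelphia Phillies, with 713 runs and 5,579 at-bats.
Compared with that observed value, the estimate of about 727.7 runs is an overestimate by about 14.7 runs, so the residual (observed minus predicted) is about −14.7.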
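The residual can be worked out directly as observed minus predicted. This short sketch uses only the numbers quoted in this report; the variable names are illustrative.

```r
# Values taken from this report
observed  <- 713                         # runs scored by the Phillies
predicted <- -2789.2429 + 0.6305 * 5578  # prediction from the fitted line

# Residual = observed - predicted; a negative value means the line overestimates
residual <- observed - predicted
residual
```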
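To assess whether the linear model is reliable, we need to check three conditions: linearity, nearly normal residuals, and constant variability.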
Linearity: You already checked if the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. at-bats. Recall that any code following a # is intended to be a comment that helps understand the code but is ignored by R.
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
Q. Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?
A. There is no apparent pattern around the dashed line, although the points are somewhat more concentrated on one side of the plot than the other. The lack of a clear pattern suggests that a linear relationship between runs and at_bats is reasonable.
Nearly normal residuals: To check this condition, we can look at a histogram.
hist(m1$residuals)
# normal probability plot of residuals
qqnorm(m1$residuals)
qqline(m1$residuals)
Q. Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?
A. Both the histogram and the normal probability plot suggest that the nearly normal residuals condition is approximately met.
Q. Based on the plot in (1), does the constant variability condition appear to be met?
Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatter plot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?
Formula: ŷ = −642.8 + 5242.2 × bat_avg
Scatter plot:
# quantify the strength of the relationship
cor(mlb11$runs, mlb11$bat_avg)
## [1] 0.8099859
plot_ss(x = mlb11$bat_avg, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -642.8 5242.2
##
## Sum of Squares: 67849.52
plot_ss(x = mlb11$bat_avg, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -642.8 5242.2
##
## Sum of Squares: 67849.52
How does this relationship compare to the relationship between runs and at_bats? Use the R-squared values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?
m3<-lm(runs ~ bat_avg, data = mlb11)
summary(m3)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
Formula: ŷ = −642.8 + 5242.2 × bat_avg
R-squared: for this model, 65.61% of the variability in runs is explained by bat_avg.
at_bats explains only about 37% of the variability in runs, well below the roughly 66% explained by bat_avg; therefore batting average predicts runs better than at_bats.
In fact, batting average predicts runs better than any of the other variables examined above.
Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed (for the sake of conciseness, only include output for the best variable, not all five).
plot(m3$residuals ~ mlb11$bat_avg)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
hist(m3$residuals)
# normal probability plot
qqnorm(m3$residuals)
qqline(m3$residuals) # adds diagonal line to the normal prob plot
A. Based on the histogram and the normal probability plot, the nearly normal residuals condition appears to be met. Further, the constant variability condition appears to be met.
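Comparing several single-predictor models is repetitive, so a small helper can fit each one and collect its R-squared. This is a sketch only: `r2_by_predictor` is a hypothetical helper, and the built-in `mtcars` data set stands in for `mlb11` so that the block runs without the downloaded file.

```r
# Hypothetical helper: fit response ~ predictor for each predictor
# and return the R-squared values, largest first
r2_by_predictor <- function(data, response, predictors) {
  r2 <- sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$r.squared
  })
  sort(r2, decreasing = TRUE)
}

# Stand-in example with the built-in mtcars data (not baseball data)
r2_by_predictor(mtcars, "mpg", c("wt", "hp", "qsec"))
```

With `mlb11` loaded, the analogous call would be something like `r2_by_predictor(mlb11, "runs", c("at_bats", "homeruns", "bat_avg"))`, extended with the remaining traditional variables.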
Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a team’s success.
1. new_slug 2. new_onbase 3. new_obs
Find the strength of the linear relationship.
new_slug
#new_slug
cor(mlb11$runs, mlb11$new_slug)
## [1] 0.9470324
m4<-lm(runs ~ new_slug, data = mlb11)
summary(m4)
##
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.41 -18.66 -0.91 16.29 52.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.80 68.71 -5.47 7.70e-06 ***
## new_slug 2681.33 171.83 15.61 2.42e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared: 0.8969, Adjusted R-squared: 0.8932
## F-statistic: 243.5 on 1 and 28 DF, p-value: 2.42e-15
new_onbase
#new_onbase
cor(mlb11$runs, mlb11$new_onbase)
## [1] 0.9214691
m5<-lm(runs ~ new_onbase , data = mlb11)
summary(m5)
##
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.270 -18.335 3.249 19.520 69.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1118.4 144.5 -7.741 1.97e-08 ***
## new_onbase 5654.3 450.5 12.552 5.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared: 0.8491, Adjusted R-squared: 0.8437
## F-statistic: 157.6 on 1 and 28 DF, p-value: 5.116e-13
new_obs
#new_obs
cor(mlb11$runs, mlb11$new_obs)
## [1] 0.9669163
m6<-lm(runs ~ new_obs, data = mlb11)
summary(m6)
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
1. In general, are they more or less effective at predicting runs than the old variables?
Based on a comparison of the R-squared and correlation values, they are much better predictors of runs than the old variables: their R-squared values range from 0.8491 to 0.9349, versus at most 0.6561 for the traditional variables fitted above.
2. Explain using appropriate graphical and numerical evidence. Of all ten variables we’ve analyzed, which seems to be the best predictor of runs?
The variables that best predict runs are the newer ones, and of those new_obs is the best (R-squared = 0.9349, correlation = 0.967), which makes it a better predictor than any other variable in the data set.
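The comparison can be summarized by ranking the R-squared values of the fitted models. The sketch below simply copies the `Multiple R-squared` figures from the model summaries in this report into a named vector; nothing is refitted.

```r
# Multiple R-squared values copied from the model summaries in this report
r2 <- c(at_bats = 0.3729, homeruns = 0.6266, bat_avg = 0.6561,
        new_slug = 0.8969, new_onbase = 0.8491, new_obs = 0.9349)

# Rank the predictors from strongest to weakest
sort(r2, decreasing = TRUE)

# Name of the predictor with the largest R-squared
names(which.max(r2))  # "new_obs"
```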
3. Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?
Check the model diagnostics for the regression model with the variable you decided was the best predictor for runs.
# Linearity:
plot(m6$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
hist(m6$residuals)
# normal probability plot
qqnorm(m6$residuals)
qqline(m6$residuals) # adds diagonal line to the normal prob plot