load(url("http://www.openintro.org/stat/data/mlb11.RData"))
Here is the text for the first question.
What is the name of the data set you just loaded?
The data set is called “mlb11”.
Here is the text for the third question.
What shape does the relationship between X = at bats and Y = runs need to be in order for us to reasonably use least squares linear regression?
To reasonably use a least squares regression line, the relationship between at bats and runs needs to look at least roughly linear in a scatterplot.
Here is the text for the fourth question.
What type of plot would you use to determine the shape of the relationship between X = at bats and Y = runs? Why?
I believe a scatterplot is the best plot type for showing the shape of the relationship between at bats and runs. It puts the independent variable (at bats) on the x-axis and the dependent variable (runs) on the y-axis, so the reader can see where each team's run count lies in relation to its total at bats.
Here is the plot for the fifth question.
plot(x = mlb11$at_bats, y = mlb11$runs, xlab = "Total At Bats", ylab = "Total Runs")
Here is the text for the sixth question.
Does it seem reasonable to consider a regression line to describe the relationship between X = at bats and Y= runs? Explain.
I would definitely say so. The relationship, while not overwhelmingly strong, appears to be positive, moderately strong, and linear, so a least squares line would be appropriate for modeling the data.
cor(mlb11$at_bats, mlb11$runs)
## [1] 0.610627
Given the correlation value of 0.610627, we are able to see that there is a moderate positive linear relationship between at bats and runs.
Here is the text for the seventh question.
What is the correlation between X = at bats and Y= runs? What does this tell us about the strength of the linear relationship between these two variables? Hint: You need to replace firstvariable and secondvariable with your two variables of interest.
As shown above, the correlation between at bats and runs is 0.610627.
Here is the text for the eighth question.
Describe the relationship between X = at bats and Y= runs. Make sure to comment on all four things listed above. Would you be comfortable telling the client that a least squares linear regression model is a reasonable choice to describe the relationship between at bats and runs?
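The relationship is linear, moderately strong, and positive, and there are three possible outliers in the scatterplot. Given this, I would be comfortable telling the client that a least squares linear regression model is a reasonable choice for describing the relationship between at bats and runs.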
m1 <- lm(runs ~ at_bats, data = mlb11)
m1$coefficients
## (Intercept)     at_bats
## -2789.24289     0.63055
Here is the text for the ninth question.
Write down the LSLR line.
ŷ = -2789.24 + 0.6306x
Here is the text for the tenth question.
Think back to your plot. Does it make sense that the slope coefficient is positive? Why or why not?
It makes sense for the slope of the LSRL to be positive because the scatterplot above shows a moderately strong positive linear relationship: teams with more at bats tend to score more runs. A positive linear trend in a scatterplot translates into a positive slope for the regression line.
plot(x = mlb11$at_bats, y = mlb11$runs, xlab = "At Bats", ylab = "Runs")
abline(m1)
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept)            x
##  -2789.2429       0.6305
##
## Sum of Squares: 123721.9
Here is the text for the eleventh question.
In words, what do the residuals represent? When we fit least-squares regression lines, do we want the residuals to be small or large? Explain.
The residuals represent the difference between the observed number of runs and the number of runs predicted by the LSRL for a team's total at bats (observed minus predicted). When fitting a LSRL, we want the residuals to be as small as possible, because small residuals mean the line's predictions are close to the observed values.
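To illustrate the idea, here is a minimal sketch on made-up toy data (the `x` and `y` values below are not from mlb11, and the helper `sse` is my own name). It computes residuals as observed minus fitted and checks that nudging the slope away from the least squares value makes the sum of squared residuals larger.

```r
# Made-up toy data for illustration only (not from mlb11).
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)

# Residual = observed value minus the value predicted by the line.
res <- y - fitted(fit)

# Sum of squared residuals for any candidate line y = b0 + b1 * x.
sse <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

sse_fit   <- sse(coef(fit)[1], coef(fit)[2])
sse_other <- sse(coef(fit)[1], coef(fit)[2] + 0.1)  # nudge the slope

# The least squares line should have the smaller sum of squares.
sse_fit < sse_other
```

This is the same quantity that plot_ss reports as "Sum of Squares" for the line drawn on the at bats and runs data.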
Here is the text for the twelfth question.
Use the least squares line to answer the following. If a team manager saw the least squares regression line and not the actual data, about how many runs would they predict for a team with 5579 at-bats? Is this estimate an overestimate or an underestimate of the true value?
ŷ = -2789.24 + 0.6306x, so for x = 5579: ŷ = -2789.24 + 0.6306(5579) = 728.88. The LSRL would predict about 729 runs for a team with 5579 at bats. The team in the data with 5579 at bats actually scored 713 runs, so the residual is 713 - 728.88 = -15.88. The negative residual means the line overestimates the true value by about 16 runs.
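As a quick check of this arithmetic, here is a small sketch that hard-codes the rounded coefficients and the observed 713 runs from the answer above. The variable names are my own and are only for illustration.

```r
# Rounded coefficients from the fitted line (m1$coefficients above).
b0 <- -2789.24
b1 <- 0.6306

at_bats_new <- 5579   # at bats for the team in question
observed    <- 713    # actual runs for that team, as stated above

predicted <- b0 + b1 * at_bats_new   # value on the regression line
resid_val <- observed - predicted    # observed minus predicted

round(predicted, 2)   # 728.88
round(resid_val, 2)   # -15.88
```

Using the unrounded coefficients stored in m1 would shift these values slightly, but the sign of the residual, and so the conclusion that the line overestimates, should not change.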
Here is the text for the thirteenth question.
What is the residual for the prediction of runs for a team with 5579 at-bats?
As shown above, the residual for the estimate of runs for a team with 5579 at bats is -15.88 (713-728.88).
cor(mlb11$at_bats, mlb11$runs)^2
## [1] 0.3728654
Here is the text for the fourteenth question.
What percentage of the variability in runs is associated with the number of times a team was at bat?
37.3% of the variation within the dependent variable (runs) is explained by the independent variable (at_bats).
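For a regression with a single predictor, the squared correlation equals the R-squared of the fitted line, which is why squaring cor() works here. A minimal sketch of the conversion to a percentage, hard-coding the correlation reported earlier (the variable names are my own):

```r
r <- 0.610627                   # correlation between at bats and runs, from cor() above

r_squared <- r^2                # proportion of variability in runs
pct <- round(100 * r_squared, 1)

pct   # 37.3
```

With the model object available, summary(m1)$r.squared should report the same proportion.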